Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Ted Dunning
I wonder if there isn't a better place for this discussion?

As you point out, there are many threads and many of the points are rather
contentious technically. That will make them even harder to follow in an
email thread.

We could just use the wiki and format the text in the form of questions
with alternative positions.

Or we could use an open Google document in a similar form.

What's the preference here?




[GitHub] [drill] lgtm-com[bot] commented on pull request #2419: DRILL-8085: EVF V2 support in the "Easy" format plugin

2022-01-03 Thread GitBox


lgtm-com[bot] commented on pull request #2419:
URL: https://github.com/apache/drill/pull/2419#issuecomment-1004533251


   This pull request **fixes 1 alert** when merging 
661a16cc28339ba7e6fee47195301a77d030a583 into 
fa2cb0f4937c0d8e797a675d8d6c13c316e48d4c - [view on 
LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-5b136f7448b97826cf5f30a9248832ea4b7f2f3b)
   
   **fixed alerts:**
   
   * 1 for Result of multiplication cast to wider type


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Paul Rogers
Hi Charles,

The material is rather dense and benefits from the Github formatting. To
preserve it, perhaps we can copy it to a subpage of the Drill 2.0 wiki page.

For now, the link to the discussion is [1]. Since the Wiki is not good for
discussions, let's have that discussion here (if anyone is up to tackling
such a weighty subject.)

Thanks,

- Paul

[1] https://github.com/apache/drill/pull/2412


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Charles Givre
@Paul, 
Do you mind if I copy the contents of your response to DRILL-8088 to this 
thread?   There's a lot of good info there, and I'd hate to see it get lost.
-- C


[GitHub] [drill] paul-rogers edited a comment on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


paul-rogers edited a comment on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817


   All: so I've kicked the hornet's nest with the mention of value vectors and 
Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is 
why Parquet uses it, Druid uses it for segment files, and various DBs use it 
for storage. The question we want to ask is, do those benefits apply to the 
format within the Drill execution engine? I'm here to suggest that columnar has 
no advantage, and many disadvantages, when used as the *internal* format of an 
execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill 
(columnar) and Impala (row-based) are two well-known examples. This has given 
me a unique opportunity to see if all the marketing claims for columnar (which 
still appear in the videos on Drill's website) actually hold up in practice. 
Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start 
with the biggest issues, then work toward the details. So, rather than tinker 
with the details of vector execution, let's look at the fundamental issues. I 
hope this will help us avoid confusing Drill's (and Arrow's) marketing with 
reality.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized 
execution. Of course, a minor myth is that Drill uses such execution (it 
doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a 
large amount of data. Think the dot-product operation for neural networks, or 
data compression, or image transforms, or graphics. In all cases, we apply a 
simple operation (rescale, say) to a large amount of homogeneous data (the 
pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now 
seen enough crufty, complex, messy real-world queries to suggest that, no, SQL 
is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed 
business logic, and that logic is based on messy human rules, not crisp, clean 
mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, 
etc.), lots of function calls (all those cool UDFs which @cgivre has written), 
and so on. Plus, as noted above, SQL deals with NULL values, which must 
short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most 
queries, we could not actually use that code.
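   To make that concrete, here is a small illustrative sketch (Python, not Drill code; both functions are invented for illustration): a clean arithmetic kernel maps onto a tight loop a compiler can vectorize, while a typical business-logic expression with NULLs and CASE-style branches forces row-at-a-time evaluation.

```python
# Illustrative sketch, not Drill code: all names here are invented.

def scale_column(col, factor):
    """A vectorizable kernel: one simple operation over homogeneous data."""
    return [v * factor for v in col]

def messy_expression(sales, taxes, region):
    """Typical SQL-ish logic: NULL handling and CASE-style branches force
    per-row decisions, defeating SIMD-style execution."""
    out = []
    for s, t, r in zip(sales, taxes, region):
        if s is None or t is None:      # SQL NULL short-circuits the row
            out.append(None)
        elif r == 'EU':                 # business rule, e.g. CASE WHEN ...
            out.append(s + t * 1.2)
        else:
            out.append(s + t)
    return out
```

   The first function is the kind of code SIMD hardware loves; the second is what real `SELECT` lists look like, and no amount of vector plumbing makes its branches disappear.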
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that 
vectors are somehow more "friendly" to the CPU L1 cache than a row format. The 
idea is that one can load a vector into the L1 cache, then zip through many 
values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns; they are based on rows. 
Each calculation tends to involve multiple columns: `net_receipts = sales + 
taxes - returns`. Here each calculation touches four vectors, so we need all 
four to be in the CPU cache to benefit.
   
   Second, SQL is row based: the above calculation is just one of perhaps many 
that occur on each row. In the ideal case, the calculations fall into 
independent groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In 
this case, we could load vectors `a`, `b`, `x` into the L1 cache, do the 
calcs, then load `c`, `d`, `e` and `y` into the cache, and so on. Of course, 
Drill doesn't work this way (it does all the calculations for a single row 
before moving to the next), but it could, and it would have to in order to 
benefit from vectorization.
   
   A more typical case is that the same column is used in multiple expressions: 
`SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must 
load the `a` vector into the L1 cache multiple times. (Or, more properly, its 
values would continually be bumped out of the cache, then reloaded.)
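   A toy way to see this access pattern (illustrative Python with invented names, not how Drill actually executes): columnar evaluation makes one pass per expression and re-scans shared columns, while row-based evaluation touches each row exactly once and computes every expression while that row is "hot".

```python
# Toy access-pattern counter for `SELECT a + b AS x, a / c AS y, (a - d) * e AS z`.
# Invented code, for illustration only.

rows = [{'a': 2.0, 'b': 1.0, 'c': 4.0, 'd': 1.0, 'e': 3.0}] * 4

def columnar_value_reads():
    cols = {k: [r[k] for r in rows] for k in 'abcde'}
    reads = {k: 0 for k in cols}
    def col(k):
        reads[k] += len(cols[k])        # one full scan of the vector
        return cols[k]
    x = [a + b for a, b in zip(col('a'), col('b'))]
    y = [a / c for a, c in zip(col('a'), col('c'))]
    z = [(a - d) * e for a, d, e in zip(col('a'), col('d'), col('e'))]
    return reads['a']                   # `a` is scanned once per expression

def row_based_row_loads():
    loads = 0
    for r in rows:
        loads += 1                      # row loaded once; all results computed
        x, y, z = r['a'] + r['b'], r['a'] / r['c'], (r['a'] - r['d']) * r['e']
    return loads
```

   With 4 rows, the columnar path reads the shared column `a` three times over (12 value reads), while the row path loads each row once (4 loads); scaled to real batch sizes, those re-scans are exactly the cache churn described above.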
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone 
bought into the "L1 cache" myth. To get better performance everyone wanted ever 
larger vectors. In the code, you'll see that we started with 1K-row batches, 
then it grew to 4K, then other code would create 64K row batches. It got so bad 
we'd allocate vectors larger than 16MB, which caused memory fragmentation and 
OOM errors. (This is the original reason for what evolved to be "EVF": to 
control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is 
already 16x the L1 cache size. Combine that with real-world expressions and we 
end up with a "working set" of 10s of MB in size: 20x or more the L1 cache 
size. The result is lots of cache misses. (This stuff is really hard to 
measure, would be great 

Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Paul Rogers
Hi All,

Thanks Charles for dredging up that old discussion, your memory is better
than mine! And, thanks Ted for that summary of MapR history. As one of the
"replacement crew" brought in after the original folks left, your
description is consistent with my memory of events. Moreover, as we looked
at what was needed to run Drill in production, an Arrow port was far down
on the list: it would not have solved actual customer problems.

Before we get too excited about Arrow, I think we should have a discussion
about what we want in an internal storage format. I added a long (sorry)
set of comments in that PR that Charles mentioned that tries to debunk the
myths that have grown up around using a columnar format as the internal
representation for a query engine. (Columnar is great for storage.) The
note presents the many issues we've encountered over the years that have
caused us to layer ever more code on top of vectors to solve various
problems. It also highlights a distributed-systems problem which vectors
make far worse.

Arrow is meant to be portable, as Ted discussed, but it is still columnar,
and this is the source of endless problems in an execution engine. So, we
want to ask, what is the optimal format for what Drill actually does? I'm
now of the opinion that Drill might actually benefit more from a
row-based format, similar to what Impala uses. The notes even paint a path
forward.

Ted's description of the goal for Dremio suggests that Arrow might be the
right answer for that market. Drill, however, tends to be used to query
myriad data sources at scale and as a "query integrator" across systems.
This use case has different needs, which may be better served with a
row-based format.

The upshot is that "value vectors vs. Arrow" is the wrong place to start
the discussion. The right place is "what do our many years of experience
with Drill suggest is the most efficient format for how Drill is actually
used?"

Note that Drill could have an Arrow-based API independent of the internal
format. The quote from Charles explains how we could do that.

Thanks,

- Paul

On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning  wrote:

> Christian,
>
> Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack
> with Julia and Python).
>
> I don't think anybody is saying that Drill wouldn't be well set with a
> switch to Arrow or even just interfaces to Arrow. But it is a lot of work
> to make it all happen.
>
>
>
> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix  wrote:
>
> > Hi Charles, Ted, and the others here,
> >
> > it is very interesting to hear the evolution of Drill, Dremio and Arrow
> in
> > that context and thank you Charles for restarting that discussion.
> >
> > I think, and James mentioned this in the PR as well, that Drill could
> > benefit from the continuous progress the Arrow project has made since
> > its separation from Drill. And the Arrow community seems to be large, so
> > I assume the improvements, new features, etc. will keep coming. But I
> > don't have enough experience with Drill internals to have an idea of how
> > massive the refactoring would be.
> >
> > In addition to that, I'm not aware of Arrow's current roadmap and whether
> > it would fit into Drill's roadmap. Maybe Arrow will go in a different
> > direction than Drill, and what should we do then, if Drill is bound to
> > Arrow?
> >
> > On the other hand, Arrow could help Drill reach wider adoption through
> > clients like pyarrow, Arrow Flight, various other programming languages,
> > etc. And (I'm not sure about this) maybe there is a performance benefit
> > if Drill used Arrow to read data from HDFS (for example), used Arrow to
> > work with it during execution, and handed the vectors directly to my
> > Python (for example) program via Arrow Flight, so that I can play around
> > with Pandas, etc.
> >
> > Just some thoughts I have, since I have used Dremio with pyarrow and
> > Drill with ODBC connections.
> >
> > Regards
> > Christian
> >  Original Message 
> > On 3 Jan 2022 at 20:08, Charles Givre wrote:
> >
> >
> > Thanks Ted for the perspective! I had always wished to be a "fly on the
> > wall" in those conversations. :-)
> > -- C
> >
> > > On Jan 3, 2022, at 11:00 AM, Charles Givre  wrote:
> > >
> > > Hello all,
> > > There was a discussion in a recently closed PR [1] between z0ltrix,
> > > James Turton and a few others about integrating Drill with
> > Apache Arrow and wondering why it was never done. I'd like to share my
> > perspective as someone who has been around Drill for some time but also
> as
> > someone who never worked for MapR or Dremio. This just represents my
> > understanding of events as an outsider, and I could be wrong about some
> or
> > all of this. Please forgive (or correct) any inaccuracies.
> > >
> > > When I first learned of Arrow and the idea of integrating Arrow with
> > Drill, the thing that interested me the most was the ability to move data
> > between platforms 

[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-100610


   One last note. Let's assume we wanted to adopt the row-based format (or, the 
myths being strong, we want to adopt Arrow.) How would we go about it?
   
   The "brute force" approach is to rewrite all the operators. Each operator 
deals with low-level vector code, so we'd rewrite that as low-level row (or 
Arrow) code. 
Since we can't really test until all operators are converted, we'd have to do 
the entire conversion in one huge effort. Then, we get to debug. I hope this 
approach is setting off alarm bells: it is high cost and high risk. This is why 
Drill never seriously entertained the change.
   
   But, there is another solution. The scan readers all used to work directly 
with vectors. (Parquet still does.) Because of the memory reasons explained 
above, we converted most of them to use EVF. As a result, we could swap vectors 
for row pages (or Arrow) by changing the low-level code. Readers would be 
blissfully ignorant of such changes because the higher-level abstractions would 
be unchanged.
   
   So, a more sane way to approach a change of in-memory representations is to 
first convert the other operators to use an EVF-like approach. (EVF for writing 
new batches, a "Result Set Loader" for reading existing batches.) Such a change 
can be done gradually, operator-by-operator, and is fully compatible with 
other, non-converted operators. No big bang.
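   A toy sketch of that layering (Python; the class and method names are invented, not Drill's actual EVF API): readers write through an abstract writer, so the physical format underneath can be swapped without touching reader code.

```python
# Invented illustration of the EVF idea: format-agnostic readers.
from abc import ABC, abstractmethod

class RowWriter(ABC):
    @abstractmethod
    def write_row(self, **cols): ...
    @abstractmethod
    def harvest(self): ...

class ColumnarWriter(RowWriter):
    """Value-vector-style storage: one buffer per column."""
    def __init__(self):
        self.vectors = {}
    def write_row(self, **cols):
        for name, value in cols.items():
            self.vectors.setdefault(name, []).append(value)
    def harvest(self):
        return self.vectors

class RowBasedWriter(RowWriter):
    """Row-page-style storage: each row kept contiguous."""
    def __init__(self):
        self.page = []
    def write_row(self, **cols):
        self.page.append(dict(cols))
    def harvest(self):
        return self.page

def csv_reader(lines, writer):
    """A 'reader' that is blissfully ignorant of the physical format."""
    for line in lines:
        a, b = line.split(',')
        writer.write_row(a=int(a), b=int(b))
    return writer.harvest()
```

   The same reader produces vectors or rows depending solely on which writer it is handed; that is the property that makes the gradual, operator-by-operator migration possible.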
   
   Once everything is upgraded to EVF, then we can swap out the in-memory 
format. Maybe try Arrow. Try a row-based format. Run tests. Pick the winner.
   
   This is *not* a trivial exercise, but it is doable over time, if we see 
value and can muster the resources.






[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-100084


   OK, so the above raises the issues we have to consider when thinking about 
vectors (Drill's or Arrow's.) What is the alternative?
   
   Here, I think Impala got it right. Impala uses Parquet (columnar) for 
*storage*, but rows for *internal* processing. Impala is like an Italian sports 
car of old: spends lots of time in the shop, but when it works, it is very fast.
   
   One of the reasons Impala is fast is its row format. First, 
let's describe what "row-based" means. It means that columns appear together, 
as in a C `struct`, with rows packed one after another as in an array of 
`structs`. This means that the data for a given row is contiguous. There is 
only one buffer to size. Classic DB stuff that seems unusual only because we're 
all used to Drill's vector format.
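   The layout can be sketched with Python's struct module (illustrative only): fixed-width columns packed contiguously, rows one after another, like an array of C structs.

```python
# Illustrative row layout: an INT column and a DOUBLE column per row,
# packed back to back with no per-column buffers.
import struct

ROW = struct.Struct('<id')          # little-endian: int32, float64

def pack_rows(rows):
    buf = bytearray()
    for key, value in rows:
        buf += ROW.pack(key, value)     # all columns of a row are adjacent
    return bytes(buf)

def read_row(buf, n):
    # Row n lives at one fixed, contiguous offset.
    return ROW.unpack_from(buf, n * ROW.size)
```

   There is only one buffer to size, and row *n* sits at offset `n * ROW.size`, which is what makes the "classic DB stuff" above so simple.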
   
   Let's look at the same issues above, but from a row-based perspective.
   
   **Expression Execution**: With a row-based model, the CPU can easily load a 
single row into the L1 cache. All our crufty-real-world expression logic works 
on that single row. So, no matter how messy the expressions, from the CPU's 
perspective, all the data is in that single row, which fits nicely into the 
cache.
   
   Rows can be small (a few dozen bytes) or large (maybe tens of KB for long 
VARCHARs). In either case, they are far smaller than the L1 cache. The row is 
loaded. Once we move on to the next row, we'll never visit the previous one, so 
we don't care if the CPU flushes it from the cache.
   
   **Memory Allocation**: Rows reside in buffers (AKA "pages"), typically of a 
fixed size. A reader "pours" data into a row. When the page is full, that last 
record is copied to the next page. Only that one row is copied, not all the 
existing data. So, we eliminate the 1X copy + 1X load problem in Drill. Since 
there is only one page to allocate, memory is simpler. Since pages are of fixed 
size, memory management is simpler as well.
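   A toy version of that page-overflow scheme (illustrative Python with invented names): rows are "poured" into a fixed-size page, and when a row doesn't fit, only that one row goes onto a fresh page.

```python
# Invented illustration of fixed-size row pages with single-row overflow.
PAGE_SIZE = 64  # bytes; deliberately tiny for the example

class PageWriter:
    def __init__(self):
        self.pages = [bytearray()]

    def write_row(self, row: bytes):
        page = self.pages[-1]
        if len(page) + len(row) > PAGE_SIZE:
            self.pages.append(bytearray())   # start a new fixed-size page
            page = self.pages[-1]
        page += row                          # only this one row is written
```

   No existing data is ever copied or re-allocated when a page fills; compare that with growing a vector, where everything already written moves.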
   
   **Exchanges**: Network exchanges are row-based. Rows are self-contained. A 
network sender can send single rows, if that is efficient, or batches of rows. 
In our 100-senders-100-receiver example, we could send rows as soon as they are 
available. The receiver starts working as soon as the first row is available. 
There is no near-deadlock from excessive buffering.
   
   Yes, we would want to buffer rows (into pages) for efficiency. But, in 
extreme cases, we can send small numbers of rows to keep the DAG flowing.
   
   **Clients**: As noted above, row-based clients are the classic solution and 
are simple to write. We could easily support proper clients in Python, Go, Rust 
and anything else if we used a row-based format.
   
   **Conclusion**: We tend to focus on the "value vector vs. Arrow" discussion. 
I'm here to say that that is the wrong question: it buys into myths which have 
hurt Drill for years. The *correct* question is: what is the most efficient 
format for the use cases where Drill wants to excel? The above suggests that, 
rather than Arrow, a better solution is to adopt a row-based internal format.






[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004442320


   The last topic is so complex that no myth has grown up around it, and the 
issue is not at all well understood. Vectors (and batches) are hell for 
distributed system performance. This gets pretty techie, so hang on.
   
   **Vectors are Hell for Exchanges**:  This comes from a real-world case in 
which a large cluster worked no faster than a single thread of execution. We've 
discussed how Drill wants to create large batches (per the myth) to benefit 
from vectorization (which we don't have) and to optimize L1 cache usage (which, 
as we've seen, we don't actually do.) Let's assume "small" batches of 1K rows.
   
   Drill also wants the single format for in-memory and over-the-wire usage. 
This means we want to transfer 1K record batches so that the receiver gets 
batches of the optimal in-memory size.
   
   Now, what happens in a distributed system? Assume you have 100 fragments 
running. (Maybe 10 machines with 10 cores each.) Let's think about one 
fragment, call it "f0.0". Let's assume f0.0 runs a scan and a network sender. 
The scan builds up its 1K batches, because those are "efficient" (according to 
the myths we've discussed.)
   
   What does f0.0's network sender do? Let's assume the target is a hash join. 
So, the sender hashes the keys into 100 buckets. Now, the sender follows 
Drill's rule: send 1K record batches. Since there are 100 targets, the sender 
has to create 100 buffered batches, fill them each to 1K records, then send 
them. To visualize:
   
   `f0.0 (reader --> sender) - - > f1.x (receiver --> hash-join --> ...) ...`
   
   There are 100 f0 fragments: f0.0, ... f0.99, we're looking just at one of 
them: f0.0. The f0 "slice" sends to the "f1" slice that consists of 100 
additional fragments: f1.0, ... f1.99.
   
   So, what happens in our sender? Assuming even hash distribution, we have to 
fill all our 100 outgoing batches before we can send them. This means we have 
to read 100 * 1K = 100K input records before we send the first outgoing batch. 
The result is a huge memory usage (those 100 batches), plus all the vector 
resizes and copies we discussed (as we grow those batches.)
   
   If that were not bad enough, this occurs in all our other 99 f0 fragments: 
we've got 100 * 100 = 10K buffered batches waiting to send. Yikes!
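   Under the stated assumptions (100 senders, 100 receivers, 1K-row batches, even hash distribution), the buffering cost can be worked out directly. The numbers below are the hypothetical ones from the example, not measurements:

```java
/** Back-of-envelope arithmetic for the hash-exchange example above. */
public class ExchangeBufferingMath {
  /** Rows one sender must consume before its first outgoing batch fills,
      assuming the hash spreads rows evenly across all receivers. */
  static long rowsBeforeFirstSend(int receivers, int rowsPerBatch) {
    return (long) receivers * rowsPerBatch;
  }

  /** Partially filled batches buffered cluster-wide while everyone waits. */
  static long bufferedBatchesClusterWide(int senders, int receivers) {
    return (long) senders * receivers;
  }

  public static void main(String[] args) {
    System.out.println(rowsBeforeFirstSend(100, 1000));       // 100000
    System.out.println(bufferedBatchesClusterWide(100, 100)); // 10000
  }
}
```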
   
   Now, what happens in f1? It is sitting around waiting for data. No f0 will 
send until it fills its first outgoing batch for that receiver. If we assume an 
even distribution of data, then the outgoing batches fill at about the same 
rate. None can be sent until one of them reaches the target size, at which point 
most of them are near-full. Once the first hits the 1K mark, off it goes to f1, 
which can finally start processing. This is bad because Drill claims to be 
highly distributed, but what we just described is a serial way of working.
   
   But, it gets worse! Now, assume we're deeper in the DAG, at a sort:
   
   `f4: (receiver --> sort --> sender) - - > f5: (receiver --> merge --> ...)`
   
   The sort sorts its slice of records, and sends it to the merge fragment 
which merges all the partial sorts. Classic distributed systems stuff. Again, 
the f4 (sort) sender waits to fill its outgoing batches, then it sends. The 
merge can't start until it sees batches from all 100 inputs. So, it proceeds at 
the rate of the slowest sort.
   
   Now what happens? The merge uses up one of the 100 input batches, and needs 
another before it can proceed. But, here things get really nasty.
   
   On the f4 side, f4.0, say, sent the first batch to fill. It then sent 
the others as they filled. Meanwhile, the first batch started refilling and 
eventually will need to be sent again. Since the merge can't read a new batch 
until it has used up the previous one, it blocks the f4 sender. As a result, f4 
can't send to *any* other merge.
   
   The downstream fragment throttles the upstream, and vice versa. Not quite 
deadlock, but the entire system becomes serialized: the sort can't ship batches 
until the slowest merge can receive them. The merge can't make progress until 
the slowest sort provides the next batch. Every fragment depends on every 
other. Disaster!
   
   Again, we spent hours trying to figure this out on a customer cluster. We 
could see the effect, but we could not get inside to work out the details. It 
would be great for someone to do the experiments.
   
   **Summary**: The above has debunked the major myths around columnar storage 
within a query engine. Note that **none** of the above changes if we use Arrow. 
We'd do a huge amount of work to switch, and be stuck with the same fundamental 
problems.
   
   Hence, we have to think deeply about this issue, not just buy the snake oil 
that "vectors are good for an execution engine." Good old solid engineering and 
experimentation will tell us what's what.
   



[GitHub] [drill] paul-rogers edited a comment on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


paul-rogers edited a comment on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817


   All: so I've kicked the hornet's nest with the mention of value vectors and 
Arrow. I'm going to put on my flame-proof suit and debunk some myths.
   
   The columnar format is great for storage, for all the usual reasons. This is 
why Parquet uses it, Druid uses it for segment files, and various DBs use it 
for storage. The question we want to ask is, do those benefits apply to the 
format within the Drill execution engine? I'm here to suggest that columnar has 
no advantage, and many disadvantages, when used as the *internal* format of an 
execution engine. "Thems is fighting words", so let's bring it on.
   
   I've had the pleasure of working with several query engines: Drill 
(columnar) and Impala (row-based) are two well-known examples. This has given 
me a unique opportunity to see if all the marketing claims for columnar (which 
still appear in the videos on Drill's website) actually hold up in practice. 
Spoiler: they don't.
   
   This is a PR about optimization. A good rule in optimization is to start 
with the biggest issues, then work toward the details. So, rather than tinker 
with the details of vector execution, let's look at the fundamental issues.
   
   **Myth: Vectorized execution**: The biggest myth is around vectorized 
execution. Of course, a minor myth is that Drill uses such execution (it 
doesn't.) The bigger myth is that, if we invested enough, it could.
   
   Vectorized execution is great when we have a simple operation we apply to a 
large amount of data. Think the dot-product operation for neural networks, or 
data compression, or image transforms, or graphics. In all cases, we apply a 
simple operation (rescale, say) to a large amount of homogeneous data (the 
pixels in an image.)
   
   So, the question is, does typical, real-world SQL fit this pattern? I've now 
seen enough crufty, complex, messy real-world queries to suggest that, no, SQL 
is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed 
business logic, and that logic is based on messy human rules, not crisp, clean 
mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`, 
etc.), lots of function calls (all those cool UDFs which @cgivre has written), 
and so on. Plus, as noted above, SQL deals with NULL values, which must 
short-circuit entire execution paths.
   
   Hence, even if we could vectorize simple operations, we'd find that, in most 
queries, we could not actually use that code.
   
   **Myth: Vectors are CPU Cache Friendly**: The second big myth is that 
vectors are somehow more "friendly" to the CPU L1 cache than a row format. The 
idea is that one can load a vector into the L1 cache, then zip through many 
values in one go. This myth is related to the above one.
   
   First, SQL expressions are not based on columns, they are based on rows. 
Each calculation tends to involve multiple columns: `net_receipts = sales + 
taxes - returns`. Here each calculation touches four vectors, so we need all 
four to be in the CPU cache to benefit.
   
   Second, SQL is row based: the above calculation is just one of perhaps many 
that occur on each row. In the ideal case, the calculations form independent 
groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In this case, we 
could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load 
`c`, `d`, `e` and `y` into the cache, and so on. Of course, Drill doesn't work 
this way (it does all the calculations for a single row before moving to the 
next), but it could, and it would have to in order to benefit from vectorization.
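   The two evaluation orders can be sketched with plain arrays standing in for value vectors (a toy illustration, not Drill's actual classes):

```java
/** Toy contrast of evaluation orders; int[] stands in for value vectors. */
public class EvalOrder {
  // Row-at-a-time: evaluate every expression for row i, then move on.
  static void rowAtATime(int[] a, int[] b, int[] c, int[] d,
                         int[] x, int[] y) {
    for (int i = 0; i < a.length; i++) {
      x[i] = a[i] + b[i];   // touches vectors a, b, x ...
      y[i] = c[i] - d[i];   // ... and c, d, y: six vectors hot at once
    }
  }

  // Column-at-a-time: finish one expression over the whole batch first,
  // so only three vectors are hot at a time.
  static void columnAtATime(int[] a, int[] b, int[] c, int[] d,
                            int[] x, int[] y) {
    for (int i = 0; i < a.length; i++) x[i] = a[i] + b[i];
    for (int i = 0; i < c.length; i++) y[i] = c[i] - d[i];
  }
}
```

   Both orders compute identical results; they differ only in how many vectors compete for the cache at any moment.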
   
   A more typical case is that the same column is used in multiple expressions: 
`SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must 
load the `a` vector into the L1 cache multiple times. (Or, more properly, its 
values would continually be bumped out of the cache, then reloaded.)
   
   **Myth: Bigger Vectors are Better**: Drill went through a phase when everyone 
bought into the "L1 cache" myth. To get better performance everyone wanted ever 
larger vectors. In the code, you'll see that we started with 1K-row batches, 
then it grew to 4K, then other code would create 64K row batches. It got so bad 
we'd allocate vectors larger than 16MB, which caused memory fragmentation and 
OOM errors. (This is the original reason for what evolved to be "EVF": to 
control vector sizes to prevent memory fragmentation - very basic DB stuff.)
   
   Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is 
already 16x the L1 cache size. Combine that with real-world expressions and we 
end up with a "working set" of 10s of MB in size: 20x or more the L1 cache 
size. The result is lots of cache misses. (This stuff is really hard to 
measure, would be great for someone to do the experiments to show this 
happening in practice.)
   
   **Myth: 


[GitHub] [drill] paul-rogers commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004406533


   @luocooong, here are answers to your questions:
   
   **Code gen**: Drill already supports "plain Java" code gen and use of the 
standard compiler without byte code fixup. It is what is used when you set the 
magic flag in each operator, then ask to save code for debugging. In the tests 
I did way back when, the "plain Java" path performed at least as well as the 
Janino/byte-code-fixup path.
   
   If you are not familiar with the "save code for debugging" mechanism, you 
should be if you want to look at optimization. I'd be happy to describe it (or 
hunt down to see if it is already described in the Wiki.)
   
   **Provided schema**: There are three cases to consider.
   
   1. Explicit SELECT: `SELECT a, b, c FROM ...`. In this case, if we have a 
schema, then all operators will use exactly the same code and we can generate 
once.
   2. "Lenient" wildcard: `SELECT * FROM ...`, where the file (such as JSON or 
CSV) may have more columns than described by the "provided schema". In this 
case, each reader is free to add the extra columns. Since each file may be 
different, each reader will produce a different schema, and downstream 
operators must deal with schema-on-read; the code cannot be shared.
   3. "Strict" wildcard: readers include only those columns defined in the 
schema. For this option, we can also generate code once. 
   
   **Refactors**: there is probably a random assortment of tickets filed as 
various people looked into this area. However, this is more than a "change 
this, improve that" kind of thing, it probably needs someone to spend time to 
fully understand what we have today and to do some research to see if there are 
ways to improve the execution model. Hence, this discussion.
   
   **Vectorization**: that is a complex discussion. I'll tackle that in another 
note. 






[GitHub] [drill] cgivre merged pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


cgivre merged pull request #2420:
URL: https://github.com/apache/drill/pull/2420


   






[GitHub] [drill] vvysotskyi commented on pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


vvysotskyi commented on pull request #2420:
URL: https://github.com/apache/drill/pull/2420#issuecomment-1004379219


   @cgivre, I think something like [Parameterized 
tests](https://github.com/junit-team/junit4/wiki/parameterized-tests) could 
help to do that.






[GitHub] [drill] cgivre commented on pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


cgivre commented on pull request #2420:
URL: https://github.com/apache/drill/pull/2420#issuecomment-1004366424


   > @cgivre, yes, it is a good idea. I've noticed that we have a lot of common 
test cases for different databases. It would be good to refactor those tests to 
avoid copying them.
   
   What did you have in mind?  When I wrote the JDBC reader, I added tests for 
Postgres and one other.  I duplicated all the writer tests for a bunch of 
databases.  Is there a more efficient way?






Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Ted Dunning
Christian,

Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack
with Julia and Python).

I don't think anybody is saying that Drill wouldn't be well served by a
switch to Arrow, or even just interfaces to Arrow. But it is a lot of work
to make it all happen.



On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix  wrote:

> Hi Charles, Ted, and the others here,
>
> it is very interesting to hear the evolution of Drill, Dremio and Arrow in
> that context and thank you Charles for restarting that discussion.
>
> I think, and James mentioned this in the PR as well, that Drill could
> benefit from the continued progress the Arrow project has made since its
> separation from Drill. And the Arrow community seems to be large, so I
> assume this goes on and on with improvements, new features, etc., but I
> don't have enough experience in Drill internals to have an idea of how
> much refactoring this would entail.
>
> In addition to that, I'm not aware of the current roadmap of Arrow and
> whether it would fit into Drill's roadmap. Maybe Arrow will go in a
> different direction than Drill, and what should we do if Drill is bound to
> Arrow then?
>
> On the other hand, Arrow could help Drill to wider adoption with clients
> like pyarrow, arrow-flight, various other programming languages, etc., and
> (I'm not sure about this) maybe there's a performance benefit if Drill uses
> Arrow to read data from HDFS (for example), uses Arrow to work with it
> during execution, and hands the vectors directly to my Python (for example)
> program via arrow-flight so that I can play around with Pandas, etc.
>
> Just some thoughts i have since i have used Dremio with pyarrow and Drill
> with odbc connections.
>
> Regards
> Christian
>  Original Message 
> Am 3. Jan. 2022, 20:08, Charles Givre schrieb:
>
>
> Thanks Ted for the perspective! I had always wished to be a "fly on the
> wall" in those conversations. :-)
> -- C
>
> > On Jan 3, 2022, at 11:00 AM, Charles Givre  wrote:
> >
> > Hello all,
> > There was a discussion in a recently closed PR [1] between z0ltrix, James
> Turton, and a few others about integrating Drill with
> Apache Arrow and wondering why it was never done. I'd like to share my
> perspective as someone who has been around Drill for some time but also as
> someone who never worked for MapR or Dremio. This just represents my
> understanding of events as an outsider, and I could be wrong about some or
> all of this. Please forgive (or correct) any inaccuracies.
> >
> > When I first learned of Arrow and the idea of integrating Arrow with
> Drill, the thing that interested me the most was the ability to move data
> between platforms without having to serialize/deserialize the data. From my
> understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration. The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
> >
> > I don't know the internal politics, but this was one of the major points
> of diversion between Dremio and Drill.
> >
> > With that said, there was a renewed discussion on the list [2] where
> Paul Rogers proposed what he described as a "Crude but Effective" approach
> to an Arrow integration.
> >
> > This is in the email link but here was a part of Paul's email:
> >
> >> Charles, just brainstorming a bit, I think the easiest way to start is
> to create a simple, stand-alone server that speaks Arrow to the client, and
> uses the native Drill client to speak to Drill. The native Drill client
> exposes Drill value vectors. One trick would be to convert Drill vectors to
> the Arrow format. I think that data vectors are the same format. Possibly
> offset vectors. I think Arrow went its own way with null-value (Drill's
> is-set) vectors. So, some conversion might be a no-op, others might need to
> rewrite a vector. Good thing, this is purely at the vector level, so would
> be easy to write. The next issue is the one that Parth has long pointed
> out: Drill and Arrow each have their own memory allocators. How could we
> share a data vector between the two? The simplest initial solution is just
> to copy the data from Drill to Arrow. Slow, but transparent to the client.
> >>
> >> A crude first-approximation of the development steps:
> >> 1. Create the client shell server.
> >> 2. Implement the Arrow client protocol. Need some way to accept a query
> and return batches of results.
> >> 3. Forward the query to Drill using the native Drill client.
> >> 4. As a first pass, copy vectors from Drill to Arrow and return them to
> the client.
> >> 5. Then, solve that memory allocator problem to pass data without
> copying.
> >
> > One point that Paul made was that these pieces are fairly discrete and
> could be implemented without refactoring major components of Drill. Of
> course, this could be something 

[GitHub] [drill] vvysotskyi commented on pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


vvysotskyi commented on pull request #2420:
URL: https://github.com/apache/drill/pull/2420#issuecomment-1004328836


   @cgivre, yes, it is a good idea. I've noticed that we have a lot of common 
test cases for different databases. It would be good to refactor those tests to 
avoid copying them.






[GitHub] [drill] vvysotskyi commented on a change in pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


vvysotskyi commented on a change in pull request #2420:
URL: https://github.com/apache/drill/pull/2420#discussion_r777680441



##
File path: 
contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/rules/JdbcLimitRule.java
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.jdbc.rules;
+
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelTrait;
+import org.apache.calcite.rel.RelCollations;
+import org.apache.calcite.sql.dialect.MssqlSqlDialect;
+import org.apache.drill.exec.planner.common.DrillLimitRelBase;
+import org.apache.drill.exec.store.enumerable.plan.DrillJdbcRuleBase;
+import org.apache.drill.exec.store.jdbc.DrillJdbcConvention;
+
+public class JdbcLimitRule extends DrillJdbcRuleBase.DrillJdbcLimitRule {
+  private final DrillJdbcConvention convention;
+
+  public JdbcLimitRule(RelTrait in, DrillJdbcConvention out) {
+super(in, out);
+this.convention = out;
+  }
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+DrillLimitRelBase limit = call.rel(0);
+if (super.matches(call)) {
+  return limit.getOffset() == null
+|| !limit.getTraitSet().contains(RelCollations.EMPTY)
+|| !(convention.getPlugin().getDialect() instanceof MssqlSqlDialect);

Review comment:
   Thanks, added.








Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Z0ltrix
Hi Charles, Ted, and the others here,

it is very interesting to hear the evolution of Drill, Dremio and Arrow in that 
context and thank you Charles for restarting that discussion.

I think, and James mentioned this in the PR as well, that Drill could benefit 
from the continued progress the Arrow project has made since its separation 
from Drill. And the Arrow community seems to be large, so I assume this goes on 
and on with improvements, new features, etc., but I don't have enough 
experience in Drill internals to have an idea of how much refactoring this 
would entail.

In addition to that, I'm not aware of the current roadmap of Arrow and whether 
it would fit into Drill's roadmap. Maybe Arrow will go in a different direction 
than Drill, and what should we do if Drill is bound to Arrow then?

On the other hand, Arrow could help Drill to wider adoption with clients like 
pyarrow, arrow-flight, various other programming languages, etc., and (I'm not 
sure about this) maybe there's a performance benefit if Drill uses Arrow to 
read data from HDFS (for example), uses Arrow to work with it during execution, 
and hands the vectors directly to my Python (for example) program via 
arrow-flight so that I can play around with Pandas, etc.

Just some thoughts I have, since I have used Dremio with pyarrow and Drill with 
odbc connections.

Regards
Christian
 Original Message 
Am 3. Jan. 2022, 20:08, Charles Givre schrieb:

>
>
>
> Thanks Ted for the perspective! I had always wished to be a "fly on the wall" 
> in those conversations. :-)
> -- C
>
> > On Jan 3, 2022, at 11:00 AM, Charles Givre  wrote:
> >
> > Hello all,
> > There was a discussion in a recently closed PR [1] between z0ltrix, James 
> > Turton, and a few others about integrating Drill with 
> > Apache Arrow and wondering why it was never done. I'd like to share my 
> > perspective as someone who has been around Drill for some time but also as 
> > someone who never worked for MapR or Dremio. This just represents my 
> > understanding of events as an outsider, and I could be wrong about some or 
> > all of this. Please forgive (or correct) any inaccuracies.
> >
> > When I first learned of Arrow and the idea of integrating Arrow with Drill, 
> > the thing that interested me the most was the ability to move data between 
> > platforms without having to serialize/deserialize the data. From my 
> > understanding, MapR did some research and didn't find a significant 
> > performance advantage and hence didn't really pursue the integration. The 
> > other side of it was that it would require a significant amount of work to 
> > refactor major parts of Drill.
> >
> > I don't know the internal politics, but this was one of the major points of 
> > diversion between Dremio and Drill.
> >
> > With that said, there was a renewed discussion on the list [2] where Paul 
> > Rogers proposed what he described as a "Crude but Effective" approach to an 
> > Arrow integration.
> >
> > This is in the email link but here was a part of Paul's email:
> >
> >> Charles, just brainstorming a bit, I think the easiest way to start is to 
> >> create a simple, stand-alone server that speaks Arrow to the client, and 
> >> uses the native Drill client to speak to Drill. The native Drill client 
> >> exposes Drill value vectors. One trick would be to convert Drill vectors 
> >> to the Arrow format. I think that data vectors are the same format. 
> >> Possibly offset vectors. I think Arrow went its own way with null-value 
> >> (Drill's is-set) vectors. So, some conversion might be a no-op, others 
> >> might need to rewrite a vector. Good thing, this is purely at the vector 
> >> level, so would be easy to write. The next issue is the one that Parth has 
> >> long pointed out: Drill and Arrow each have their own memory allocators. 
> >> How could we share a data vector between the two? The simplest initial 
> >> solution is just to copy the data from Drill to Arrow. Slow, but 
> >> transparent to the client. A crude first-approximation of the development 
> >> steps:
> >>
> >> 1. Create the client shell server.
> >> 2. Implement the Arrow client protocol. Need some way to accept a query 
> >> and return batches of results.
> >> 3. Forward the query to Drill using the native Drill client.
> >> 4. As a first pass, copy vectors from Drill to Arrow and return them to 
> >> the client.
> >> 5. Then, solve that memory allocator problem to pass data without copying.
> >
> > One point that Paul made was that these pieces are fairly discrete and 
> > could be implemented without refactoring major components of Drill. Of 
> > course, this could be something for Drill 2.0. At a minimum, could we take 
> > the conversation off of the PR and put it in the email list? ;-)
> >
> > Let's discuss... All ideas are welcome!
> >
> > Best,
> > -- C
> >
> >
> > [1]: https://github.com/apache/drill/pull/2412 
> > [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l 

Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Charles Givre
Thanks Ted for the perspective!  I had always wished to be a "fly on the wall" 
in those conversations.  :-)
-- C

> On Jan 3, 2022, at 11:00 AM, Charles Givre  wrote:
> 
> Hello all, 
> There was a discussion in a recently closed PR [1] between z0ltrix, James 
> Turton and a few others about integrating Drill with Apache Arrow and why it 
> was never done.  I'd like to share my perspective 
> as someone who has been around Drill for some time but also as someone who 
> never worked for MapR or Dremio.  This just represents my understanding of 
> events as an outsider, and I could be wrong about some or all of this.   
> Please forgive (or correct) any inaccuracies. 
> 
> When I first learned of Arrow and the idea of integrating Arrow with Drill, 
> the thing that interested me the most was the ability to move data between 
> platforms without having to serialize/deserialize the data.  From my 
> understanding, MapR did some research and didn't find a significant 
> performance advantage and hence didn't really pursue the integration.  The 
> other side of it was that it would require a significant amount of work to 
> refactor major parts of Drill. 
> 
> I don't know the internal politics, but this was one of the major points of 
> diversion between Dremio and Drill.
> 
> With that said, there was a renewed discussion on the list [2] where Paul 
> Rogers proposed what he described as a "Crude but Effective" approach to an 
> Arrow integration.  
> 
> This is in the email link but here was a part of Paul's email:
> 
>> Charles, just brainstorming a bit, I think the easiest way to start is to 
>> create a simple, stand-alone server that speaks Arrow to the client, and 
>> uses the native Drill client to speak to Drill. The native Drill client 
>> exposes Drill value vectors. One trick would be to convert Drill vectors to 
>> the Arrow format. I think that data vectors are the same format. Possibly 
>> offset vectors. I think Arrow went its own way with null-value (Drill's 
>> is-set) vectors. So, some conversion might be a no-op, others might need to 
>> rewrite a vector. Good thing, this is purely at the vector level, so would 
>> be easy to write. The next issue is the one that Parth has long pointed out: 
>> Drill and Arrow each have their own memory allocators. How could we share a 
>> data vector between the two? The simplest initial solution is just to copy 
>> the data from Drill to Arrow. Slow, but transparent to the client. A crude 
>> first-approximation of the development steps:
>> 
>> 1. Create the client shell server. 
>> 2. Implement the Arrow client protocol. Need some way to accept a query and 
>> return batches of results. 
>> 3. Forward the query to Drill using the native Drill client. 
>> 4. As a first pass, copy vectors from Drill to Arrow and return them to the 
>> client. 
>> 5. Then, solve that memory allocator problem to pass data without copying.
> 
> One point that Paul made was that these pieces are fairly discrete and could 
> be implemented without refactoring major components of Drill.  Of course, 
> this could be something for Drill 2.0.  At a minimum, could we take the 
> conversation off of the PR and put it in the email list? ;-)
> 
> Let's discuss... All ideas are welcome!
> 
> Best,
> -- C
> 
> 
> [1]: https://github.com/apache/drill/pull/2412 
> 
> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l 
> 
> 
> 
> 



Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Ted Dunning
As a little bit of perspective from somebody who *was* at MapR at the time,
here are my recollections.

Arrow is pretty much the value vectors from Drill with some lessons learned
and all dependencies removed so that Arrow can be consumed separately from
Drill.

The spinout of the Dremio team didn't happen because of the lack of
integration with Arrow ... it was more the other way around: because a
significant chunk of the Drill team left to form Dremio, the driving force
that could have pushed for integration just wasn't around any more; they
were off doing Dremio and no longer working much on Drill.
The motive for the spinout had mostly to do with the fact that Tomer and
Jacques recognized the opportunity to build a largely in-memory analytical
engine based on zero serialization techniques and also recognized that this
could never be a priority for MapR because it was outside the center of
mass there. Once the Dremio team was out, though, they had a huge need for
interoperability with systems like Spark and Cassandra, and they needed to
not impose all of Drill as a dependency if they wanted these other systems
to take on Arrow.

This history doesn't really impact the merits or methods of integrating
present-day Drill with Arrow, but it is nice to get the story the right way
around.



On Mon, Jan 3, 2022 at 8:00 AM Charles Givre  wrote:

> Hello all,
> There was a discussion in a recently closed PR [1] between z0ltrix, James
> Turton and a few others about integrating Drill with Apache Arrow and why it
> was never done.  I'd like to share my
> perspective as someone who has been around Drill for some time but also as
> someone who never worked for MapR or Dremio.  This just represents my
> understanding of events as an outsider, and I could be wrong about some or
> all of this.   Please forgive (or correct) any inaccuracies.
>
> When I first learned of Arrow and the idea of integrating Arrow with
> Drill, the thing that interested me the most was the ability to move data
> between platforms without having to serialize/deserialize the data.  From
> my understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration.  The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
>
> I don't know the internal politics, but this was one of the major points
> of diversion between Dremio and Drill.
>
> With that said, there was a renewed discussion on the list [2] where Paul
> Rogers proposed what he described as a "Crude but Effective" approach to an
> Arrow integration.
>
> This is in the email link but here was a part of Paul's email:
>
> > Charles, just brainstorming a bit, I think the easiest way to start is
> to create a simple, stand-alone server that speaks Arrow to the client, and
> uses the native Drill client to speak to Drill. The native Drill client
> exposes Drill value vectors. One trick would be to convert Drill vectors to
> the Arrow format. I think that data vectors are the same format. Possibly
> offset vectors. I think Arrow went its own way with null-value (Drill's
> is-set) vectors. So, some conversion might be a no-op, others might need to
> rewrite a vector. Good thing, this is purely at the vector level, so would
> be easy to write. The next issue is the one that Parth has long pointed
> out: Drill and Arrow each have their own memory allocators. How could we
> share a data vector between the two? The simplest initial solution is just
> to copy the data from Drill to Arrow. Slow, but transparent to the client.
> A crude first-approximation of the development steps:
> >
> > 1. Create the client shell server.
> > 2. Implement the Arrow client protocol. Need some way to accept a query
> and return batches of results.
> > 3. Forward the query to Drill using the native Drill client.
> > 4. As a first pass, copy vectors from Drill to Arrow and return them to
> the client.
> > 5. Then, solve that memory allocator problem to pass data without
> copying.
>
> One point that Paul made was that these pieces are fairly discrete and
> could be implemented without refactoring major components of Drill.  Of
> course, this could be something for Drill 2.0.  At a minimum, could we take
> the conversation off of the PR and put it in the email list? ;-)
>
> Let's discuss... All ideas are welcome!
>
> Best,
> -- C
>
>
> [1]: https://github.com/apache/drill/pull/2412 <
> https://github.com/apache/drill/pull/2412>
> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l <
> https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l>
>
>
>
>


[GitHub] [drill] cgivre commented on pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


cgivre commented on pull request #2420:
URL: https://github.com/apache/drill/pull/2420#issuecomment-1004274089


   One other thing, I'd like to add UTs for MS-SQL to the JDBC plugin for both 
reading and writing.  But not in this task ;-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [drill] jnturton commented on a change in pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


jnturton commented on a change in pull request #2420:
URL: https://github.com/apache/drill/pull/2420#discussion_r777624425



##
File path: 
contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/rules/JdbcLimitRule.java
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.jdbc.rules;
+
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelTrait;
+import org.apache.calcite.rel.RelCollations;
+import org.apache.calcite.sql.dialect.MssqlSqlDialect;
+import org.apache.drill.exec.planner.common.DrillLimitRelBase;
+import org.apache.drill.exec.store.enumerable.plan.DrillJdbcRuleBase;
+import org.apache.drill.exec.store.jdbc.DrillJdbcConvention;
+
+public class JdbcLimitRule extends DrillJdbcRuleBase.DrillJdbcLimitRule {
+  private final DrillJdbcConvention convention;
+
+  public JdbcLimitRule(RelTrait in, DrillJdbcConvention out) {
+    super(in, out);
+    this.convention = out;
+  }
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+    DrillLimitRelBase limit = call.rel(0);
+    if (super.matches(call)) {
+      return limit.getOffset() == null
+        || !limit.getTraitSet().contains(RelCollations.EMPTY)
+        || !(convention.getPlugin().getDialect() instanceof MssqlSqlDialect);

Review comment:
   Perhaps some kind of comment here would be helpful to someone who visits 
this file not knowing why MS SQL gets special treatment?
   
   `// LIMIT is translated to TOP for MS SQL, c.f. DRILL-8090`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Charles Givre
Hello all, 
There was a discussion in a recently closed PR [1] between z0ltrix, James Turton 
and a few others about integrating Drill with Apache Arrow and why it was never 
done.  I'd like to share my perspective as 
someone who has been around Drill for some time but also as someone who never 
worked for MapR or Dremio.  This just represents my understanding of events as 
an outsider, and I could be wrong about some or all of this.   Please forgive 
(or correct) any inaccuracies. 

When I first learned of Arrow and the idea of integrating Arrow with Drill, the 
thing that interested me the most was the ability to move data between 
platforms without having to serialize/deserialize the data.  From my 
understanding, MapR did some research and didn't find a significant performance 
advantage and hence didn't really pursue the integration.  The other side of it 
was that it would require a significant amount of work to refactor major parts 
of Drill. 

I don't know the internal politics, but this was one of the major points of 
diversion between Dremio and Drill.

With that said, there was a renewed discussion on the list [2] where Paul 
Rogers proposed what he described as a "Crude but Effective" approach to an 
Arrow integration.  

This is in the email link but here was a part of Paul's email:

> Charles, just brainstorming a bit, I think the easiest way to start is to 
> create a simple, stand-alone server that speaks Arrow to the client, and uses 
> the native Drill client to speak to Drill. The native Drill client exposes 
> Drill value vectors. One trick would be to convert Drill vectors to the Arrow 
> format. I think that data vectors are the same format. Possibly offset 
> vectors. I think Arrow went its own way with null-value (Drill's is-set) 
> vectors. So, some conversion might be a no-op, others might need to rewrite a 
> vector. Good thing, this is purely at the vector level, so would be easy to 
> write. The next issue is the one that Parth has long pointed out: Drill and 
> Arrow each have their own memory allocators. How could we share a data vector 
> between the two? The simplest initial solution is just to copy the data from 
> Drill to Arrow. Slow, but transparent to the client. A crude 
> first-approximation of the development steps:
> 
> 1. Create the client shell server. 
> 2. Implement the Arrow client protocol. Need some way to accept a query and 
> return batches of results. 
> 3. Forward the query to Drill using the native Drill client. 
> 4. As a first pass, copy vectors from Drill to Arrow and return them to the 
> client. 
> 5. Then, solve that memory allocator problem to pass data without copying.
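The vector-format difference Paul mentions (Drill's byte-per-value "is-set" buffer versus Arrow's bit-packed validity bitmap) can be sketched in isolation. This is a standalone illustration using plain arrays, not the real Drill or Arrow vector classes:

```java
public class IsSetToValidity {
  // Drill's nullable vectors keep one is-set byte per value; Arrow packs
  // one validity bit per value (LSB-first within each byte). Converting
  // between the two means rewriting the buffer, so this part of a
  // Drill-to-Arrow bridge cannot be a no-op.
  static byte[] toValidityBitmap(byte[] isSet) {
    byte[] bitmap = new byte[(isSet.length + 7) / 8];
    for (int i = 0; i < isSet.length; i++) {
      if (isSet[i] != 0) {
        bitmap[i / 8] |= (byte) (1 << (i % 8));
      }
    }
    return bitmap;
  }

  public static void main(String[] args) {
    // Values 0 and 2 set, value 1 null -> bits 0 and 2 -> 0b00000101 = 5.
    System.out.println(toValidityBitmap(new byte[]{1, 0, 1})[0]); // 5
  }
}
```

The data buffers themselves, as Paul notes, may need no such rewrite when the layouts already agree.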

One point that Paul made was that these pieces are fairly discrete and could be 
implemented without refactoring major components of Drill.  Of course, this 
could be something for Drill 2.0.  At a minimum, could we take the conversation 
off of the PR and put it in the email list? ;-)

Let's discuss... All ideas are welcome!

Best,
-- C


[1]: https://github.com/apache/drill/pull/2412 

[2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l 






[GitHub] [drill] jnturton commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


jnturton commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004186672


   @luocooong @paul-rogers I've always thought of Drill's code gen as being an 
effort to present a good target for the JVM's auto-vectorisation.  Not that 
this is likely to get the same results as the new SIMD intrinsics in the JVM, 
or a nice way to code.  Is this on the mark?  A reference:
   
   https://cr.openjdk.java.net/~vlivanov/talks/2019_CodeOne_MTE_Vectors.pdf
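As a concrete illustration of the loop shape that discussion is about: HotSpot's auto-vectoriser can typically compile a tight, branch-free counted loop over primitive arrays into SIMD instructions without any intrinsics in the source. This is a generic sketch, not Drill's actual generated code:

```java
import java.util.Arrays;

public class AutoVecTarget {
  // Element-wise addition over primitive arrays: no branches, calls or
  // object indirection in the body, which is the shape HotSpot's
  // superword optimisation can turn into SIMD on supporting hardware.
  static int[] addColumns(int[] a, int[] b) {
    int[] out = new int[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = a[i] + b[i];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] r = addColumns(new int[]{1, 2, 3}, new int[]{4, 5, 6});
    System.out.println(Arrays.toString(r)); // [5, 7, 9]
  }
}
```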


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [drill] luocooong commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


luocooong commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004154855


   @paul-rogers Hello, thanks for the information. I'd like to ask a few 
questions:
   - **Code gen path**: Are you talking about the `Code Generation Workflow`? 
If we are going to use the native Java tools, is there anything we can do here?
   - **Use provided schema**: If we provide the schema at query time, does that 
mean we can also generate code once, as Spark does?
   - **Refactors and changes**: Are there any old reference tickets that were 
used to rewrite and improve the code gen path?
   - **Vectorization**: As I understand it, Drill implements vectorized storage 
(value vectors) but has not implemented vectorized execution over those value 
vectors. Is that correct?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (DRILL-8051) Update the JQuery for Vulnerability issue

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8051.
-
Resolution: Fixed

> Update the JQuery for Vulnerability issue
> -
>
> Key: DRILL-8051
> URL: https://issues.apache.org/jira/browse/DRILL-8051
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: 1.19
>Affects Versions: 1.19.0
>Reporter: Jingchuan Hu
>Assignee: James Turton
>Priority: Major
> Fix For: Future, 1.19.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Apache Drill used jQuery 3.4.1, which has a vulnerability reported as a CVE.
> [https://snyk.io/vuln/npm:jquery]
> The jQuery version needs to be updated from 3.4.1 to 3.6.0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (DRILL-8070) format-excel assumes that rowIterator returns every row

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8070.
-
  Assignee: Charles Givre
Resolution: Fixed

> format-excel assumes that rowIterator returns every row
> ---
>
> Key: DRILL-8070
> URL: https://issues.apache.org/jira/browse/DRILL-8070
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Assignee: Charles Givre
>Priority: Major
>
> In ExcelBatchReader, this code makes the wrong assumption:
> {code:java}
>     for (int i = 1; i < rowNumber; i++) {
>          currentRow = rowIterator.next();
>     } {code}
>  
> There are 2 for loops like this.
> Empty rows will not necessarily be returned by the iterator: rows without 
> populated cells can easily be skipped. Think of the Sheet as a sparse matrix, 
> because that is how it is stored.
>  
>  
>  
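The faulty assumption in DRILL-8070 can be shown with a minimal model of a sparse sheet. Rows are modelled as just their row numbers, and only populated rows appear in the iterator (as with POI, where each returned row still knows its true index via `Row.getRowNum()`); the method names here are illustrative, not Drill's or POI's actual API:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SparseRowSeek {
  // Buggy pattern from the report: assume the n-th call to next() lands
  // on row n. With empty rows skipped, it overshoots (or runs out of
  // rows entirely and throws NoSuchElementException).
  static int buggySeek(Iterator<Integer> rows, int targetRow) {
    int current = -1;
    for (int i = 0; i < targetRow; i++) {
      current = rows.next();
    }
    return current;
  }

  // Safer pattern: advance until the reported row number reaches the
  // target, treating skipped numbers as empty rows.
  static int safeSeek(Iterator<Integer> rows, int targetRow) {
    int current = -1;
    while (rows.hasNext() && current < targetRow) {
      current = rows.next();
    }
    return current;
  }

  public static void main(String[] args) {
    // Rows 0, 1 and 5 are populated; rows 2-4 are empty and never returned.
    List<Integer> sheet = Arrays.asList(0, 1, 5);
    System.out.println(buggySeek(sheet.iterator(), 3)); // 5, not the assumed row 2
    System.out.println(safeSeek(sheet.iterator(), 4));  // 5: first populated row >= 4
  }
}
```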



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (DRILL-8072) Fix NPE in HTTP Post Requests

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8072.
-
Resolution: Fixed

> Fix NPE in HTTP Post Requests
> -
>
> Key: DRILL-8072
> URL: https://issues.apache.org/jira/browse/DRILL-8072
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Other
>Affects Versions: 1.19.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
> Fix For: 1.20.0
>
>
> There was a minor bug in the HTTP Storage Plugin with POST requests.  If the 
> `postBody` configuration parameter is null, the plugin throws an NPE. 
> This PR adds a null check which prevents the NPE.
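A guard of the kind described might look like the following; the names are hypothetical, not the HTTP plugin's actual code:

```java
import java.util.Optional;

public class PostBodyGuard {
  // Illustrative null check: fall back to an empty body when the
  // postBody config parameter is absent, instead of dereferencing null
  // later and throwing an NPE.
  static String resolvePostBody(String configuredPostBody) {
    return Optional.ofNullable(configuredPostBody).orElse("");
  }

  public static void main(String[] args) {
    System.out.println(resolvePostBody(null).length()); // 0, no NPE
  }
}
```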



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [drill] vvysotskyi opened a new pull request #2420: DRILL-8090: LIMIT clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server

2022-01-03 Thread GitBox


vvysotskyi opened a new pull request #2420:
URL: https://github.com/apache/drill/pull/2420


   # [DRILL-8090](https://issues.apache.org/jira/browse/DRILL-8090): LIMIT 
clause is pushed down to an invalid OFFSET-FETCH clause for MS SQL Server
   
   ## Description
   - Updated Calcite fork version to include 
https://github.com/apache/calcite/commit/cc40a48cb8ca16f91bfdc66eaed6151805355d4b,
 so now regular limit can be pushed down to MS SQL as `TOP N` instead of 
`FETCH`.
   - Updated `JdbcLimitRule` and `JdbcSortRule` to prevent pushing down `FETCH` 
with `OFFSET` and without `ORDER BY`. 
   For such a case, some rules at the physical stage will generate a limit on 
top of the scan that includes `FETCH` only and another limit with `FETCH` and 
`OFFSET` above, so the limit will be pushed down.
   - Allowed matching JDBC rules for physical rel nodes.
   - Fixed issue with ClassCastException for Phoenix plugin (issue similar to 
DRILL-7972).
   
   ## Documentation
   NA
   
   ## Testing
   Checked manually.
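   The pushdown condition described above can be restated as a standalone 
predicate. This is only an illustrative restatement of the rule's intent, not 
the actual `JdbcLimitRule` code:

```java
public class MssqlPushdownCheck {
  // A fetch/limit may be pushed down to MS SQL Server unless it carries
  // an OFFSET without an accompanying ORDER BY, because the OFFSET-FETCH
  // clause requires ORDER BY there. Other dialects are unrestricted here.
  static boolean canPushDown(boolean hasOffset, boolean hasOrderBy, boolean isMssql) {
    if (!isMssql) {
      return true;
    }
    return !hasOffset || hasOrderBy;
  }

  public static void main(String[] args) {
    System.out.println(canPushDown(true, false, true));  // false: invalid OFFSET-FETCH
    System.out.println(canPushDown(false, false, true)); // true: plain LIMIT -> TOP
  }
}
```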
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (DRILL-8079) Upgrade logback because of CVE-2021-42550

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8079.
-
Resolution: Fixed

> Upgrade logback because of CVE-2021-42550
> -
>
> Key: DRILL-8079
> URL: https://issues.apache.org/jira/browse/DRILL-8079
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Jingchuan Hu
>Priority: Major
>
> Due to CVE-2021-42550 
> [https://github.com/advisories/GHSA-668q-qrv7-99fm], logback is upgraded from 
> 1.2.3 to 1.2.9.
> Logback 1.2.9 fixed the vulnerability; please refer to: 
> [http://logback.qos.ch/news.html]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (DRILL-8081) Maven Connection timed out

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8081.
-
Resolution: Fixed

> Maven Connection timed out
> --
>
> Key: DRILL-8081
> URL: https://issues.apache.org/jira/browse/DRILL-8081
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
> Fix For: 1.20.0
>
>
> {code:java}
> Error:  Failed to execute goal on project drill-format-excel: Could not 
> resolve dependencies for project 
> org.apache.drill.contrib:drill-format-excel:jar:1.20.0-SNAPSHOT: Failed to 
> collect dependencies at 
> com.github.pjfanning:excel-streaming-reader:jar:3.2.6: Failed to read 
> artifact descriptor for 
> com.github.pjfanning:excel-streaming-reader:jar:3.2.6: Could not transfer 
> artifact com.github.pjfanning:excel-streaming-reader:pom:3.2.6 from/to 
> central (https://repo.maven.apache.org/maven2): transfer failed for 
> https://repo.maven.apache.org/maven2/com/github/pjfanning/excel-streaming-reader/3.2.6/excel-streaming-reader-3.2.6.pom:
>  Connection timed out (Read failed) -> [Help 1] {code}
> We can learn from this note [Maven Connection timed 
> out|https://github.com/actions/virtual-environments/issues/1499]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (DRILL-8066) Cannot convert non-finite floating point literals

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8066.
-
Resolution: Fixed

> Cannot convert non-finite floating point literals
> -
>
> Key: DRILL-8066
> URL: https://issues.apache.org/jira/browse/DRILL-8066
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.19.0
>Reporter: James Turton
>Assignee: James Turton
>Priority: Minor
> Fix For: 1.20.0
>
>
> Drill can process floating point values like -Infinity, +Infinity and NaN but 
> it fails to correctly convert queries containing these values as constants 
> during planning.  E.g. this query produces a NumberFormatException.
>  
> {code:java}
> select cast('-Infinity' as float);{code}
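For reference, Java's own float parser accepts these non-finite spellings directly, which is consistent with the report that the failure was in Drill's constant handling during planning rather than in the underlying floating point support. A small standalone check:

```java
public class NonFiniteLiterals {
  public static void main(String[] args) {
    // Float.parseFloat handles the non-finite spellings without a
    // NumberFormatException.
    float neg = Float.parseFloat("-Infinity");
    System.out.println(neg == Float.NEGATIVE_INFINITY);        // true
    System.out.println(Float.isNaN(Float.parseFloat("NaN")));  // true
  }
}
```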



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (DRILL-8076) Remove unused Vault token BOOT opt

2022-01-03 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton resolved DRILL-8076.
-
Resolution: Fixed

> Remove unused Vault token BOOT opt
> --
>
> Key: DRILL-8076
> URL: https://issues.apache.org/jira/browse/DRILL-8076
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 1.20.0
>Reporter: James Turton
>Assignee: James Turton
>Priority: Trivial
> Fix For: 1.20.0
>
>
> The Vault token config option for the Vault user authenticator is pointless 
> since none of the auth operations require Drill to have its own Vault token.  
> Functionality is unaffected but a pointless option confuses users and adds 
> cruft.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [LAZY VOTE] Delete branches gh-pages and gh-pages-master from apache/drill

2022-01-03 Thread Charles Givre
Hi James, 
I would prefer that we keep all cruft intact in the Drill repo.  In fact, I 
went ahead and created a cruft generator which can add additional cruft to 
areas in which we feel the cruft is insufficient. j/k
Enthusiastic +1 from me. (For removal... not additional cruft) 
-- C

> On Jan 3, 2022, at 6:32 AM, James Turton  wrote:
> 
> Thank you, I found and updated a handful of instances.
> 
> On 2022/01/03 11:05, luoc wrote:
>> James, could you please confirm that there is no link to `gh-pages` directly 
>> in the current document?
>> 
>>> On Jan 3, 2022, at 16:28, James Turton  wrote:
>>> 
>>> It's been about four months since we moved the Drill website source over 
>>> to apache/drill-site.  Things have been working fine and we took the full 
>>> commit history across when we migrated so I propose to delete this cruft 
>>> from apache/drill.
>>> 
>>> Please reply if you object.
>>> 
>>> Thanks
>>> James
> 



[GitHub] [drill] cgivre merged pull request #2417: DRILL-8071: upgrade log4j to 2.17.1

2022-01-03 Thread GitBox


cgivre merged pull request #2417:
URL: https://github.com/apache/drill/pull/2417


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [drill] jnturton edited a comment on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


jnturton edited a comment on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003928988






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: [LAZY VOTE] Delete branches gh-pages and gh-pages-master from apache/drill

2022-01-03 Thread James Turton

Thank you, I found and updated a handful of instances.

On 2022/01/03 11:05, luoc wrote:

James, could you please confirm that there is no link to `gh-pages` directly in 
the current document?


On Jan 3, 2022, at 16:28, James Turton  wrote:

It's been about four months since we moved the Drill website source over to 
apache/drill-site.  Things have been working fine and we took the full commit 
history across when we migrated so I propose to delete this cruft from 
apache/drill.

Please reply if you object.

Thanks
James




Re: Happy new year!

2022-01-03 Thread luoc
Happy New Year 2022!

For the second meetup, I’m going to initiate a quick discussion: Speed Up 
Release
1. Community status
2. Release frequency
3. Contributor Development

> On Jan 3, 2022, at 3:31 PM, James Turton  wrote:
> 
> Hi everyone
> 
> Happy new year to one and all, and here's to all the exciting developments 
> coming our way.
> 
> Firstly: Drill 1.20 has not been forgotten.  We have been holding off while 
> debugging some final issues in DRILL-8061, but the freeze is imminent.
> 
> We've got another community meetup this Friday.  Some folks may of course 
> still be on holiday but at the very least you'll find me on the other end of 
> the line if you dial in.  Please reply here if there are any topics you'd like to 
> have added to the agenda...
> 
> Regards
> James



[GitHub] [drill] luocooong commented on pull request #2417: DRILL-8071: upgrade log4j to 2.17.1

2022-01-03 Thread GitBox


luocooong commented on pull request #2417:
URL: https://github.com/apache/drill/pull/2417#issuecomment-100396


   Before that, there was an issue that the Travis CI was frozen :
   ```
   We are unable to start your build at this time.
   You exceeded the number of users allowed for your plan. Please review your 
plan details and follow the steps to resolution.
   ```
   I contacted their tech support today and they seem to have resolved the 
issue for us.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: [LAZY VOTE] Delete branches gh-pages and gh-pages-master from apache/drill

2022-01-03 Thread luoc


James, could you please confirm that there is no link to `gh-pages` directly in 
the current document?

> On Jan 3, 2022, at 16:28, James Turton  wrote:
> 
> It's been about four months since we moved the Drill website source over to 
> apache/drill-site.  Things have been working fine and we took the full commit 
> history across when we migrated so I propose to delete this cruft from 
> apache/drill.
> 
> Please reply if you object.
> 
> Thanks
> James


[GitHub] [drill] jnturton commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


jnturton commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003948739


   > I'm sure you have already discussed this, but I would like to know why we 
are not migrating to Arrow, and I cannot find any information about this 
decision. As far as I know, Arrow was inspired by Drill, and the Arrow homepage 
still shows the picture with Drill on it, but Drill does not use Arrow. 
https://arrow.apache.org/overview/
   > Is there any official statement from the project about Arrow 
support/migration?
   
   @Z0ltrix there isn't an official statement that I know of.  It's too big a 
question to answer in a comment thread, and a good topic for a community meetup 
with some senior devs present.  I believe that to some extent Drill's vector 
engine has developed in its own direction since Arrow arrived, so the best 
route forward for Drill is now not entirely obvious and needs some thought.  
Significant pros for Arrow are that it is maintained externally and that its 
performance is, I believe, very good.






[GitHub] [drill] Z0ltrix commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


Z0ltrix commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003945110


   > > Drill was designed to allow vector operations (hence Value Vectors), but 
the code was never written. In part because there are no CPU vector 
instructions that work with SQL nullable data. Arrow is supposed to have 
figured out solutions (Gandiva, is it?) which, perhaps we could consider (but 
probably only for non-nullable data.)
   > 
   > Hi @paul-rogers, I think that what Arrow does for computations over 
nullable data is to store an external null mask and compute results for every 
record, including the null ones, where the value vector contains either rubbish 
or some sentinel value. In a second pass, a null mask is computed for the 
result. This wastes arithmetic operations on null values, but in practice 
that's better than a branch for every value. Quite possibly even for pretty 
sparse vectors.
   
   I'm sure you have already discussed this, but I would like to know why we 
are not migrating to Arrow, and I cannot find any information about this 
decision. As far as I know, Arrow was inspired by Drill, and the Arrow homepage 
still shows the picture with Drill on it, but Drill does not use Arrow. 
https://arrow.apache.org/overview/ 
   Is there any official statement from the project about Arrow 
support/migration?






[LAZY VOTE] Delete branches gh-pages and gh-pages-master from apache/drill

2022-01-03 Thread James Turton
It's been about four months since we moved the Drill website source over 
to apache/drill-site.  Things have been working fine and we took the 
full commit history across when we migrated so I propose to delete this 
cruft from apache/drill.


Please reply if you object.

Thanks
James


[GitHub] [drill] jnturton commented on pull request #2412: DRILL-8088: Improve expression evaluation performance

2022-01-03 Thread GitBox


jnturton commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003928988


   > Drill was designed to allow vector operations (hence Value Vectors), but 
the code was never written. In part because there are no CPU vector 
instructions that work with SQL nullable data. Arrow is supposed to have 
figured out solutions (Gandiva, is it?) which, perhaps we could consider (but 
probably only for non-nullable data.)
   
   Hi @paul-rogers, I think that what Arrow does for computations over nullable 
data is to store an external null mask and compute results for every record, 
including the null ones, where the value vector contains either rubbish or some 
sentinel value.  In a second pass, a null mask is computed for the result.  
This wastes arithmetic operations on null values, but in practice that's 
better than a branch for every value.  Quite possibly even for pretty 
sparse vectors.
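
   The masked, branch-free evaluation described above can be sketched in plain 
Java. This is a hypothetical illustration, not actual Arrow or Drill code; the 
class and method names (`NullMaskSketch`, `addMasked`) and the use of a `long` 
as a validity bitmap are invented for this example:

   ```java
   // Sketch: evaluating x + y over nullable vectors without branching.
   // Each bit of the validity masks marks whether the slot holds a real value.
   public class NullMaskSketch {
       // Compute every sum, including garbage in null slots, then AND the
       // validity masks; the arithmetic loop has no per-value branch.
       static long addMasked(int[] x, int[] y, long xValid, long yValid,
                             int[] out) {
           for (int i = 0; i < out.length; i++) {
               out[i] = x[i] + y[i];   // wasted work on null slots is accepted
           }
           return xValid & yValid;     // null wherever either input was null
       }

       public static void main(String[] args) {
           int[] x = {1, 2, 999};      // 999 is rubbish under a cleared bit
           int[] y = {10, 20, 30};
           long xValid = 0b011;        // slot 2 of x is null
           long yValid = 0b111;        // y is fully valid
           int[] out = new int[3];
           long outValid = addMasked(x, y, xValid, yValid, out);
           System.out.println(outValid);   // prints 3: slots 0 and 1 valid
       }
   }
   ```

   The point of the trade-off: the inner loop stays vectorizable because it 
never inspects the mask, and the second pass over the bitmaps is cheap.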

