Hey Stefan,

That may well be the case. A quick look at the code suggests that the
Avro reader does not override the default behavior for estimating the
approximate row count of files. I believe there is also still a small
issue in the code that handles tiny files. Are the files you are
dealing with at least a few megabytes?

Can you see how many minor fragments are listed under the scan operation in
the query profile? If there are multiple fragments then the scan is
parallelized.
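
If it is easier than reading the profile page by eye, here is a minimal
sketch for counting minor fragments per major fragment from a saved copy
of the profile JSON. The field names (`fragmentProfile`,
`majorFragmentId`, `minorFragmentProfile`) are my assumption about the
profile layout, and the sample data is hand-written for illustration, so
check both against a real profile before relying on it.

```python
from collections import Counter


def minor_fragments_per_major(profile):
    """Return {major fragment id: number of minor fragments}.

    `profile` is a query profile already parsed into a dict. The key
    names used here are assumptions about the profile layout.
    """
    counts = Counter()
    for major in profile.get("fragmentProfile", []):
        major_id = major.get("majorFragmentId")
        counts[major_id] += len(major.get("minorFragmentProfile", []))
    return dict(counts)


# Hypothetical profile fragment, written by hand for illustration only.
sample_profile = {
    "fragmentProfile": [
        {"majorFragmentId": 0,
         "minorFragmentProfile": [{"minorFragmentId": 0}]},
        {"majorFragmentId": 1,
         "minorFragmentProfile": [{"minorFragmentId": 0},
                                  {"minorFragmentId": 1},
                                  {"minorFragmentId": 2}]},
    ]
}

print(minor_fragments_per_major(sample_profile))
```

A count of 1 for the major fragment that contains the scan would suggest
the files are being read by a single minor fragment.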

- Jason

On Mon, Feb 29, 2016 at 1:58 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi Jason,
>
> Is it possible that the Avro plugin does not use any parallelism and that
> all the target files are scanned sequentially by the same process?  (1.5)
>
> - Stefán
>
> On Fri, Feb 26, 2016 at 8:04 PM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Thank you Jason.
> >
> > I do realize that this is an OS project and that everyone is doing their
> > best.
> >
> > There are just a few things I wish I had realized before switching over
> > from JSON to Avro that have caused us a lot of problems and taken a long
> > time.
> >
> > Your work is appreciated and I apologize for letting my frustration get
> > the better of me.
> >
> > - Stefán
> >
> > On Fri, Feb 26, 2016 at 8:00 PM, Jason Altekruse <altekruseja...@gmail.com>
> > wrote:
> >
> >> Stefan,
> >>
> >> I'm sorry that we have not been better about getting back to the issues
> >> you have filed against the Avro reader. We do appreciate all of the effort
> >> you have put into filing thorough bugs and being active in the discussions
> >> on the list. I have responded on the bug you filed on this issue [1] with a
> >> workaround and will be posting a patch shortly with a fix.
> >>
> >> - Jason
> >>
> >> [1] - https://issues.apache.org/jira/browse/DRILL-4441
> >>
> >> On Thu, Feb 25, 2016 at 12:29 PM, Stefán Baxter <
> >> ste...@activitystream.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > This query targets Avro files in the latest 1.5 release:
> >> >
> >> > 0: jdbc:drill:zk=local> select count(*) from
> >> > dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to =
> >> > 'Customer/4-2492847';
> >> > +---------+
> >> > | EXPR$0  |
> >> > +---------+
> >> > | 5788    |
> >> > +---------+
> >> >
> >> > 0: jdbc:drill:zk=local> select count(*) from
> >> > dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to IN
> >> > ('Customer/4-2492847');
> >> > +---------+
> >> > | EXPR$0  |
> >> > +---------+
> >> > | 0       |
> >> > +---------+
> >> >
> >> > It shows that the IN operator does not work with Avro (works with
> >> > Parquet).
> >> >
> >> > This finally tips us over. We have invested hundreds of hours moving all
> >> > streaming/fresh data from JSON to Avro but the Avro part of Drill is
> >> > broken in too many ways to recommend its use to anyone.
> >> >
> >> > Attempts to report Avro errors and shortcomings, like the missing
> >> > support for dirX, have had no results.
> >> >
> >> > I think it would be prudent to warn people on the Drill website that
> >> > the Avro support is experimental, at best.
> >> >
> >> > - Stefán Baxter
> >> >
> >>
> >
> >
>
