Hey Samarth,

It looks like the last PR has been merged already - great!
I just wrote up a review for your first PR, about round robin data types. I haven't had a chance to check out the unknown-complex-types PR yet; apologies. I'm now subscribed to them all, though.

On Fri, May 15, 2020 at 5:03 PM Samarth Jain <samarth.j...@gmail.com> wrote:

> Hi Druid Devs,
>
> I wanted to bring the community's attention to a few PRs that are awaiting
> review and that I believe are worthwhile features and fixes to have in OSS.
>
> Add new round robin strategy for loading segments:
> https://github.com/apache/druid/pull/9603/
>
> This PR adds a new strategy that the Druid coordinator can use when
> determining which segment to load next. The current (and only) strategy is
> to prefer loading the newer segments first. For data ingested via a
> streaming indexing service, it makes sense to prefer loading the newer
> segments on the historicals, as doing so alleviates pressure on the middle
> manager nodes by expediting the segment handoff process. For batch
> ingestion, too, it makes sense to prefer loading newer segments first,
> since chances are users want to be able to query newer data first.
> However, there are certain cases where this approach causes pain. For
> example, if two different datasources are ingested and one has newer data
> than the other, the segments of the second datasource may not get loaded
> for a long time. To make things "fair", the approach added in the PR
> instead picks segments by selecting datasources in a round robin fashion.
> Within each datasource, though, the strategy still makes sure that the
> newer segments are loaded first. We have been running this strategy in our
> clusters for a while now, and it has served our large (order of a few TBs)
> ingest use cases quite well.
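For anyone following along, the selection order described above can be sketched roughly as follows. This is a minimal Python sketch, not the actual coordinator code; the `(datasource, interval_start)` tuple representation and the `round_robin_segments` helper are hypothetical simplifications.

```python
from collections import deque

def round_robin_segments(segments):
    """Yield segments by cycling through datasources round robin.

    Within each datasource, newer segments (larger interval_start)
    come first. `segments` is a list of (datasource, interval_start)
    tuples; this is an illustrative simplification, not Druid's
    internal data model.
    """
    # Group segments per datasource, preserving first-seen order of
    # datasources.
    by_ds = {}
    for ds, start in segments:
        by_ds.setdefault(ds, []).append((ds, start))

    # Each datasource gets its own queue, newest segment first.
    queues = deque(
        deque(sorted(group, key=lambda s: s[1], reverse=True))
        for group in by_ds.values()
    )

    # Rotate through the datasource queues, emitting one segment per
    # turn, until every queue is drained.
    while queues:
        q = queues.popleft()
        yield q.popleft()
        if q:
            queues.append(q)
```

With two datasources A and B, the generator alternates A, B, A, B, ... while still emitting each datasource's newest segments before its older ones.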
> The second PR is for handling unknown complex types:
> https://github.com/apache/druid/pull/9422
>
> Recently, while upgrading our cluster, we ran into an issue where the
> Druid SQL functionality broke because an incompatible change was made in
> an aggregator extension. While we obviously shouldn't be making any
> incompatible changes, it doesn't hurt to guard against them (especially
> for folks building in-house Druid extensions), and in particular to
> prevent them from breaking major functionality like Druid SQL, as
> happened in this case.
>
> The third PR I actually raised today, but it would be good to bring it to
> the community's attention as I believe it addresses a long-standing issue:
> https://github.com/apache/druid/pull/9877
>
> Internally (and I would be surprised if it isn't common out there), we
> have lots of Hive parquet tables that have a timestamp column of type int
> storing the time in the format yyyyMMdd. To ingest such a column as the
> Druid timestamp, one would expect that specifying a date time format like
> "yyyyMMdd" would suffice. Unfortunately, the timestamp parser in Druid
> ignores the format when it sees that the column is numeric and instead
> interprets the value as a timestamp in millis. So 20200521 in yyyyMMdd
> format ends up being interpreted as 20200521 milliseconds, which
> corresponds to the incorrect datetime value of "Thu Jan 01 1970 05:36:40".
>
> Thanks,
> Samarth
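The millis misinterpretation Samarth describes is easy to reproduce. A minimal Python sketch (illustrative only, not Druid's parser code; the variable names are hypothetical):

```python
from datetime import datetime, timezone

# An int column value intended as a date in yyyyMMdd format.
raw = 20200521

# What the user intends: parse the digits as a yyyyMMdd date string.
intended = datetime.strptime(str(raw), "%Y%m%d")
# intended.date() is 2020-05-21

# What happens when the numeric value is treated as epoch millis
# instead: 20200521 ms is only about 5.6 hours past the Unix epoch.
misparsed = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
# misparsed is 1970-01-01 05:36:40 UTC (plus a fraction of a second)
```

So a value that should land in May 2020 ends up in the first hours of January 1, 1970, which matches the "Thu Jan 01 1970 05:36:40" value quoted above.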