Hey Samarth,

It looks like the last PR has been merged already - great!
I just wrote up a review for your first PR, about round robin data types. I haven't had a chance to check out the unknown-complex-types PR yet; apologies. I'm now subscribed to them all, though.

On Fri, May 15, 2020 at 5:03 PM Samarth Jain <samarth.j...@gmail.com> wrote:

> Hi Druid Devs,
>
> I wanted to bring the community's attention to a few PRs that are awaiting
> review and that I believe are worthwhile features and fixes to have in OSS.
>
> Add new round robin strategy for loading segments:
> https://github.com/apache/druid/pull/9603/
>
> This PR adds a new strategy that the Druid coordinator can use when
> determining which segment to load next. The current (and only) strategy is
> to prefer loading the newer segments first. For data ingested via a
> streaming indexing service, it makes sense to prefer loading the newer
> segments on the historicals, as doing so alleviates pressure on the middle
> manager nodes by expediting the segment handoff process. For batch
> ingestion, too, it makes sense to prefer loading newer segments first,
> since chances are users want to be able to query newer data first.
> However, there are certain cases where this approach causes pain. For
> example, if two different datasources are ingested and one has newer data
> than the other, the segments of the second datasource may not get loaded
> for a long time. To make things "fair", the approach added in the PR
> instead picks segments by selecting datasources in a round robin fashion.
> Within each datasource, though, the strategy still makes sure that the
> newer segments are loaded first. We have been running this strategy in our
> clusters for a while now, and it has served our large (order of a few TBs)
> ingest use cases quite well.
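For anyone following along, the selection order described above can be sketched roughly as follows. This is a minimal Python sketch, not the actual coordinator code; the `(datasource, interval_start)` tuple representation and the `round_robin_segments` helper are hypothetical simplifications.

```python
from collections import deque

def round_robin_segments(segments):
    """Yield segments by cycling through datasources round robin.

    Within each datasource, newer segments (larger interval_start)
    come first. `segments` is a list of (datasource, interval_start)
    tuples; this is an illustrative simplification, not Druid's
    internal data model.
    """
    # Group segments per datasource, preserving first-seen order of
    # datasources.
    by_ds = {}
    for ds, start in segments:
        by_ds.setdefault(ds, []).append((ds, start))

    # Each datasource gets its own queue, newest segment first.
    queues = deque(
        deque(sorted(group, key=lambda s: s[1], reverse=True))
        for group in by_ds.values()
    )

    # Rotate through the datasource queues, emitting one segment per
    # turn, until every queue is drained.
    while queues:
        q = queues.popleft()
        yield q.popleft()
        if q:
            queues.append(q)
```

With two datasources A and B, the generator alternates A, B, A, B, ... while still emitting each datasource's newest segments before its older ones.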
> The second PR is for handling unknown complex types:
> https://github.com/apache/druid/pull/9422
>
> Recently, while upgrading our cluster, we ran into an issue where the
> Druid SQL functionality broke because an incompatible change was made in
> an aggregator extension. While we obviously shouldn't be making any
> incompatible changes, it doesn't hurt to guard against them (especially
> for folks building in-house Druid extensions), and in particular to
> prevent them from breaking major functionality like Druid SQL, as
> happened in this case.
>
> The third PR I actually raised today, but it would be good to bring it to
> the community's attention as I believe it addresses a long-standing issue:
> https://github.com/apache/druid/pull/9877
>
> Internally (and I would be surprised if it isn't common out there), we
> have lots of Hive parquet tables that have a timestamp column of type int
> storing the time in the format yyyyMMdd. To ingest such a column as the
> Druid timestamp, one would expect that specifying a date time format like
> "yyyyMMdd" would suffice. Unfortunately, the timestamp parser in Druid
> ignores the format when it sees that the column is numeric and instead
> interprets the value as a timestamp in millis. So 20200521 in yyyyMMdd
> format ends up being interpreted as 20200521 milliseconds, which
> corresponds to the incorrect datetime value of "Thu Jan 01 1970 05:36:40".
>
> Thanks,
> Samarth
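The millis misinterpretation Samarth describes is easy to reproduce. A minimal Python sketch (illustrative only, not Druid's parser code; the variable names are hypothetical):

```python
from datetime import datetime, timezone

# An int column value intended as a date in yyyyMMdd format.
raw = 20200521

# What the user intends: parse the digits as a yyyyMMdd date string.
intended = datetime.strptime(str(raw), "%Y%m%d")
# intended.date() is 2020-05-21

# What happens when the numeric value is treated as epoch millis
# instead: 20200521 ms is only about 5.6 hours past the Unix epoch.
misparsed = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
# misparsed is 1970-01-01 05:36:40 UTC (plus a fraction of a second)
```

So a value that should land in May 2020 ends up in the first hours of January 1, 1970, which matches the "Thu Jan 01 1970 05:36:40" value quoted above.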