#general
@leeon2013: @leeon2013 has joined the channel
@s.azimigehraz: @s.azimigehraz has joined the channel
@bharath.sbk: @bharath.sbk has joined the channel
@huzongxing0826: @huzongxing0826 has joined the channel
@hemanga.borah: @hemanga.borah has joined the channel
@neilteng233: Hey, I have a query like this; will the star-tree index recognize the range filters? (all dates are truncated to day granularity) ```SELECT approx_distinct(id) AS "count" FROM table WHERE start_date <= current_date() AND end_date >= current_date()``` BTW, what about `WHERE cat IN ('a', 'b', 'c')`?
@mayanks: Yes, will update the doc
@mayanks:
@karinwolok1: Starting in 15 minutes! :slightly_smiling_face:
@gqian3: Hi, is there a default length limit for string dimension fields in Pinot? We are seeing partial strings returned from queries.
@mayanks: Yes, I think it defaults to 512 bytes. But you can override it in the schema.
@mayanks: Will update the docs
@mayanks: Something like: ``` { "dataType": "STRING", "maxLength": 1000, "name": "textDim1" },```
@gqian3: Got it, thank you.
@gqian3: Can we update this maxLength on the live data schema, or do we have to recreate the table?
@mayanks: If you want to fix data that has already been pushed, you will have to backfill.
@mayanks:
@jiangok2006: @jiangok2006 has joined the channel
@jackie.jxt: Correction to the forward index reader optimization availability in the meetup talk: it is available in 0.7.1, not 0.6.0
@mayanks: @keweishang ^^
@keweishang: Thanks for the update
@egalpin: @egalpin has joined the channel
@evan.galpin: @evan.galpin has joined the channel
@evan.galpin: hi folks :slightly_smiling_face: Good to be here! I’d really love to learn more about the capabilities of the star-tree index. In particular, I’m curious to know how it might enable ingesting raw data and creating “materialized views” for specific use cases later once they are known. This might be considered an anti-pattern for Pinot, and if so that would be good to know too :+1:
@mayanks: Currently, the star-tree index is created based on user configs.
@kennybastani: Hi Evan! It's definitely not an anti-pattern. You've come to the right place. Creating materialized views by transforming raw data sources can be done using a variety of techniques during ingestion. Specifically, transform functions can mutate your raw data using Groovy scripts. We have a variety of other transform capabilities worth checking out in the docs. For star-tree index, that's a good way to speed up your query responses once you've figured out what your materialized view is going to look like.
@kennybastani: What does your domain data look like? Are you looking to do metric aggregation or more focused on building query models with multiple dimensions?
@evan.galpin: @kennybastani that’s interesting to know RE Groovy scripts. Thinking about it from a traditional RDBMS perspective and using a classic “blog posts” example, would it make sense to ingest `posts` , `comments`, `likes` etc as distinct data sets, then later make use of Groovy scripts to effectively join those data sets at ingestion time (rather than search time)?
@kennybastani: There are multiple strategies here. If you're streaming your data to a real-time table using Kafka, you have two potential approaches. You can implement your messaging at the application level, which would send your domain objects as payloads to three separate topics. At that point, you will need to use something like Flink to join those three streams together into a single materialized view (rather simple in practice). The joined materialized view is then sent to a view topic that is ingested into Pinot and ready for query.
@kennybastani: At the point where your materialized view gets ingested, you can do more transform functions and pre-aggregations as you stream in. This can be helpful if you need to mutate your materialized view for different consumers or dashboards (multiple tables but same joined view from RDBMS).
@kennybastani: There's also an emerging practice that is much preferable when it comes to implementation and maintenance down the road. You can use CDC to stream out data changes at the database level on a per-table basis. So, as your `posts`, `comments`, and `likes` sit in different tables, whenever rows are updated, deleted, or created, a Kafka event is sent per table to a respective topic. Then the rest is as I said before with Flink.
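A minimal Flink SQL sketch of that join step, assuming the three CDC topics are already registered as Flink tables; all table and column names here are hypothetical, not from this thread:
```sql
-- Join the three CDC streams into one denormalized "post_activity" view,
-- which is written to a view topic and then ingested into Pinot.
CREATE VIEW post_activity AS
SELECT
  p.post_id,
  p.author,
  p.created_at,
  COUNT(DISTINCT c.comment_id) AS comment_count,
  COUNT(DISTINCT l.like_id)    AS like_count
FROM posts p
LEFT JOIN comments c ON c.post_id = p.post_id
LEFT JOIN likes    l ON l.post_id = p.post_id
GROUP BY p.post_id, p.author, p.created_at;
```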
@kennybastani: Does that explanation help?
@kennybastani: (Also, with upsert, you can make sure that only the most current version of a domain object is made available for query using SQL on Pinot)
@evan.galpin: ok ya, that makes sense using an ETL platform to do the work of joining data sets. I suppose my concern is more in the area of product evolution and the maintenance that would go along with it. For example with an RDBMS, maybe a new table is added later on resulting in a new dimension for `posts`. To then start making use of the new dimension, the new table can be joined; it might not be very efficient but it can start answering questions about the data right away. And testing/local development can be done relatively cheaply by inserting data and joining at query time. What does the developer workflow look like to support the same kind of feature evolution in Pinot? It seems complex to mimic an ETL pipeline for local development, for example
@kennybastani: Just double checking with Mayank to make sure I answer this right. One sec.
@kennybastani: Example:
• `fooTable` has columns `a, b, c` in the Pinot schema configuration, as well as `primaryKey`
• Upsert is enabled and partitioned on the `primaryKey`
• The table is real-time and has been populated with 1,000 records
• Now I change the Pinot schema to `a, b, c, d`
• The Kafka payload has been modified to stream in the new column for `d`
• To make queries return correctly after making this change in Pinot, you need to issue a `reload` on the segments of `fooTable`
• This will populate the `d` column with the value null for the 1,000 existing rows
• Because upsert is enabled, when you populate the `d` column in your RDBMS, the old 1,000 rows will be updated with the current version of the `d` value
@kennybastani: So, with that workflow in-mind. Your total work is to update the schema of your RDBMS table that is configured to use CDC to stream record updates to Kafka. A simple modification of the Pinot schema configuration adds the new field from the database. To operationalize the change, you simply reload the segments on your table (a simple and safe command). Then you're off to the races.
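A quick, hypothetical sanity check after the reload, reusing the `fooTable`, `primaryKey`, and `d` names from the example above: older rows show the default/null value for `d` until they are upserted, while newer rows should carry real values.
```sql
-- Eyeball a few rows to confirm the new column is visible after the reload.
SELECT primaryKey, d
FROM fooTable
LIMIT 10
```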
@kennybastani: Does that make sense?
@jiangok2006: I am new to Pinot. From what I read online, people seem to choose Pinot over Druid because it has better performance.
@ken: Hi @jiangok2006 - it would help a lot to include some details of your use case. E.g. batch vs. real-time vs mixed, data volume & velocity, how Pinot will be used (e.g. backend for dashboard, or something else), etc. More context will mean much better answers.
@jiangok2006: Thanks @ken. Since Druid and Pinot are not good at joins (my info may be out of date), it might be good to use Snowflake as the full-fledged data warehouse and tolerate longer query response times. I plan to use Pinot/Druid for streaming. The data coming from Kafka is ingested into Pinot/Druid so that people can directly query/visualize the streaming data with sub-second delay. Data volume is around 10K messages per second, and each message is around 2K bytes. Hope this clarifies.
@dlavoie: Pinot will shine as you ramp up to high QPS. Its scalability model offers better linearity as you reach high ingestion and QPS rates.
@ken: Typically you’d use Presto on top of Pinot to support joins, or denormalize (flatten) data to remove the need to join
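A hedged sketch of what a join through Presto on top of Pinot can look like; the catalog/schema naming and the table and column names below are illustrative assumptions, not from this thread:
```sql
-- Presto query joining two Pinot-backed tables: Pinot serves the scans,
-- Presto performs the join and the final aggregation.
SELECT p.post_id, p.author, COUNT(c.comment_id) AS comments
FROM pinot.default.posts p
JOIN pinot.default.comments c ON c.post_id = p.post_id
GROUP BY p.post_id, p.author
ORDER BY comments DESC
LIMIT 10;
```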
@jiangok2006: Thanks guys. This is really helpful.
@g.kishore:
@g.kishore: This covers some of the performance difference
@g.kishore: In terms of features, Pinot has powerful indexing techniques that help you achieve low latency at high throughput:
• Inverted index (most systems have this)
• Sorted index (similar to a B-tree)
• Range index
• Text index
• JSON index
• Geospatial index
• StarTree index
@neilteng233: I am wondering, if my time dimension is in millisecond granularity, how will it be used in the star-tree? Should I truncate it to day or week first? P.S. I see that the star-tree will automatically include dictionary-encoded Time/DateTime columns in the dimensionsSplitOrder property.
@mayanks: If you have a different millisecond value for each row, then you are preventing pre-aggregation. At this point, you need to ask whether the application really needs to slice/dice at the millis level. If not, then you can either use a different time unit (say days), or you can snap the millis for the entire day to the beginning of the day or some fixed value.
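One way to do that snapping, shown here at query time with hypothetical table and column names (the same expression can also be applied as an ingestion transform so the stored column is day-aligned and pre-aggregation can kick in):
```sql
-- Truncate a millisecond timestamp to the start of its day, then aggregate.
SELECT DATETRUNC('DAY', event_time_millis) AS event_day,
       DISTINCTCOUNTHLL(id) AS approx_ids
FROM myTable
GROUP BY DATETRUNC('DAY', event_time_millis)
```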
@neilteng233: I understand. Because I see this in the doc, I am wondering: will enableDefaultStarTree append millisecond columns to the dimensions?
@neilteng233: All dictionary-encoded Time/DateTime columns will be appended to the _dimensionsSplitOrder_ following the dimensions, sorted by their cardinality in descending order. Here we assume that time columns will be included in most queries as the range filter column and/or the group by column, so for better performance, we always include them as the last elements in the _dimensionsSplitOrder_.
#random
@leeon2013: @leeon2013 has joined the channel
@s.azimigehraz: @s.azimigehraz has joined the channel
@bharath.sbk: @bharath.sbk has joined the channel
@huzongxing0826: @huzongxing0826 has joined the channel
@hemanga.borah: @hemanga.borah has joined the channel
@jiangok2006: @jiangok2006 has joined the channel
@egalpin: @egalpin has joined the channel
@evan.galpin: @evan.galpin has joined the channel
#troubleshooting
@leeon2013: @leeon2013 has joined the channel
@s.azimigehraz: @s.azimigehraz has joined the channel
@bharath.sbk: @bharath.sbk has joined the channel
@huzongxing0826: @huzongxing0826 has joined the channel
@hemanga.borah: @hemanga.borah has joined the channel
@jiangok2006: @jiangok2006 has joined the channel
@egalpin: @egalpin has joined the channel
@evan.galpin: @evan.galpin has joined the channel
#pinot-dev
@ithinkthereforeicode: @ithinkthereforeicode has joined the channel
@dougdeu: @dougdeu has joined the channel
#getting-started
@huzongxing0826: @huzongxing0826 has joined the channel
@dougdeu: @dougdeu has joined the channel
#flink-pinot-connector
@huzongxing0826: @huzongxing0826 has joined the channel
@huzongxing0826: @huzongxing0826 has left the channel
