Apache Pinot Daily Email Digest (2022-04-20)

Pinot Slack Email Digest Wed, 20 Apr 2022 19:00:41 -0700

#general

@abhijeet.singh: @abhijeet.singh has joined the channel
@m.akash2209: @m.akash2209 has joined the channel
@yahya.zuberi: @yahya.zuberi has joined the channel
@harish.bohara: I have a basic question on Pinot search (I am sure Pinot will have it - Just want to know internals of how it does it) - suppose I have thousands of segment files over long time. How does a query avoids looking at all these files to give a query result- e.g. if i query for a data for time range. Does Pinot know which segment files to read to fulfil this query. Any detailed doc will help which can explain this..
@kharekartik: @richard892
@francois: I will give you what I’ve understood :wink: Segment can be seen as partition and each segment have his own indexes to provide fast and reliable response. In addition to that segment pruning can be added to pre-filters witch segment to query based on a query filters predicates you time rage for exemple. Pinot expert will give you more detail from the inside of course :slightly_smiling_face: Still a newbie
@richard892: there's lots to say about this topic but the keyword to search for is "segment pruning" - best to start here
@parthiv.shah.777: @parthiv.shah.777 has joined the channel
@yahya.zuberi: Hey everyone, So we are exploring to use GRAPHQL on pinot. Any experiences, recommendations and challenges that anyone have please share that can make our POC helpful. Regards
@g.kishore: Are you planning to use any tool/library or writing custom handler to convert graphql to Pinot sql query
@yahya.zuberi: We were planning to do it manually are there any tools/library available @g.kishore
@g.kishore: Not that we are aware of.. if you plan to do it in open source, we will be happy to guide/review.
@saumya2700: hi All, I have a question about what is the best way to create segments, like I have realtime table which is ingesting data from kafka topic and topic has deviceid as key, should we create segments as per device id or it should be based on default time based segments, our queries mostly have searches for device id and time range. If we create segments as per device Id will the same device id data go in same segment and if it is like that query will be faster and will it look only for segments which has these deviceIds, how will it work.
@g.kishore: Pinot supports both time and space partitioning
@g.kishore: In most cases, it’s better to partition by time first (day) and then space(deviceid)
@saumya2700: any sample , or document where both are used . I am struggling how to do it for both what config is required for that.
@g.kishore:
@g.kishore: Time partition happens automatically.. it segment metadata has start time and end time which get as used during pruning
@iladro: @iladro has joined the channel
@tim58jsy: @tim58jsy has joined the channel

#random

#troubleshooting

@abhijeet.singh: @abhijeet.singh has joined the channel
@m.akash2209: @m.akash2209 has joined the channel
@yahya.zuberi: @yahya.zuberi has joined the channel
@parthiv.shah.777: @parthiv.shah.777 has joined the channel
@yeongjukang: Hello folks, I have some questions about memory consumption of server instances. • How much RAM will be taken(roughly estimated) by a segment if I ingest 687MB csv, which generates 109MB segment gzipped tar file? • I have 5 server nodes and each has 4272/3884/4438/3493/3661 segments. They took 27/8/6/3/18 Gi RAM each. What makes them different from each other? kind of raw data cache? Thanks in advance.
@richard892: hi, the first question is impossible to answer, it depends on the data type, the cardinality of the columns, the compression configuration, what indexes you have and so on
@richard892: as for the second question, this is because of partitioning, what column are you partitioning on, and is there skew in the distribution of number of records or of the size of the records?
@yeongjukang: @richard892 Thanks for reply. I didn't specify partition option for tables. About distribution, records seem to be distributed appropriately.
@yeongjukang: About size consumption, could you give me a case you know or a public example?
@yeongjukang: Additionally, I am running servers on k8s cluster. After deleting 27Gi RAM usage pod, it took only 13Gi RAM. Could it be another clue for this?
@diogo.baeder: Hey guys, sorry to ask something about Trino here, but it's related to Pinot and unfortunately I couldn't get help from anyone at their community. I'm doing an experiment with Trino and Pinot, and I noticed that a query I do to different tables in Pinot ends up with Trino actually querying all of the existing segments in Pinot, completely bypassing any partitioning I defined for my tables. Is this expected? Has somebody here ever experienced this as well, but solved it? I'm thinking about using dynamic tables in Trino to work around that issue, but it just feels dirty to have to do that...
@g.kishore: whats the query? @elon.azoulay can provide more info.
@diogo.baeder: Sorry for the delay; Here's an example query I run, which involves querying inside a specific period of time, but Trino bypasses that and instead queries all segments regardless of the dates: ```select distinct weight from weights inner join brands_metrics on weights.pmxid = brands_metrics.pmxid inner join filters on weights.pmxid = filters.pmxid where brand_id = 1000226 and filters.name = 'gender' and contains(filters."values", 2) and weights.date > 20130501 and weights.date < 20130515 and brands_metrics.date > 20130501 and brands_metrics.date < 20130515 and filters.date > 20130501 and filters.date < 20130515``` here, `brands_metrics`, `weights` and `filters` are all tables that contain a `date` column set as the time column upon table creation.
@iladro: @iladro has joined the channel
@tim58jsy: @tim58jsy has joined the channel
@abhijeet.kushe: I am using Pinot 0.9.1 I wanted to know how order by works The below query returns a response ```select taskName, taskResult, distinctcount(personId) from events where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df' group by taskName, taskResult order by mode(createdOn) asc limit 10000 ``` but this throws an exception ```select taskName, taskResult, distinctcount(personId) from events where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df' group by taskName, taskResult order by createdOn asc limit 10000 ``` ```[ { "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:230)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:120)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)", "errorCode": 200 } ]``` `mode(createdOn) asc` is the difference what makes it work ..is this a bug ?
@npawar: you’d have to explicitly add createdOn into the group by clause. Otherwise if you group as (taskName, taskResult) -> (distinctCount(personId)) what is the created on that should we considered? there can be several in each group rt?
@abhijeet.kushe: So the reason why I need to add order by to the query is so that `taskName,taskResult ,distinctcount` will be sorted by createdOn in other words the first taskName and taskResult will be displayed first in the output …I want a consistent or a guaranteed sort order either ascending or descending. I don’t want to group by each createdOn timestamp
@npawar: in that case, can you put max(createdOn) in the selection, and order by that?
@npawar: ```select taskName, taskResult, distinctcount(personId),max(createdOn) from events where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df' group by taskName, taskResult order by max(createdOn) desc limit 10000 ```
@abhijeet.kushe: @npawar thanks for your quick response.Above query works so does the below max(createdOn) needed in select clause ? ```select taskName, taskResult, distinctcount(personId) from events where accountId = 1100609261882 AND workId = '40d9652e-c543-4bd2-aa4d-a11c7b23a6df' group by taskName, taskResult order by max(createdOn) desc limit 10000 ```
@npawar: ah, yes, you dont need it in projection
@abhijeet.kushe: Awesome thanks

#getting-started

@abhijeet.singh: @abhijeet.singh has joined the channel
@m.akash2209: @m.akash2209 has joined the channel
@yahya.zuberi: @yahya.zuberi has joined the channel
@yahya.zuberi: Hello, need to connect pinot with our java/kotlin spring boot application 1. Do we have pooling options while using JDBC
@kharekartik: Hi, No we don't support pooling in our JDBC client currently. You can create a ticket for it on Will be happy to take it up
@kharekartik: You can also use our Java client (non-JDBC) which provides Async query methods.
@yahya.zuberi: Hi Kartik, can we manually create pool? @kharekartik
@ysuo: Hi, I have a tricky question. “NAN” values are stored in my realtime type table. My table config is like below. And the value in Kafka message is like {“key”:{“price_value”:“NAN”}}. Is there some way I can use to filter this NAN messages in the ingestion phrase or in the query phrase? “metricFieldSpecs”: [ { “name”: “price_value”, “dataType”: “DOUBLE” } ]
@kharekartik: Hi You can use `filterConfigs` in the table for this
@parthiv.shah.777: @parthiv.shah.777 has joined the channel
@iladro: @iladro has joined the channel
@tim58jsy: @tim58jsy has joined the channel
--------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pinot.apache.org For additional commands, e-mail: dev-h...@pinot.apache.org

Apache Pinot Daily Email Digest (2022-04-20)

#general

#random

#troubleshooting

#getting-started

Reply via email to