#general


@akizminet: @akizminet has joined the channel
@g.s.thadani: @g.s.thadani has joined the channel
@diogo.baeder: Hey guys, I'd like to ask a question which is not really a problem, but rather just a curiosity about how an aspect of the system works: every time I spin up Pinot with my docker-compose, create the tables, add data and query it for the first time, it doesn't query as fast as I'd like, but then on the second and subsequent queries it gets blazing fast, even if I change many constraints in my query. I know that Pinot doesn't do "caching", but why is there such a big difference in query times? For example, it may drop from 900ms on the first query to 40ms, 30ms or even lower on the second, third, fourth etc. queries.
  @mayanks: Pinot memory maps segments, so there’s an initial warmup.
  @diogo.baeder: Ah, got it. And that happens on the Server, I presume? And for how long does it keep them mapped?
  @mitchellh: as long as possible. One of the fun parts of mmap is that it doesn't cost much unless the segment is pulled into actual RAM.
  @mitchellh: the JVM & Linux kernel, and Pinot itself, do a good job of bringing/keeping segments in memory via mmap and removing them when necessary.
  @diogo.baeder: Got it. Nice! But what determines that the file should be kept there? Like, when should I expect it not to be there mapped anymore?
  @mitchellh: is the best explanation I've seen on this. The TLDR is "when necessary"
  @mitchellh: necessary meaning something else could use that ram more often than what's currently using it
  @diogo.baeder: Ah, so it's more determined by the OS than Pinot itself, got it. Thanks man! :slightly_smiling_face:
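  The warmup effect isn't Pinot-specific; any memory-mapped file behaves this way. Below is a minimal, self-contained Java sketch (illustrative only, not Pinot code; the file name is made up and the file is assumed to be smaller than 2 GB) that maps a file and scans it twice: the first scan pays the page faults and disk reads, the second is served from the OS page cache.
```
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapWarmupDemo {
  public static void main(String[] args) throws Exception {
    Path file = Paths.get("columns.psf"); // hypothetical file standing in for a segment

    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      // Mapping is cheap: no data is read here, the OS only sets up the mapping.
      MappedByteBuffer buffer =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

      long checksum = 0;

      // Cold scan: touching each page triggers a page fault and a read from disk.
      long start = System.nanoTime();
      for (int i = 0; i < buffer.limit(); i += 4096) {
        checksum += buffer.get(i);
      }
      System.out.printf("cold scan: %d ms%n", (System.nanoTime() - start) / 1_000_000);

      // Warm scan: the same pages are now in the OS page cache, so this is much faster,
      // until the kernel decides it needs that memory for something else.
      start = System.nanoTime();
      for (int i = 0; i < buffer.limit(); i += 4096) {
        checksum += buffer.get(i);
      }
      System.out.printf("warm scan: %d ms (checksum=%d)%n",
          (System.nanoTime() - start) / 1_000_000, checksum);
    }
  }
}
```
  The gap between the two scans is the same effect seen between the first and second query against a freshly started server.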
@nadagoub: @nadagoub has joined the channel
@rodseidel: @rodseidel has joined the channel
@msharma: @msharma has joined the channel
@mayanks: This is a great talk from the Cisco Webex team on how they evaluated Apache Pinot against other systems; please feel free to sign up:

#random


@akizminet: @akizminet has joined the channel
@g.s.thadani: @g.s.thadani has joined the channel
@nadagoub: @nadagoub has joined the channel
@rodseidel: @rodseidel has joined the channel
@msharma: @msharma has joined the channel

#troubleshooting


@akizminet: @akizminet has joined the channel
@g.s.thadani: @g.s.thadani has joined the channel
@deemish2: Hi Team, I am trying to execute a Pinot ingestion job with segment name type 'fixed'. The input data is in different directories (dir Batch1 - part-**.avro, dir Batch2 - part-00**.avro, etc.). I would like to generate segments with fixed segment names, e.g. segment name Batch1, segment name Batch2, etc. Can anyone please help with the same?
  @diogo.baeder: How are you doing ingestion? Through preconfigured batch jobs?
  @diogo.baeder: If yes, then you could use this in your job config:
```
segmentNameGenerator:
  type: fixed
  configs:
    segment.name: my_segment_name
```
  @deemish2: yes, but it will create a single segment from all the avro files. My requirement is to create a segment per directory, e.g. the segment names should be batch1, batch2, etc.
  @diogo.baeder: Will the directories be fixed? Or will they be dynamic, like, in the future you want a "Batch3" directory and want it to be automatically consumed as a segment? If they're fixed, it's just a matter of configuring new jobs; if they're dynamic, then I don't know how to solve that...
  @mayanks: Do you need a single segment for each directory? What if a directory has so much data that it doesn't fit in one segment? Or are you suggesting batch1 as a prefix?
@prashant.pandey: Hi team, how does Pinot encode byte columns to display on the UI? Are they encoded as hex strings?
  @kharekartik: Yes, they are encoded as Hex strings.
```
/**
 * Converts the byte array to a Hex encoded string.
 *
 * @param bytes byte array
 * @return Hex encoded string
 */
public static String toHexString(byte[] bytes) {
  return Hex.encodeHexString(bytes);
}
```
  @prashant.pandey: Ah, thanks :slightly_smiling_face:
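  Going the other way (turning the hex string shown in the UI back into bytes on the client) can be done with the same commons-codec `Hex` class used in the snippet above. A small sketch, assuming commons-codec is on the classpath (`Hex.decodeHex(String)` needs version 1.11+; older versions take a `char[]`):
```
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;

public class HexRoundTrip {
  public static void main(String[] args) throws DecoderException {
    byte[] original = {0x0A, 0x1B, 0x2C};

    // Encode the way the UI does, per the snippet above.
    String hex = Hex.encodeHexString(original);   // "0a1b2c"

    // Decode the displayed string back into raw bytes.
    // On commons-codec < 1.11 use Hex.decodeHex(hex.toCharArray()) instead.
    byte[] decoded = Hex.decodeHex(hex);

    System.out.println(hex + " -> " + decoded.length + " bytes");
  }
}
```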
@nadagoub: @nadagoub has joined the channel
@luisfernandez: anyone know why in my pinot metrics when i send a query i get `"numServersQueried": 4,` but i only have 2 servers o.O ?
  @mayanks: Offline and real-time ?
  @luisfernandez: ah forget i even said anything
  @mayanks: Said what :grinning:?
@tiger: Hi, I noticed some strange behavior when setting `realtime.segment.flush.threshold.rows` for my realtime tables. It seems that the actual number of rows per segment becomes some value smaller than the value I set. For example, I'll set this to 1000000, but in the segment metadata, `segment.flush.threshold.size` would be 500000 and the segment only ingests 500000 rows. This seems to only happen for some tables, and sometimes it is shrunk by a factor of 2 or 4. Just wondering if there is any other setting I'm missing that is causing this?
  @mayanks: Can you check if it gets divided by number of partitions?
  @tiger: Oh that might be it. In one example, I have about 20 partitions and it gets divided by 4. Is this expected?
  @mayanks: Yes if you have each server consuming 4 partitions
  @mayanks: If the doc is not explicit enough we should fix that cc: @mark.needham
  @tiger: I see, does the number include replicas as well?
  @tiger: I couldn't find anything in the docs explaining this
  @mayanks: It boils down to how many partitions a server has to consume (function of how many partitions there are, how many servers, and what’s the replication)
  @tiger: got it, thanks!
  @tiger: Does this also apply to setting `realtime.segment.flush.threshold.segment.size` ? it looks like when I use that, alongside `realtime.segment.flush.autotune.initialRows` , it actually makes each segment that size.
  @mayanks: No, what I mentioned above only applies to `realtime.segment.flush.threshold.rows`
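  To make the division above concrete, here is a small worked example in Java (the topology numbers are hypothetical, not taken from this thread):
```
public class FlushThresholdMath {
  public static void main(String[] args) {
    // Table config: realtime.segment.flush.threshold.rows
    int thresholdRows = 1_000_000;

    // Hypothetical topology: 20 stream partitions, 5 servers, replication 1.
    int partitions = 20;
    int servers = 5;
    int replication = 1;

    // Each server ends up consuming roughly this many partitions.
    int partitionsPerServer = (int) Math.ceil((double) partitions * replication / servers); // 4

    // The row threshold is split across the partitions a server consumes, which is
    // the smaller value that shows up as segment.flush.threshold.size in the metadata.
    int rowsPerSegment = thresholdRows / partitionsPerServer; // 250,000

    System.out.printf("Each consuming segment flushes at ~%d rows%n", rowsPerSegment);
  }
}
```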
@luisfernandez: hey friends, question: why do we consider queries that take more than 100ms to be slow? in our current cluster we have some queries that are taking more than 100ms to execute, is that a reason to be worried?
  @mayanks: No, every use case is different and can have different SLA. This is just some historic threshold for logging.
  @luisfernandez: lol it does say TODO:// make it configurable
  @mayanks: Yeah, really old TODO :wink:
@rodseidel: @rodseidel has joined the channel
@msharma: @msharma has joined the channel
@abhinav.wagle1: Hello, any pointers on how to make the Pinot controller load balancer URL accessible only inside a VPN while deploying with the Pinot Helm chart on AWS?
  @mayanks: @xiangfu0 ^^

#getting-started


@akizminet: @akizminet has joined the channel
@g.s.thadani: @g.s.thadani has joined the channel
@ysuo: Hi, I'm using Pinot upsert mode for one table and have already stored some days' worth of data. If I change the primaryKeyColumns config in the schema, such as adding or deleting a field, do I need to delete and re-upload the schema and table config to make this change take effect?
  @jackie.jxt: @ysuo Currently upsert table is able to re-construct the metadata on primary key changes. You may upload the new schema and restart the servers for the table to apply the changes
  @kharekartik: I wasn't aware of that. Thanks a lot! Deleting my previous comment to not cause any confusion.
  @ysuo: Got it. Thanks.
@nadagoub: @nadagoub has joined the channel
@rodseidel: @rodseidel has joined the channel
@msharma: @msharma has joined the channel

#introductions


@akizminet: @akizminet has joined the channel
@g.s.thadani: @g.s.thadani has joined the channel
@nadagoub: @nadagoub has joined the channel
@rodseidel: @rodseidel has joined the channel
@msharma: @msharma has joined the channel