#general
@niels.it.berglund: @niels.it.berglund has joined the channel
@aishee: @aishee has joined the channel
@mikexia: @mikexia has joined the channel
@humengyuk18: Hi team, when using upsert with a realtime table, can we do segment compaction or merging for committed segments? Like merging multiple small segments into one large segment. If not, how should we deal with too many small segments when using upsert? @yupeng @jackie.jxt
@yupeng: upsert in pinot does not use compaction; it uses metadata to track the records with the same key. you can find the design details in this doc
@yupeng: segment size can be controlled via threshold, which is separate
@humengyuk18: I see, so in the current design there is no way to merge multiple small segments into larger segments for an upsert table? We can only control segment size during ingestion?
@yupeng: that’s right. take a look at these configs
@yupeng: ```
"realtime.segment.flush.threshold.size": "0",
"realtime.segment.flush.threshold.time": "24h",
"realtime.segment.flush.desired.size": "50M"
```
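(These keys normally sit inside the `streamConfigs` block of the realtime table config. The sketch below only shows where they go and uses placeholder topic/broker values; it is not a complete, verified config.)
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "<topic>",
  "stream.kafka.broker.list": "<broker>:9092",
  "realtime.segment.flush.threshold.size": "0",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.desired.size": "50M"
}
```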
@jackie.jxt: These are two separate topics. Segment merge is not supported yet, and the feature is on our roadmap
@jackie.jxt: Once segment merge is supported, we should be able to merge segments for upsert table
@wrbriggs: Do star-tree index aggregations span segments, or are the aggregates computed per segment?
@g.kishore: per segment
@g.kishore: have been thinking of extending it to span multiple segments - do you have a concrete case?
@wrbriggs: Awesome, thank you. So theoretically, I should be able to make use of star-tree aggregates without a time column, and Pinot will (or could) merge the aggregates on relevant segments after partition pruning?
@g.kishore: yes
@wrbriggs: My use case actually is better if they are per-segment
@vadlamani1729: @vadlamani1729 has joined the channel
#random
@niels.it.berglund: @niels.it.berglund has joined the channel
@aishee: @aishee has joined the channel
@mikexia: @mikexia has joined the channel
@vadlamani1729: @vadlamani1729 has joined the channel
#troubleshooting
@niels.it.berglund: @niels.it.berglund has joined the channel
@aishee: @aishee has joined the channel
@contact: Hey, quick question: we have a realtime segment marked as completed and we would like to move it to an offline table, however the endpoint to download the segment (`GET /segments/{tableName}/{segmentName}`) is trying to fetch it from the deep store. I was just thinking of downloading it and uploading it to the offline table directly, how could I achieve this? Thanks
@wrbriggs:
@wrbriggs:
@wrbriggs: @contact This should be possible using a minion based on the second link
@contact: Yeah i saw that but i don't really want the minion to rebuild the segment, i just want to move them as-is
@contact: Is there a way to do this @wrbriggs ?
@contact: I'm actually looking deep into how the task works and I'm seeing `realtimeSegmentZKMetadata.getDownloadUrl()`, which I haven't followed yet
@contact: but i guess i could find my response there ?
@wrbriggs: I’m not sure - the tricky part is that the offline table might not have the same partitioning / sorting / indexing as the realtime table, and the *`RealtimeToOfflineSegmentsTask`* handles that generically for you - by simply moving the segments as-is, you are kind of shoehorning yourself into never diverging the offline table.
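(For reference, a rough sketch of how the `RealtimeToOfflineSegmentsTask` is typically enabled in the realtime table config, assuming a minion is running; the period values below are placeholders, not a verified setup.)
```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "1d"
    }
  }
}
```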
@contact: Well i configured the realtime table the exact same as the offline so i should be fine right ?
@wrbriggs: For the short-term, yes
@contact: What would be the problem in the long term ? I mean if i have an issue i can just re-index the segment and re-upload it ?
@wrbriggs: It’s not an approach I would put into production, though
@contact: From my understanding, the only difference between realtime and offline when a segment is completed is that the realtime table stores it locally while the offline table stores it in the deep store
@contact: I guess i'm missing something ?
@wrbriggs: Realtime tables also push segments to the deep store
@contact: Hmmm, thats surely something i missed
@contact: Is it automatic when the segment is completed ?
@wrbriggs:
@contact: Thanks. Something that I haven't mentioned is that my streams are high level. I'm seeing in the task code that it only works with low level
@wrbriggs: See this as well:
@wrbriggs: Ah, I have no experience using the high level stream consumer, unfortunately. I went straight to low-level based on the limitations of the high level streams.
@contact: Thanks anyway, you were very helpful
@npawar: We don't recommend or maintain high level any more. Any particular reason you are using high level?
@npawar: Simply moving segments to the offline table by downloading and reuploading can give you incorrect results in the time boundary calculation at the brokers
@contact: @npawar We are currently using GCP's Pub/Sub system for Pinot and it doesn't have any "partition" system
@contact: You only get one subscription for every consumer
@mikexia: @mikexia has joined the channel
@wrbriggs: I’m running into a situation where Pinot is using a star-tree index to satisfy a query in one case, but not in another, and the queries are almost identical. This one does not use the star tree:
```
SELECT dimension, SUM(metric) AS totalMetrics
FROM myTable
WHERE otherDimension='filterValue' AND eventTimestamp >= cast(now() - 172800000 as long)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
```
This one uses the star tree:
```
SELECT dimension, SUM(metric) AS totalMetrics
FROM myTable
WHERE otherDimension='filterValue' AND eventTimestamp >= 1611161288000
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
```
It looks like the use of a dynamically-computed timestamp value is confusing the optimizer somehow? The `eventTimestamp` column is not part of my star-tree index in either case.
@mayanks: How did you find out that StarTree was used for one query?
@wrbriggs: Trace
@mayanks: You are right, the code that checks whether star-tree can be used does not handle the expression `cast(now() - 172800000 as long)`. You can file an issue for the same.
@wrbriggs: @mayanks I’m new to the Pinot codebase, but if you can point me in the right direction, I’d be happy to see if I can put together a PR to address this.
@mayanks: Nice
@mayanks: Let me post a pointer here
@mayanks:
@mayanks: Note: after optimization, the LHS and RHS of the time predicate are swapped
@wrbriggs: @mayanks Correct me if I’m wrong, but based on a quick look, it appears that the problem might actually be in `extractPredicateEvaluatorsMap` , and not in `isFitForStarTree`?
@mayanks: yes
@mayanks: `isFitForStarTree` is the high level entry point for you to get the algorithm. The code that passes the predicateColumns is the one that needs to be checked.
@wrbriggs: Got it, perfect, and thank you.
@mayanks: You'll jump through a few classes from this entry point to get to the exact place that needs the fix, but you'll get a better picture of what is going on
@wrbriggs: Yup, looks like the various Aggregation*PlanNode instances. I’ll spend some time on it after I get done my day job, thanks again!
@mayanks: They'll probably lead you to ```StarTreeUtils.extractPredicateEvaluatorsMap```
@wrbriggs: Yeah, they did, that was what I referenced above :slightly_smiling_face:
@mayanks: Oh yeah you did
@wrbriggs: @jackie.jxt Saw you commented on the time column predicate issue… wanted to tag you here for context.
@jackie.jxt: @wrbriggs I think the problem here is that `1611161288000` is in micros instead of millis
@jackie.jxt: Then everything is pruned out
@mayanks: ```I did a simple test to compile the query to BrokerRequest, and do see cast(now() - 172800000 as long) as LHS for predicate. My test did not actually go through any query execution.```
@wrbriggs: @jackie.jxt 1611161288000 is ms since epoch
@jackie.jxt: Oh, sorry my bad
@jackie.jxt: The problem is not how the predicate is evaluated, but why the first query is not converted to the second query on the broker side
@mayanks: Because of use of a function (now()) as opposed to a constant eval? (just guessing)
@jackie.jxt: Seems working without the cast
@wrbriggs: yes, it seems to be the cast. The only reason the cast is necessary is being addressed here:
@mayanks: Yeah, I am working with @amrish.k.lal on ^^
@wrbriggs: Yes, I was just thanking him earlier because I noticed that PR :slightly_smiling_face:
@wrbriggs: I didn’t realize these were so closely intertwined, though
@mayanks: I have concerns with this PR, so we would probably find other ways to fix the problem this PR is trying to address.
@jackie.jxt: The cast is not invoked on broker side because cast is not registered as a scalar function
@wrbriggs: So presumably there’s something buried in the `PinotQuery2BrokerRequestConverter` (or thereabouts) that is reifying eligible expressions into constants (e.g., `now()`) before handing the query off from the Broker for execution, and it isn’t handling `cast` function expressions?
@wrbriggs: ah
@jackie.jxt: We can add a scalar function for cast, then it should work on broker side (compile time)
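(A minimal sketch of what registering such a scalar function could look like, assuming Pinot's `@ScalarFunction` annotation mechanism; the class name, method name, and the single overload shown are hypothetical, not the actual implementation.)
```java
import org.apache.pinot.spi.annotations.ScalarFunction;

public class CastScalarFunctions {

  // Hypothetical compile-time "cast to long", so that literal-only expressions
  // such as cast(now() - 172800000 as long) could be folded into a constant on
  // the broker before the query reaches the servers.
  @ScalarFunction
  public static long castToLong(double value) {
    return (long) value;
  }
}
```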
@pabraham.usa: Hello, is there any way to delete the table config without deleting segments, and then recreate the same table with the same segments from disk, for realtime? I happened to execute a wrong clusterconfig REST call and broke the broker UI. I tried updating it again but no luck, so I'm planning to recreate the table config without losing data.
@mayanks: Not that I am aware of (for realtime). What exactly broke in the broker?
@pabraham.usa: The Cluster Manager section stopped loading. I can see some JS errors in the JavaScript console. All ingestion and searches are working fine.
@mayanks: Restart controller?
@pabraham.usa: not helping, seems like the bad config is somehow coming from zookeeper.
@mayanks: Yeah, can't think of an easy fix
@pabraham.usa: will do the hard way.
@jackie.jxt: If you know which zk record is breaking, you may manually fix it via the zookeeper browser (I assume this is the hard way)
@mayanks: Hard way was delete and recreate table, I think
@ssubrama: For offline segments, you can re-load them from the Deleted_Segments folder in your PinotFS after recreating the table. For realtime tables, you can consume from the earliest available row in the stream. Note that when it starts to consume feverishly, query performance will suffer. Also, if the data has already aged out of the underlying stream's retention, then it is gone forever, sorry.
@pabraham.usa: @jackie.jxt @mayanks correct delete and recreate unfortunately.
@pabraham.usa: As it seems it's not straightforward to edit ZooKeeper data. @ssubrama it is good to know that I can restore from the Deleted_Segments folder. Do you have any link related to this? Also, to fetch data from the stream I may have to somehow specify the offset to start with, right? Where can I specify that?
@pabraham.usa: @ssubrama it actually caught up with the stream from the beginning. Which is good. However the restore took some time. Restoring from Deleted_Segments sounds like the fastest option.
@ssubrama: Note that this restore option is only for offline tables. You can check the root directory of your PinotFS (under which there are table directories), and there should be a folder for deleted segments
@pabraham.usa: Ohh ok, is it possible to manually copy segments created by realtime to the Deleted_Segments folder, then create an offline table from it, and then update the offline table config to realtime?
@ssubrama: Well, the realtime segments are also available in the Deleted_Segments folder. But they cannot be uploaded into the table in any easy manner. If it is a production issue ($$$ at stake), you can probably recover, but it needs a lot of manual work.
@pabraham.usa: Actually my stream retention matches the Pinot retention so there is no data loss. However, what is the best way to do proper backup and restore? Is an S3 or EFS based deep store a good way to go?
@vadlamani1729: @vadlamani1729 has joined the channel
@elon.azoulay: Does anyone here impose a limit on `pinot.broker.query.response.limit`? We are thinking of limiting it to 1k, and were wondering what other Pinot installations use.
@jackie.jxt: This config is used to prevent super expensive queries from exhausting the resources on the servers. Based on your use case, you may choose the value accordingly. If you normally won't run queries with a limit higher than 1000, you can set it to 1000 and it will bound the limit to 1000
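(For reference, this limit is typically set as a broker configuration property; the snippet below is illustrative.)
```
# Bound the maximum LIMIT a query can request on this broker (illustrative value)
pinot.broker.query.response.limit=1000
```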
#query-latency
@falexvr: @falexvr has joined the channel
@falexvr: @falexvr has left the channel
#pinot-dev
@luanmorenomaciel: @elon.azoulay and @mayanks, any YAML file that you could share with me regarding schema registry integration?
@elon.azoulay: For the stream configs here is an example:
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "LowLevel",
  "stream.kafka.topic.name": "<my topic>",
  "stream.kafka.broker.list": "<my broker host>:9092",
  "realtime.segment.flush.threshold.time": "6h",
  "realtime.segment.flush.threshold.size": "0",
  "realtime.segment.flush.desired.size": "200M",
  "stream.kafka.consumer.prop.isolation.level": "read_committed",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
  "stream.kafka.consumer.prop.group.id": "<a uuid>",
  "stream.kafka.consumer.prop.client.id": "<another uuid>",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://<schema registry>:8081"
}
```
@elon.azoulay: We do not use ssl but I believe @mayanks might have more context about that.
@luanmorenomaciel: that's great @elon.azoulay, I'll try to implement it and let you know
#webex
@moonbow: @moonbow has joined the channel
@moonbow: @moonbow set the channel purpose: Cisco Webex