Apache Pinot Daily Email Digest (2021-12-08)

Pinot Slack Email Digest Wed, 08 Dec 2021 18:00:32 -0800

#general

@sania: @sania has joined the channel
@javier: @javier has joined the channel
@donomar: @donomar has joined the channel
@tiger: Is there a way to set up the access configurations to easily limit tables to certain users? It looks like right now we can only limit users to certain tables?
@mayanks: Pinot doesn't have RBAC today.
@mayanks: It does, however, has the interfaces you can use to plug in your access control
@tiger: got it, thanks!
@tiger: Is there an existing way to set different access for different tenants?
@tiger: @mayanks For adding a custom access control, is there a way to just add it as a Java plugin? Or will I have to build all of pinot?
@suraj: Hello - we are exploring storing metrics at higher granularities by rolling up the data at lower granularities. Ex: 1s metrics rolled up and stored at 1 min granularity. Does pinot support percentile aggregations ?
@mayanks: Yes, Pinot supports percentile, as well as approximate percentile (using TDigest). For your case, since you are rolling up you might explore percentileTDigest as it is an additive function.
@suraj: thank you will explorer that !
@acommike: @acommike has joined the channel

#random

@sania: @sania has joined the channel
@javier: @javier has joined the channel
@donomar: @donomar has joined the channel
@acommike: @acommike has joined the channel

#troubleshooting

@kangren.chia: there seems to be a hard limit of 1 million rows returned by pinot even with using `LIMIT` way beyond that; any way to remove this? currently using the `latest-jdk11` image for pinot on kubernetes unfortunately, i can’t seem to use `IN_SUBQUERY` to represent the userid set, so on the client side i break my pinot queries into 1. fetch userids query (using a GROUP BY + HAVING query). sometimes i may get more than a million user ids 2. do final query
@ahmednagwa6: guys do i need to add any configuration related to s3 for realtime table configs i make it before work but lost my change i did most of config related to cotroller and server but not sure if i still need to add something let my tabe use s3 as a deep store segment realtime.segment.download.url: s3 path
@dunithd: The following documentation has all the required configurations to use S3 as a deep storage.
@ahmednagwa6: I saw it but my question is there is any extra config you need to provide in the case of real-time table "I do not see it in any documentation"
@ahmednagwa6: Thank you so much, as I did this config carefully and now it is not working my realtime table. Also i knew that this conig is being appended to pinot-controller.conf but i have in the /var/pinot/,... bla bla i see other file in opt/pinot with same name that not have the s3 config appended correclt .. "I am not sure which config the pod use it "
@dunithd: Can you share your table config here?
@ahmednagwa6: next_cdc_realtime_intentions_table_config.json: |- { "tableName": "next_intentions_trial4", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "ts_ms", "timeType": "MILLISECONDS", "schemaName": "next_intentions_schema2", "replicasPerPartition": "1", "allowNullTimeValue": true, "completionConfig": { "completionMode": "DOWNLOAD" }, "segmentPushType": "APPEND" }, "tenants":{}, "tableIndexConfig": { "loadMode": "MMAP", "nullHandlingEnabled": true, "streamConfigs": { "streamType": "kafka", "stream.kafka.consumer.type": "simple", "stream.kafka.topic.name": "my-topic", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.hlc.zk.connect.string": "pinot-zookeeper.pinot-quickstart.svc:2181", "stream.kafka.zk.broker.url": "pinot-zookeeper.pinot-quickstart.svc:2181", "stream.kafka.broker.list": "my-cluster-kafka-bootstrap.kafka.svc:9092", "realtime.segment.flush.threshold.time":"0h1m", "realtime.segment.flush.threshold.rows":"2", "outputDirURI": "", "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS", "input.fs.prop.region": "us-east-1", "realtime.segment.download.url": "", "stream.kafka.consumer.prop.auto.offset.reset": "smallest" } }, "routing":{ "instanceSelectorType":"strictReplicaGroup" }, "metadata": { "customConfigs": {} }, "upsertConfig":{ "mode":"FULL", "hashFunction":"NONE" }, "ingestionConfig":{ "transformConfigs":[ { "columnName":"ts_ms", "transformFunction":"JSONPATHLONG(json_format(payload),'$.ts_ms')" } , { "columnName":"id", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.id')" }, { "columnName":"created", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.created')" }, { "columnName":"modified", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.modified')" }, { "columnName":"uuid", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.uuid')" }, { "columnName":"intention_ref", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.intention_ref')" }, { "columnName":"confirmed", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.confirmed')" }, { "columnName":"live", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.live')" }, { "columnName":"user_id", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.user_id')" }, { "columnName":"stream_operation", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.op')" } ] } }
@ahmednagwa6: `next_cdc_realtime_intentions_table_config.json: |-` { "tableName": "next_intentions_trial4", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "ts_ms", "timeType": "MILLISECONDS", "schemaName": "next_intentions_schema2", "replicasPerPartition": "1", "allowNullTimeValue": true, "completionConfig": { "completionMode": "DOWNLOAD" }, "segmentPushType": "APPEND" }, "tenants":{}, "tableIndexConfig": { "loadMode": "MMAP", "nullHandlingEnabled": true, "streamConfigs": { "streamType": "kafka", "stream.kafka.consumer.type": "simple", "stream.kafka.topic.name": "my-topic", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.hlc.zk.connect.string": "pinot-zookeeper.pinot-quickstart.svc:2181", "stream.kafka.zk.broker.url": "pinot-zookeeper.pinot-quickstart.svc:2181", "stream.kafka.broker.list": "my-cluster-kafka-bootstrap.kafka.svc:9092", "realtime.segment.flush.threshold.time":"0h1m", "realtime.segment.flush.threshold.rows":"2", "outputDirURI": "", "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS", "input.fs.prop.region": "us-east-1", "realtime.segment.download.url": "", "stream.kafka.consumer.prop.auto.offset.reset": "smallest" } }, "routing":{ "instanceSelectorType":"strictReplicaGroup" }, "metadata": { "customConfigs": {} }, "upsertConfig":{ "mode":"FULL", "hashFunction":"NONE" }, "ingestionConfig":{ "transformConfigs":[ { "columnName":"ts_ms", "transformFunction":"JSONPATHLONG(json_format(payload),'$.ts_ms')" } , { "columnName":"id", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.id')" }, { "columnName":"created", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.created')" }, { "columnName":"modified", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.modified')" }, { "columnName":"uuid", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.uuid')" }, { "columnName":"intention_ref", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.intention_ref')" }, { "columnName":"confirmed", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.confirmed')" }, { "columnName":"live", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.after.live')" }, { "columnName":"user_id", "transformFunction":"JSONPATHLONG(json_format(payload),'$.after.user_id')" }, { "columnName":"stream_operation", "transformFunction":"JSONPATHSTRING(json_format(payload),'$.op')" } ] } }
@ahmednagwa6: this is a test config not recommended as having too many segments or too little segment is a tradeoff ..what i ask and focus on. Is there is any table config needed to use S3 as a deepstore for segments . things are not working smoothly using the the doc ... I use k8s latest pinot but having the config on values.yaml folder extra config key for controller and server related to S3 "Not giving me the expected result"
@mark.needham: so is the problem that you aren't seeing anything written to the S3 bucket/
@ahmednagwa6: yes also from the controller logs too
@ahmednagwa6: i do not find my bucker name or anything related to even security or so
@ahmednagwa6: also /opt/pinot/conf/pinot-controller.conf not have the configuration it only presented in /var/pinot/controller/config/pinot-controller.conf which make sense according the chart but i do not know which file is used as the conf to the controller
@ahmednagwa6: Thanks mark, i also want to mentioned that i made this work before but i lost my changes .. I was wondering if you have clear steps beside what is already provided
@mark.needham: so I've been reading through the code to make sure that the config suggested in that article are correct and it looks like they are. From what I can tell, with this config the server will send the segment file to the controller, which will then upload it to S3. So it seems like the problem we have right now is understanding in which file these config need to be configured.
@mark.needham: let me check where they are read from and reply back
@mark.needham: one way to pass the values through is via `StartController -config <file>` but let me see if there's another way that doesn't require a file
@ahmednagwa6: # Extra configs will be appended to pinot-controller.conf file extra: configs: |- .hostname=true controller.task.scheduler.enabled=true controller.data.dir= controller.local.temp.dir=/tmp/pinot-tmp-data/ pinot.controller.storage.factory.s3.disableAcl=false pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS pinot.controller.storage.factory.s3.region=region pinot.controller.segment.fetcher.protocols=file,http,s3 pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
@mark.needham: this is for the helm chart
@mark.needham: ?
@ahmednagwa6: yes i did so as following on k8s values.yaml
@ahmednagwa6: for both server and controller per this doc
@ahmednagwa6: the issue in the helm chart args: [ "StartController", "-configFileName", "/var/pinot/controller/config/pinot-controller.conf" ] it do what you mention
@mark.needham: yeh I don't think what I suggested will work very well for K8s
@ahmednagwa6: thats what i believe too as it's not working . Do you think i put the right table config needed for s3 ?
@mark.needham: what you have looks like it should work (to me at least)
@mark.needham: but obv it doesn't, so I guess it's not getting picked up for some reason
@mark.needham: when you look at `/var/pinot/controller/config/pinot-controller.conf"` you said it has the values you set right?
@ahmednagwa6: yes
@mark.needham: and you said there's nothing under `logs/pinot-all.log` on the controller?
@ahmednagwa6: no i meant the container logs kubectl logs -n=pinot-quickstart pinot-controller-0 -f
@mark.needham: are you able to check the logs on one of the controller pods?
@mark.needham: that should have more logging
@ahmednagwa6: Sure i willl check it now
@ahmednagwa6: i have this warning 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'realtime.segment.flush.threshold.rows' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'outputDirURI' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.hlc.zk.connect.string' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'realtime.segment.download.url' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.decoder.class.name' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'streamType' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'input.fs.className' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'input.fs.prop.region' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.consumer.type' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.broker.list' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'realtime.segment.flush.threshold.time' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.zk.broker.url' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.consumer.factory.class.name' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.consumer.prop.auto.offset.reset' was supplied but isn't a known config. 2021/12/08 14:18:23.444 WARN [ConsumerConfig] [pool-8-thread-4] The configuration 'stream.kafka.topic.name' was supplied but isn't a known config.
@ahmednagwa6: I have this warning. is it related to depreication or version issue ... but how is consuming my data correct from my kafka topic ?
@mark.needham: do you see this line anywhere? `Initializing PinotFSFactory`
@ahmednagwa6: nope mark :slightly_smiling_face:
@mark.needham: hmmm, ok
@mark.needham: how about: `Initializing SegmentFetcherFactory`
@ahmednagwa6: no as well
@mark.needham: hmmm, that's strange. In theory those messages should be there when the controller starts.
@mark.needham: e.g. this is on a locally started controller: ```2021/12/03 14:38:17.684 INFO [BaseControllerStarter] [main] Initializing PinotFSFactory 2021/12/03 14:38:17.687 INFO [BaseControllerStarter] [main] Initializing ControllerFilePathProvider 2021/12/03 14:38:17.689 INFO [ControllerFilePathProvider] [main] Data directory: file:/tmp/data/PinotController 2021/12/03 14:38:17.693 INFO [ControllerFilePathProvider] [main] Local temporary directory: /tmp/data/PinotController/192.168.144.3_9000```
@mark.needham: although I haven't defined any factories so it doesn't register anything
@mark.needham: do a grep for `"\[BaseControllerStarter"`
@mark.needham: see what it says is happening when the controller starts
@mark.needham: I'm not sure about the kafka config thing - those exact configs are used for a bunch of Pinot's QuickStart examples! :thinking_face:
@ahmednagwa6: kafka is working fine .. i had all streams of data being injested successfully also create segment ever two row entries as test
@ahmednagwa6: root@pinot-controller-0:/opt/pinot/logs# cat pinot-all.log | grep "\[BaseControllerStarter" root@pinot-controller-0:/opt/pinot/logs#
@ahmednagwa6: return nothing. i mean based on your discussion i can look and see why i do not see such messages and update you .. how is that sound use your intuition as a starting point ?
@ahmednagwa6: and update uou
@ahmednagwa6: you *
@mark.needham: I mean maybe they got truncated
@mark.needham: they would only be there right at the beginning
@mark.needham: when you started the controller
@mark.needham: I was wondering whether the S3 provider is being registered
@mark.needham: it should be based on the config you shared
@mark.needham: let me check the log messages around copying segments. Let's see if we can find anything
@ahmednagwa6: Thank you ! sounds good too
@mark.needham: do you find anything if you grep for `LLRealtimeSegmentDataManager`?
@mark.needham: you don't need to paste everything if you do
@mark.needham: just wanted to see if that finds something
@ahmednagwa6: yes only one message .
@mark.needham: what's it say?
@ahmednagwa6: starting with running frequency of 3000 secs
@mark.needham: ok for sanity's sake, you don't see anything if you grep for `HLRealtimeSegmentDataManager`
@ahmednagwa6: no
@ahmednagwa6: actually i grep for my bucket name on the log file and i see messages related to s3 for firstime
@ahmednagwa6: [SegmentDeletionManager] [grizzly-http-server-0] Moved segment next_intentions_trial4__0__9__20211208T1017Z from file:/var/pinot/controller/data, ler-data/next_intentions_trial4/next_intentions_trial4__0__9__20211208T1017Z to file:/var/pinot/controller/data, ns_trial4__0__9__20211208T1017Z
@mark.needham: that would be if you deleted a table at some pooint?
@mark.needham: I guess that was when you had it wired up and working!
@ahmednagwa6: No when had it wired up was on diff table and i still have its data on s3 .. But yes while testing config with this new intention bla table i did deleted it multible times
@mark.needham: oh I see
@mark.needham: the file name looks wrong
@mark.needham: it should be file:// or s3:///
@mark.needham: it seems to have concatenated them
@mark.needham: so it thinks our file type is local file somehow
@mark.needham: it's missing the s3 scheme
@mark.needham: one sec lemme see why
@ahmednagwa6: yes i recognized this as well but not know why ... Thank you *
@mark.needham: does that mean it's trying to write stuff to `file:/var/pinot/controller/data,`
@mark.needham: is that even a valid path. See if there's anything there?
@mark.needham: ```URI deletedSegmentDestURI = URIUtils.getUri(_dataDir, DELETED_SEGMENTS, rawTableName, URIUtils.encode(segmentId));``` segmentId = `next_intentions_trial4__0__9__20211208T1017Z` rawTableName = `next_intentions_trial4` dataDir = `file:/var/pinot/controller/data,`
@ahmednagwa6: well i see now two entry points controller.data.dir=/var/pinot/controller/data controller.data.dir=s3 one in the pinot-controller.conf
@mark.needham: ah - delete the `/var/pinot/controller/data` one
@ahmednagwa6: Yes now things are working properly .. Thank you so much @mark.needham yes in values.yaml file the chart take both the exrtra config and another default value for the data dir
@ahmednagwa6: Thanks so much again sir
@mark.needham: awesome! No worries, that was a good journey through the code for me :smile:
@mark.needham: I guess the validation should throw an error if you define the same value twice. I think the reason it doesn't is that it delegates that to an apache library
@ahmednagwa6: It was my issue but actually the debug journey of you showed me alot of stuff that i googled and google to become better .. Thank so much for the support man !
@sania: @sania has joined the channel
@javier: @javier has joined the channel
@humengyuk18: Hi all, Is there any docs explaining how to update to a new release? When I upgrade to the new version, new instance will fail to start, it complains an instance with the same name exist. Currently, I can delete that node from zookeeper’s LiveInstance section, then I can drop it, and the new version will start normally. Is there a better way doing this?
@ssubrama: Does this help?
@donomar: @donomar has joined the channel
@bagi.priyank: are segments uploaded from all servers to controller? how can i disable it, assuming it is ok to do that?
@ssubrama: I dont seem to get the context of your question. Is your table a realtime table? The way this works is that after every so many rows are consumed, the servers persist the segment . Are you wanting to stop this step? Why? If you explain the problem you are trying to solve, then it can help
@bagi.priyank: hello subbu. you are correct. i have configured the table to be realtime and it is consuming from a kafka topic. right now replication is set to 1 and retention to 12 hours. i notice segments under `/tmp/data/PinotController/<table_name>/` and `/tmp/data/PinotController/Deleted_Segments/` . i am periodically running into no space on controller instances. this is not happening on server instances. so i was wondering if segments were getting copied from server instances to controller instances, and if it is okay to disable that, how do i do it? both server instances and controller instances have same disc size of 1 TB.
@bagi.priyank: or do you recommend configuring a segment store using s3 to avoid this issue?
@mayanks: Controller data dir (typically a deep-store) is the durable storage needed by Pinot. Think of new servers being added, or existing servers losing data. They will download it from this persistent store. So you definitely shouldn't disable that (moreover, no way to do so).
@mayanks: The `Deleted_Segments` is something that might be avoidable (it is where the deleted segments are kept for sometime before being completely deleted).
@bagi.priyank: thank you for validating my understanding. just to confirm - once i configure s3 as a segment store / deep-store i should stop running into disc issue? or will the segments still be copied over to controller first, and to be copied to s3?
@bagi.priyank: i haven't looked for it yet - but is there a retention on segment store via pinot? or is it expected that user should take care of it themselves?
@mayanks: Once you have deepstore, you won't have this issue.
@mayanks: Retention specified on table config applies to segment store as well
@bagi.priyank: perfect, thank you!
@ssubrama: All deleted segments are moved to the `Deleted_Segments` folder. The retention on that folder is configurable.
@ssubrama: On another note, if you have a retention of 12 hrs, that does not mean that segments are deleted exactly within 12 hours of creation. Two points to note here: 1. Time check is done with the value of the time column you have configured. So, if you consume a segment with time value in future (e.g. `now + 1 day` , for whatever reason) and the retention is conifigured to be 12 hrs, it will not be reomoved soon. 2. The retention manager works periodically (you can adjust the period). Each time it runs, it will delete the segments that are old _at the time of run_. Nothing else will be deleted before the next run.
@bagi.priyank: ah, thank you for sharing this subbu. i am using event timestamp and should probably switch to server timestamp
@acommike: @acommike has joined the channel

#pinot-s3

@ahmednagwa6: @ahmednagwa6 has joined the channel

#getting-started

@vibhor.jain: Hi Team, Is there any support for UNION/UNION ALL type of queries in Pinot? Tried few but no luck. Sample query: _select 'Poor' as grade union all select 'Good' as grade_
@mark.needham: No I don't think union queries are supported. Not sure if it's something that's been asked for before, @mayanks might know
@g.kishore: We did think about it but in a different context - ingest from multiple Kafka topics in two Pinot tables but provide a logical abstraction on top of them
@g.kishore: We did not pursue
@g.kishore: What is your usecase
@zeke.dean: @zeke.dean has joined the channel

#pinot-trino

@hashhar: @hashhar has joined the channel
--------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]