#general


@preetam: @preetam has joined the channel
@amol: @amol has joined the channel
@kruti.chauhan.93: @kruti.chauhan.93 has joined the channel
@karinwolok1: Open CFP if anyone is interested in submitting a talk about the cool stuff you are doing with Apache Pinot! :slightly_smiling_face: CFP closes this Friday!
@pradeepks2003: @pradeepks2003 has joined the channel
@allison: Hi all - We just sent out the first Apache Pinot Newsletter via email to all who subscribed! Here's a public link to view it in case you didn't sign up when you joined Slack:   We'll share this monthly with upcoming meetups/events, Pinot releases, community blogs, and more! If you'd like to get it via email, click the link at the top of the newsletter.
@yifanzhao0210: @yifanzhao0210 has joined the channel

#random


@preetam: @preetam has joined the channel
@amol: @amol has joined the channel
@kruti.chauhan.93: @kruti.chauhan.93 has joined the channel
@pradeepks2003: @pradeepks2003 has joined the channel
@yifanzhao0210: @yifanzhao0210 has joined the channel

#troubleshooting


@preetam: @preetam has joined the channel
@amol: @amol has joined the channel
@amol: Hello team !
@amol: When I'm setting up Kafka and Pinot on AWS I'm facing some issues. I installed Pinot using a Docker container and launched a Kafka container on the same network, but the Kafka container is shutting down automatically. I checked the logs; please have a look into this and help me out. Thanks.
@mags.carlin: Hi team, the query response time is approximately 5 sec every time on Pinot. Are there any ideal configs suggested for a prod environment? Also, we currently see one server taking up 91% of server requests, and we are going to double the CPU for the Pinot server. But I just wanted to check: what are the ideal configs for broker, controller, and zk/server?
@mags.carlin: Cc @mohamed.sultan @mohamedkashifuddin @nadeemsadim
@mags.carlin: @shaileshjha061
@mags.carlin: This is fairly high priority; if you can be of assistance, it would be great!
@gonzalo: Hello team, I’m trying to query JSON data. I have a field called `category` with a JsonArray, for example: `["category1", "category2", "category3"]` and I need to get all rows where `category` contains `category2` I was looking at but I don’t see any way for searching inside a JsonArray. Any idea how it could be done?
  @g.kishore: you can write the query assuming category is single value
  @g.kishore: we do a bunch of things internally to handle arrays
  @gonzalo: Do you mean like this? ```select category from table where category = 'category2' ``` I tried but I don’t get any results
  @g.kishore: You still need to use JSON Match
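  For reference, a `JSON_MATCH` filter along these lines should match rows whose `category` array contains `category2` (a sketch assuming `category` is stored as a JSON column with a JSON index enabled; the exact path syntax can vary across Pinot versions): ```SELECT category FROM myTable WHERE JSON_MATCH(category, '"$[*]" = ''category2''')```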
@luisfernandez: I added a new table yesterday to my Pinot cluster and I wonder what could have caused these metrics to go up a lot
  @luisfernandez: idk if i made the segment size for this particular table too big, since it’s still consuming from yesterday and it hasn’t created new segments
  @mayanks: The off heap usage is from consuming segments
  @mayanks: What’s your setting for the segment size thresholds?
  @luisfernandez: ```
"realtime.segment.flush.threshold.rows": "500000000",
"realtime.segment.flush.threshold.time": "24h",
"realtime.segment.flush.segment.size": "250M"
```
  @luisfernandez: may have gone too crazy with that first
  @npawar: You should set the rows threshold to 0, only then will the size threshold kick in
  @npawar: Other than that the config looks fine
  @luisfernandez: :pray: thank you
  @luisfernandez: can I do that while my segment is already going? what would happen to it
  @npawar: You can, it should apply from the next segment onwards. But if this consuming segment is looking like it might create problems, it might even be a good idea to restart the servers so that this segment also picks up the config
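  Putting that suggestion together, the flush configs from the thread would become something like this (a sketch; with the rows threshold set to 0, only the size threshold drives segment completion): ```"realtime.segment.flush.threshold.rows": "0", "realtime.segment.flush.threshold.time": "24h", "realtime.segment.flush.segment.size": "250M"```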
@kruti.chauhan.93: @kruti.chauhan.93 has joined the channel
@mosiac: Hello, I'm thinking of a way of deploying Pinot on k8s suitable for my internal systems. Is the following an option: deploying servers, brokers, and controllers without persistent storage and relying on S3 deep storage for data persistence? PS: I'm only interested in streaming data. From my understanding this would only really affect the server, since it stores completed segments both on local disk and in S3 (and probably prefers reading from local disk rather than downloading from S3). Do controllers and brokers make use of persistent storage in any way?
@g.kishore: Brokers don’t need persistent storage..
@g.kishore: Controllers don’t need it if you use URI based segment push
@mosiac: Thanks! Do you think I can get away with using ephemeral local storage for the server? Documentation on how deepstorage is used is sparse. Do servers save backups of segments right after they are created/completed?
@g.kishore: Yes. Servers back up segments immediately as part of segment commit
@g.kishore: Ephemeral storage is fine as long as you plan for startup of a new server - delay while segments get pulled from deep store
  @mayanks: To add, ephemeral storage would need to be large enough to store what you need to store per server.
@mosiac: That's a good point, didn't think of that, ty
@pradeepks2003: @pradeepks2003 has joined the channel
@yifanzhao0210: @yifanzhao0210 has joined the channel

#pinot-dev


@ken: We noticed that our automated running of the batch ingestion job (via `bin/pinot-admin.sh`) wasn’t stopping when batch ingest failed. In looking at `PinotAdministrator`, it seems like you have to set the `pinot.admin.system.exit` System property to `true` for this to work. Any reason why `pinot-admin.sh` shouldn’t be setting this to true if `JAVA_OPTS` isn’t specified? E.g. something like ```
if [ -z "$JAVA_OPTS" ] ; then
  ALL_JAVA_OPTS="-Xms4G -Dlog4j2.configurationFile=conf/log4j2.xml -Dpinot.admin.system.exit=true"
else
  ALL_JAVA_OPTS=$JAVA_OPTS
fi
```
  @ken: And for some reason, out of all the sub-commands, only `LaunchDataIngestionJobCommand` has its own `main()` method. Any reason why it needs a main method?
  @g.kishore: probably used in some test cases and we dont want them to exit
  @g.kishore: thats just my guess
  @ken: Yes re not always calling System.exit(), but wondering why the bash script doesn’t set up for it to return a status code that way…since I don’t think the test cases are using `pinot-admin.sh`, right?
  @g.kishore: yeah, not sure why.

#getting-started


@kangren.chia: hello, i’d like to clarify the usage of dimension tables - can i use the columns in `dimTable` but not `factTable` to filter in the WHERE clause? ```
Table factTable:
  string uuid
  int metric
  timestamp event_time
  string status
``` ```
Table dimTable:
  string uuid
  string name
  string country
``` ```
SELECT f.uuid, d.name, d.country, ABS(SUM(f.metric)) AS sum_metric
FROM factTable f
JOIN dimTable d ON f.uuid = d.uuid
WHERE d.country IN ('USA')
GROUP BY 1, 2, 3
ORDER BY 2
```
@arun11299: @arun11299 has joined the channel
@arun11299: Hello folks, can someone point me to a document about how segments are read from both local storage and deep storage? Can the cluster automatically recover from deep storage when the local segment store is cleared? I basically want to know what the read/write path is in the presence and absence of deep storage.
  @mayanks: This video might help:
  @mayanks: But I can help with any specific questions as well. Yes, the cluster can automatically recover from deep storage when the local segment store is cleared.
  @mayanks: A deep or persistent store attached to the Pinot controller is definitely advisable for fault tolerance
@arun11299: Thanks in advance.