#general
@mrpringle: is there a way to control skipUpsert through the connection string? trying to use Superset to get history, but it doesn't allow us to edit the generated Pinot queries to add the option.
@g.kishore: Don’t think so..
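FWIW `skipUpsert` is a query option that goes in the query text itself rather than the connection string; an illustrative sketch (table name hypothetical): ```SELECT * FROM myUpsertTable option(skipUpsert=true)```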
@roberto: @roberto has joined the channel
@anshuman.bhowmik: Hi! Are there any anomaly detection algorithms that can be used with Pinot?
@mayanks: There's ThirdEye you can look at:
@sachin: @sachin has joined the channel
@jai.patel856: I had a question about enabling upsert on a hybrid table. We currently have separate REALTIME-only and OFFLINE-only tables that we use to maintain two dashboards with different refresh rates, but we oftentimes find them diverging. To address this I'd like to set these up as a single hybrid table (with upsert on the REALTIME sub-table). Is that possible? I identify the following considerations:
1. We will query the tables with explicit `_OFFLINE` and `_REALTIME` suffixes to avoid the broker returning duplicate results.
2. Will the offline table creation ignore the `primaryKeyColumns` in the schema?
3. Is there a mechanism to set common parameters in the table spec, like indexes? Otherwise it looks like the star-tree indexes, streamConfigs, routing and upsertConfig rules could easily be set only on the REALTIME segments.
Will this work?
@jackie.jxt: I'm a little bit confused here. If you are going to call the OFFLINE and REALTIME separately, what benefit do you get from modeling them as one hybrid table?
@jackie.jxt: Upsert doesn't work on hybrid table as of now, so only the records from the REALTIME table can be invalidated
@jai.patel856: For #2 we benefit from being able to model it with a single schema definition that we can share.
@jai.patel856: And for #1 and #3 it's more because we have automation that encapsulates the creation of the schema and the table spec. I was wondering if creating a hybrid table would get the desired results before going and changing the tooling too much.
@jai.patel856: So are the upsert specific settings ignored by offline tables altogether?
@jackie.jxt: Upsert cannot be enabled for an OFFLINE table, or the table validation will throw an exception
@jackie.jxt: I wouldn't suggest using hybrid table if the only benefit is to share a single schema
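For reference, upsert lives only in the REALTIME table config (plus `primaryKeyColumns` in the schema); a minimal illustrative sketch, names hypothetical: ```
{
  "tableName": "myTable",
  "tableType": "REALTIME",
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": {
    "mode": "FULL"
  }
}
```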
#random
@roberto: @roberto has joined the channel
@sachin: @sachin has joined the channel
#troubleshooting
@roberto: @roberto has joined the channel
@roberto: Hi!! I'm experiencing some issues with schema evolution. I have a REALTIME table consuming from a Kafka topic with a defined schema. If I add a new column to the table, initially the column doesn't appear; after reloading the segments the column shows up in queries, but with `null` values. The only way I have managed to see data in the new column is restarting the server instances. Any idea what I have to do to get the data in the table after updating the schema? Note: following the documentation I added the property `pinot.server.instance.reload.consumingSegment=true` to my `pinot-server-config`, but the result is the same.
@mayanks: What version are you using? cc @jackie.jxt
@roberto: I’m using the version `chart=pinot-0.2.4`
@roberto: I have installed it using the helm chart
@roberto: To be honest I'm not sure if this chart version is the same as the version of the Pinot instance. I have tried to find the Pinot version but I can't find it in the controller.
@jackie.jxt: This is expected. The next consuming segment will start consuming the real values. See
@roberto: Yes!! I followed that guide, in fact here it is my current configuration:
@roberto: After updating the schema I ran a `/segments/{tableName}/reload`
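For completeness, the exact sequence I'm running (controller host and names hypothetical): ```
# push the updated schema containing the new column
curl -X PUT -H "Content-Type: application/json" -d @my-schema.json \
  "http://<controller>:9000/schemas/mySchema"

# then reload all segments of the table
curl -X POST "http://<controller>:9000/segments/myTable/reload"
```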
@roberto: but even after that, if I send new events I'm always getting null for the new columns when querying the table.
@roberto: my doubt now is the following: is this `chart=pinot-0.2.4` my Pinot version? If it is, I'm running a very old version….
@ssubrama: Values from kafka will be populated when the next consuming segment starts. Until then, it will be null values for all segments including the current consuming segment.
@roberto: Am I able to kill and start segments using the API?
@roberto: because if I have understood correctly, if I simply kill the consuming segment and start a new one, it should be solved
@ssubrama: That part went through some changes recently. @npawar can you help to see if this works?
@npawar: You could restart the servers, that will make the CONSUMING segments restart and pick up the new values. But don't kill the consuming segments (i.e. don't DELETE those segments), that is hard to recover from
@roberto: @npawar as I understand there is no way from the API to do it, right?
@roberto: For our use case we need to perform the schema update automatically and we need to be ready to ingest data after the schema update
@roberto: Restart the server is not an easy option for us.
@npawar: you can try the reset API. let me confirm if this will do what you want to achieve. if you have a test environment, could you try it out?
@npawar: this is the API, you can find it under Segment section in swagger ```@POST @Path("segments/{tableNameWithType}/{segmentName}/reset")```
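e.g. something like (controller host, table and segment names hypothetical): ```
curl -X POST "http://<controller>:9000/segments/myTable_REALTIME/myTable__0__0__20210810T0000Z/reset"
```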
@roberto: @npawar that approach works; after calling the reset we get the column filled, but we are not able to query the table during the reset, which is not good. We are using a table with one segment; would modifying that help us?
@npawar: you mean you will increase the replication? yes, that is a reasonable approach
@roberto: @npawar I have tried to increase the replication but the setting seems to be ignored: ```
"segmentsConfig": {
  "schemaName": "context",
  "timeColumnName": "timestamp",
  "replication": "2",
  "replicasPerPartition": "2"
},
```
@mayanks: Do you have enough servers with the required tenant tag available that can be used to increase replication?
@npawar: here’s a guide for increasing capacity:
@roberto: I’m running two servers and I have 2 tables, one with 4 segments and the other one with 1 segment and all of them are running on the same server
@npawar: okay. Follow that guide to make sure you 1) tag the new server using the same tenant as the previous server, and 2) run a rebalance
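roughly (controller host, instance, tenant and table names hypothetical): ```
# 1) tag the new server with the same server tenant as the existing ones
curl -X PUT "http://<controller>:9000/instances/Server_pinot-server-2_8098/updateTags?tags=DefaultTenant_REALTIME"

# 2) rebalance the table
curl -X POST "http://<controller>:9000/tables/myTable/rebalance?type=realtime"
```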
@roberto: I have created a new server and tried to rebalance getting this: ```
{
  "status": "NO_OP",
  "description": "Table is already balanced",
  "instanceAssignment": {
    "CONSUMING": {
      "instancePartitionsName": "context_CONSUMING",
      "partitionToInstancesMap": {
        "0_0": [
          "Server_pinot-server-1.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098",
          "Server_pinot-server-2.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098",
          "Server_pinot-server-0.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098"
        ]
      }
    }
  },
  "segmentAssignment": {
    "context__0__0__20210810T2104Z": {
      "Server_pinot-server-1.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098": "CONSUMING"
    }
  }
}
```
@roberto: but no luck getting more segments
@npawar: did you set the includeConsuming flag to true?
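i.e. something like (host and table name hypothetical): ```
curl -X POST "http://<controller>:9000/tables/myTable/rebalance?type=realtime&includeConsuming=true"
```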
@roberto: it seems to work but I haven't seen any change in the number of segments in the UI
@roberto: ```
{
  "status": "IN_PROGRESS",
  "description": "In progress, check controller logs for updates",
  "instanceAssignment": {
    "CONSUMING": {
      "instancePartitionsName": "context_CONSUMING",
      "partitionToInstancesMap": {
        "0_0": [
          "Server_pinot-server-1.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098",
          "Server_pinot-server-2.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098",
          "Server_pinot-server-0.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098"
        ]
      }
    }
  },
  "segmentAssignment": {
    "context__0__0__20210810T2104Z": {
      "Server_pinot-server-1.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098": "CONSUMING",
      "Server_pinot-server-2.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098": "CONSUMING"
    }
  }
}
```
@npawar: even after a couple of minutes?
@npawar: can you paste the idealState and externalView?
@roberto: externalView: ```
{
  "OFFLINE": null,
  "REALTIME": {
    "context__0__0__20210810T2236Z": {
      "Server_pinot-server-2.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098": "CONSUMING"
    }
  }
}
```
@roberto: IdealState ```
{
  "OFFLINE": null,
  "REALTIME": {
    "context__0__0__20210810T2236Z": {
      "Server_pinot-server-1.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098": "CONSUMING",
      "Server_pinot-server-2.pinot-server-headless.pinot-quickstart.svc.cluster.local_8098": "CONSUMING"
    }
  }
}
```
@roberto: I killed server pods and now I can see two instances for the segment
@npawar: hmm, that should've been automatic. Any chance you saved pinot-server-1's logs before you killed it?
@roberto: ohh I didn’t save it sorry!
@roberto: it seems that even with replication on the segment, when I call reset
@roberto: the queries are not available during the process
@npawar: hmm, my bad. I should've realized that the reset call will disable both replicas at once. There is no one-by-one mode there. Perhaps we need one.
@npawar: As for your issue, let me think if there’s another way to prevent downtime
@npawar: unfortunately, right now there’s no way without that downtime during reset. Created an issue for this:
@roberto: Thank you very much for your help on this @npawar
@sachin: @sachin has joined the channel
@mapshen: Hi, we are querying Pinot via Trino with dynamic tables and have recently encountered this error: ```
SELECT * FROM pinot.default."SELECT max(COL1) from table0";

Query 20210810_205808_00168_gj79s failed: class java.lang.String cannot be cast to class java.lang.Number (java.lang.String and java.lang.Number are in module java.base of loader 'bootstrap')
``` `COL1` can be of type INT, LONG, FLOAT, or DOUBLE here... not sure if anyone has seen this before?
@jackie.jxt: @elon.azoulay Can you please help take a look?
@elon.azoulay: Yep, this is a known issue and is resolved with
@mapshen: Oh thanks Elon! Good to know the aggregation pushdown is going to resolve this. Hope it will get merged soon!
@elon.azoulay: Seems like it's very close!
#pinot-dev
@tingchen:
@tingchen: @jackie.jxt I saw you made an earlier PR to work around this problem and mentioned that there are dependencies on the 0.9.11 reflections lib. Can you elaborate on it? What are the dependencies on that version? Xiang is testing reverting to 0.9.9 because there are other places in Pinot today using the reflections lib.
@jackie.jxt: @tingchen Can you please send me the PR link?
@tingchen:
@xiangfu0:
@jackie.jxt: You may generate a dependency tree and see which library depends on reflections. I didn't downgrade it because some other dependencies may require a higher version of reflections (not Pinot code)
@tingchen: Xiang is testing the reflections 0.9.9 version; we can check if that build has any issues during tests.
@xiangfu0: seems all ci passed
@g.kishore: where do we use this?
@g.kishore: to find annotations?
@tingchen: Yeah... from the stack trace, this happens when the server tries to set up the swagger UI. It is fetching the config.
@grace.walkuski: Hello! I'm using the Pinot client and am getting this exception ```
Caused by: java.lang.NumberFormatException: For input string: "20619889534"
	at java.lang.NumberFormatException.forInputString(Unknown Source) ~[?:?]
	at java.lang.Integer.parseInt(Unknown Source) ~[?:?]
	at java.lang.Integer.parseInt(Unknown Source) ~[?:?]
	at org.apache.pinot.client.AbstractResultSet.getInt(AbstractResultSet.java:58) ~[pinot-java-client-0.7.1.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at org.apache.pinot.client.PinotResultSet.getInt(PinotResultSet.java:263) ~[pinot-jdbc-client-0.7.1.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
``` and I believe it is because 20,619,889,534 is larger than the maximum integer: 2,147,483,647. Is there a solution on my side so I can get around this?
@mayanks: Could you expand the stack trace a bit? I am trying to see who is calling the `getInt`, and can that be replaced by `getLong`
@grace.walkuski: Yea! Here's some more of the stack trace. ```
java.lang.NumberFormatException: For input string: "20619889534"
	at java.lang.NumberFormatException.forInputString(Unknown Source) ~[?:?]
	at java.lang.Integer.parseInt(Unknown Source) ~[?:?]
	at java.lang.Integer.parseInt(Unknown Source) ~[?:?]
	at org.apache.pinot.client.AbstractResultSet.getInt(AbstractResultSet.java:58) ~[pinot-java-client-0.7.1.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at org.apache.pinot.client.PinotResultSet.getInt(PinotResultSet.java:263) ~[pinot-jdbc-client-0.7.1.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at org.jooq.tools.jdbc.DefaultResultSet.getInt(DefaultResultSet.java:134) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.CursorImpl$CursorResultSet.getInt(CursorImpl.java:773) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.DefaultBinding$DefaultIntegerBinding.get0(DefaultBinding.java:2440) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.DefaultBinding$DefaultIntegerBinding.get0(DefaultBinding.java:2412) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.DefaultBinding$AbstractBinding.get(DefaultBinding.java:824) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.CursorImpl$CursorIterator$CursorRecordInitialiser.setValue(CursorImpl.java:1725) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.CursorImpl$CursorIterator$CursorRecordInitialiser.operate(CursorImpl.java:1685) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.CursorImpl$CursorIterator$CursorRecordInitialiser.operate(CursorImpl.java:1650) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:130) ~[jooq-3.12.1.jar:?]
	at org.jooq.impl.CursorImpl$CursorIterator.fetchNext(CursorImpl.java:1614) ~[jooq-3.12.1.jar:?]
``` we are using jOOQ to build the queries, which is what is calling getInt. I will try and see if I can get jOOQ to call getLong instead…
@mayanks: I see. The Pinot response returns metadata, which has the data type in it. I am unsure how you are using jOOQ, but ideally we need to look at the data type in the result metadata and use the appropriate API. Alternatively, folks simply use String/Long/Double to avoid data type checks.
@grace.walkuski: Ok, this case is for a `count(field_name)`, which I feel like would always be an int. How can I make it be a Long?
@mayanks: Pinot always returns `long` for `count` queries.
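If jOOQ is the one calling `getInt`, one way to force it down the `getLong` path is to type the aggregate as `Long` yourself; a minimal sketch, untested against your setup (table and field names hypothetical): ```
import org.jooq.DSLContext;
import org.jooq.Field;
import org.jooq.impl.DSL;

public class CountAsLong {
    // Typing the aggregate as Long makes jOOQ bind the column via getLong()
    // instead of the Integer binding that calls getInt().
    static Long fetchCount(DSLContext ctx) {
        Field<Long> cnt = DSL.field("count(field_name)", Long.class);
        return ctx.select(cnt)
                  .from(DSL.table("myTable"))
                  .fetchOne(cnt);
    }
}
```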
@grace.walkuski: Oh! Ok great
#getting-started
@kangren.chia: hello, I'm just curious how queries run faster when you call the exact same query the second time round. How does caching work in Pinot? If it matters, I'm using the inverted index, with no partition pruning (only 1 broker running). The broker log looks something like this: ```
Processed requestId=34,table=sorted_events_OFFLINE,segments(queried/processed/matched/consuming)=198/198/198/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=426,resSerMs=0,totalTimeMs=426,minConsumingFreshnessMs=-1,broker=Broker_172.26.0.4_8099,numDocsScanned=259467,scanInFilter=619119085,scanPostFilter=259467,sched=fcfs
Slow query: request handler processing time: 427, send response latency: 3, total time to handle request: 430
Processed requestId=35,table=events_OFFLINE,segments(queried/processed/matched/consuming)=198/198/118/-1,schedulerWaitMs=0,reqDeserMs=5,totalExecMs=221,resSerMs=0,totalTimeMs=226,minConsumingFreshnessMs=-1,broker=Broker_172.26.0.4_8099,numDocsScanned=657,scanInFilter=346815,scanPostFilter=657,sched=fcfs
``` Also, I'm wondering what prompts `"Slow query: …"` to show up in the logs? Does this mean Pinot is suggesting that some optimization is possible to speed up my queries?
@mayanks: It is not caching, but cold start. Pinot mmaps index files, so the OS needs to pull in pages the very first time.
@kangren.chia: understood. thanks!
@mayanks: Right now the default threshold for printing `slow query` is latency `> 100ms`
@kangren.chia: so it's a hardcoded value. got it! also kind of mind-blowing that you guys chose 100ms as the hardcoded value :joy:
@mayanks: yeah, 100ms is `slow` for Pinot :wink:
