#general
@saulo.sobreiro: @saulo.sobreiro has joined the channel
@serhiish: @serhiish has joined the channel
@utkarsh.saxena: @utkarsh.saxena has joined the channel
@mapshen: Is there a way or an API to get the latest offset consumed for a real-time table/segment?
@npawar: Under the Swagger APIs, look for the consumingSegmentsInfo API
@mapshen: Ah sweet. The intention is to monitor if the consuming segment offset is in sync with the kafka partition offset. Does Pinot expose such a metric already?
@mapshen: @npawar alternatively, what would be your suggestion on monitoring this?
@npawar: this API is the only way to monitor the exact offset consumed from the Pinot side. One metric that is useful is LLC_PARTITION_CONSUMING. This is a gauge which will be 0 if the partition is not consuming for any reason. Monitoring it with a rule such as “if 0 for more than 10 minutes, alert” would be good
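To make the offset-lag check concrete, here is a minimal Python sketch of the comparison, assuming you have already pulled the currently consumed offsets out of the consumingSegmentsInfo response and the partition end offsets from Kafka. The function name and both inputs are illustrative; the real response shape varies by Pinot version, so check your cluster's Swagger output.

```python
# Sketch: compare offsets reported by Pinot's consumingSegmentsInfo API
# against the latest Kafka end offsets to estimate consumer lag.
# Both inputs are hypothetical {partition: offset} maps.

def partition_lag(pinot_offsets, kafka_end_offsets):
    """Return {partition: lag}. A partition missing from the Kafka
    map is treated as fully caught up (lag 0)."""
    lags = {}
    for partition, consumed in pinot_offsets.items():
        latest = kafka_end_offsets.get(partition, consumed)
        lags[partition] = max(latest - consumed, 0)
    return lags

# Example with made-up numbers:
pinot_offsets = {0: 1200, 1: 950}   # extracted from consumingSegmentsInfo
kafka_offsets = {0: 1250, 1: 950}   # fetched from Kafka end offsets
print(partition_lag(pinot_offsets, kafka_offsets))  # {0: 50, 1: 0}
```

An alerting rule could then fire when any partition's lag stays above a threshold for several minutes, complementing the LLC_PARTITION_CONSUMING gauge.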
@npawar: @mayanks just checking, we don't have any other ways to monitor lag between the Kafka latest offset and the consumer offset, right?
@mayanks: @npawar @mapshen yes, I am not aware of any other existing ways to monitor.
@priyenpatel2014: @priyenpatel2014 has joined the channel
@lars-kristian_svenoy: Hey guys. Regarding
@mayanks: Hey @lars-kristian_svenoy what version of jvm are you using and is your Pinot accessible from internet?
@mayanks: ```JDK versions greater than 6u211, 7u201, 8u191, and 11.0.1 are not affected by the LDAP attack vector. In these versions com.sun.jndi.ldap.object.trustURLCodebase is set to false meaning JNDI cannot load a remote codebase using LDAP.```
@mayanks: In the interim, you can set `formatMsgNoLookups=true` as a workaround.
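For reference, the full system property for this Log4j 2 workaround (effective on Log4j 2.10 and later) is `log4j2.formatMsgNoLookups`. A rough sketch, assuming a typical deployment where JVM options are passed via an environment variable — the variable name is illustrative and depends on how your Pinot processes are launched:

```shell
# Log4Shell (CVE-2021-44228) workaround for Log4j >= 2.10:
# disable message lookups via a JVM system property.
export JAVA_OPTS="${JAVA_OPTS} -Dlog4j2.formatMsgNoLookups=true"
```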
@lars-kristian_svenoy: I am using the jdk11 image of pinot, is it built with > 11.0.1?
@j.vinodpatel: @j.vinodpatel has joined the channel
#random
@saulo.sobreiro: @saulo.sobreiro has joined the channel
@serhiish: @serhiish has joined the channel
@utkarsh.saxena: @utkarsh.saxena has joined the channel
@priyenpatel2014: @priyenpatel2014 has joined the channel
@j.vinodpatel: @j.vinodpatel has joined the channel
#troubleshooting
@saulo.sobreiro: @saulo.sobreiro has joined the channel
@tanmay.movva: Hello. I am trying out the Pinot connector in Trino and I am facing an error on a simple select query like ```select * from pinot.default.table limit 10``` This is the stack trace of the error. Can anyone please help? Did anyone face a similar issue before? ```java.lang.NullPointerException: null value in entry: Server_server-2.server-headless.pinot.svc.cluster.local_8098=null
	at com.google.common.collect.CollectPreconditions.checkEntryNotNull(CollectPreconditions.java:32)
	at com.google.common.collect.SingletonImmutableBiMap.<init>(SingletonImmutableBiMap.java:42)
	at com.google.common.collect.ImmutableBiMap.of(ImmutableBiMap.java:72)
	at com.google.common.collect.ImmutableMap.of(ImmutableMap.java:119)
	at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:454)
	at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:433)
	at io.trino.plugin.pinot.PinotSegmentPageSource.queryPinot(PinotSegmentPageSource.java:221)
	at io.trino.plugin.pinot.PinotSegmentPageSource.fetchPinotData(PinotSegmentPageSource.java:182)
	at io.trino.plugin.pinot.PinotSegmentPageSource.getNextPage(PinotSegmentPageSource.java:150)
	at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:311)
	at io.trino.operator.Driver.processInternal(Driver.java:387)
	at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
	at io.trino.operator.Driver.tryWithLock(Driver.java:683)
	at io.trino.operator.Driver.processFor(Driver.java:284)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.trino.$gen.Trino_362____20211126_004329_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)```
@tanmay.movva: Our Pinot is deployed on Kubernetes and every component has a headless service and a `ClusterIP` service. Also, I am able to query that table directly from the Pinot UI, but the query doesn’t work in Trino.
@mayanks: What version of Pinot and Trino?
@mayanks: Some possibilities - a) Setup issue b) Connectivity issue c) Version mismatch
@tanmay.movva: Trino - 362, Pinot - 0.9.0. Connectivity between the two services is fine. I am able to run metadata queries such as ```show tables from pinot.default```
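For reference, a minimal Trino catalog file for the Pinot connector (e.g. `etc/catalog/pinot.properties`) looks roughly like the sketch below; the controller hostname and port are placeholders for this Kubernetes setup:

```properties
connector.name=pinot
pinot.controller-urls=pinot-controller.pinot.svc.cluster.local:9000
```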
@mayanks: I found similar thing discussed a while back:
@mayanks: Search for `PinotSegmentPageSource` in that link
@mayanks: cc @elon.azoulay
@elon.azoulay: Can you try trino 365? It is compatible with pinot 0.8.0
@tanmay.movva: Sure. Will upgrade and let you know.
@elon.azoulay: We are working on updating to be compatible with pinot 0.9.0
@tanmay.movva: So we are on 0.9.0 for Pinot. Will it work with Trino 365 now? Or would I have to downgrade to Pinot 0.8.0?
@elon.azoulay: It should - it looks like the APIs that the Trino connector uses are similar between 0.8.0 and 0.9.0
@elon.azoulay: I think pinot 0.9.0 has some really great features, no need to downgrade.
@tanmay.movva: Thanks, this is working.
@kangren.chia: hi just checking again, does anybody know how i can bypass the 1 million limit on rows returned by the broker?
@serhiish: @serhiish has joined the channel
@alihaydar.atil: Hello everyone, I wonder whether setting the maxLength property of STRING data types in the schema to high values would cause extra memory allocation or performance degradation?
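For context, `maxLength` is set per field in the schema's field specs; a sketch with an assumed field name is shown below. The default is 512, and string values longer than `maxLength` are truncated at ingestion:

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "description",
      "dataType": "STRING",
      "maxLength": 2048
    }
  ]
}
```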
@utkarsh.saxena: @utkarsh.saxena has joined the channel
@falexvr: Good morning guys. We set up a Pinot cluster a while ago, when 0.6.0 was the latest version, and recently spawned a new cluster with version 0.8.0 to test it before using it in production. Before we start streaming data into this new cluster, I’d like to know whether having two clusters with low-level Kafka consumers streaming data from the same topic would be an issue. I ask because the current cluster doesn’t rely on Kafka consumer groups to keep track of the offsets; on the other hand, in our Kafka provider I see an empty-named consumer group consuming data from the topics, and it seems that one belongs to Pinot
@xiangfu0: I assume these two clusters are separate (on a new ZK cluster, or the same ZK but a different Helix cluster name). Then you are fine; you can create multiple tables consuming the same Kafka topic as well. Pinot internally tracks offsets in ZK on a per-table basis.
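As an illustration, each cluster can define its own real-time table whose `streamConfigs` point at the same topic; a rough fragment is shown below. The topic and broker values are examples, and the plugin class names reflect recent Pinot releases — verify them against your version:

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "shared-topic",
  "stream.kafka.broker.list": "kafka:9092",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
}
```

Because each table keeps its own offsets in its cluster's ZK, the two clusters consume independently and do not interfere with each other.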
@falexvr: Yep, different zk clusters as well
@falexvr: Great! Thanks
@priyenpatel2014: @priyenpatel2014 has joined the channel
@jeff.moszuti: Hello, I am kicking the tyres of Pinot (v 0.9.0) by doing the following tutorial
@g.kishore: Time boundary ..
@g.kishore: The latest days' data will be pulled from real-time, not offline
@g.kishore: The idea is to give enough time for the batch jobs to push to offline tables, and to avoid inconsistency during the push
@jeff.moszuti: At the moment I haven't pushed any real-time data yet; I just created the realtime table config and schema. Let's say no real-time data comes in for some time. At which point will selecting from transcript return 4 rows? Are there any settings that I can change to get a better understanding of how the time boundary works?
@mayanks: The expectation from a hybrid table is that data is flowing in both, and that there’s data overlap.
@mayanks: If you are only interested in offline component, you can query offline table explicitly by appending suffix _OFFLINE to the table name in the query
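For example, for the tutorial's `transcript` hybrid table, a query against the offline part only would look like:

```sql
-- Queries only the offline segments, bypassing the hybrid time boundary
SELECT * FROM transcript_OFFLINE LIMIT 10;
```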
@jeff.moszuti: Thanks for the replies Kishore and Mayank. I've read the documentation on time boundaries but I'm still a bit confused about how a hybrid table is supposed to be used. In the test data of the tutorial, the records for offline and real-time are unique, but there are a few records which exist on both Oct 24 and Oct 28. The hybrid table shows the records as in the diagram below - records from the real-time table on the 24th are not visible, and the records from the offline table on the 30th are also not visible. I understand why this has been done. Given that the records are unique, will the hybrid table show all records at some point later in time (reconciled?). I'd like to be able to count the number of student transcripts to get an accurate total.
@j.vinodpatel: @j.vinodpatel has joined the channel
@weixiang.sun: Currently a hybrid table combines an offline table and a realtime table. Is it possible to combine an offline table and an upsert table?
@mayanks: Upsert is limited to realtime only. You can have a hybrid table with an upsert-enabled realtime table. However, upserts will not apply to offline
@weixiang.sun: Thanks @mayanks! When I create the hybrid table out of an offline table and an upsert table, should I just follow the same process?
@mayanks: Yes
@mayanks: The upsert table is nothing but a real-time table with upsert enabled
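A sketch of what that looks like in a real-time table config is below; the table name is illustrative, and note that upsert also requires `primaryKeyColumns` to be defined in the schema:

```json
{
  "tableName": "events",
  "tableType": "REALTIME",
  "upsertConfig": {
    "mode": "FULL"
  }
}
```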
@mayanks: Also, you want to ensure your app is functionally OK with the fact that there won’t be any upserts applied to time ranges that are in the offline table
@mayanks: Otherwise you will have incorrect results
@weixiang.sun: @mayanks Thanks!
#pinot-dev
@serhiish: @serhiish has joined the channel
@richard892: @kharekartik can you merge this please?
@navina: @navina has joined the channel
#getting-started
@navina: @navina has joined the channel
#pinot-docsrus
@jeff.moszuti: @jeff.moszuti has joined the channel