#general


@kmvb.tau: @kmvb.tau has joined the channel
@joshhighley: ```If there are multiple controllers, Pinot expects that all of them are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or ADLS``` I can't find more info about this. Would any mountable file system work? S3, for example?
  @mayanks: Yes. NFS would work. So would any deep store
  @mayanks: We have implementations for NFS, ADLS, S3 and GCS
  @g.kishore: If you have S3, you don’t need NFS
  @mayanks: Correct. S3/ADLS/GCS as deep stores can be shared across controllers without any further need for NFS on top of that.
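A minimal sketch of what a shared S3 deep store can look like in the controller config, per the thread above. The bucket, region, and paths are placeholders; the key names follow the Pinot S3 deep-storage docs, but verify them against your version:
```
controller.data.dir=s3://my-pinot-bucket/pinot-data/segments
controller.local.temp.dir=/tmp/pinot-controller-tmp
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
Every controller points at the same `controller.data.dir`, which is what gives them the common view of segments without NFS.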
  @fx19880617:

#random


@kmvb.tau: @kmvb.tau has joined the channel

#troubleshooting


@kmvb.tau: @kmvb.tau has joined the channel
@chxing:
@chxing:
@mayanks: What's the data type for `webConferenceId`?
  @chxing: LONG
  @mayanks: Is this a hybrid table?
  @chxing:
  @chxing: It should be a realtime table
  @mayanks: You have 90 days retention, so my guess is there's an offline component. But that is ok
  @mayanks: On the face of it, this seems like a bug
  @mayanks: Trying to understand what might be causing it
  @chxing: So it should be a bug?
  @mayanks: Can you try with `webSiteId`?
  @chxing: STRING type is ok, let me try again
  @jackie.jxt: Can you please paste the entire table config?
  @chxing: schema:
```
{
  "schemaName": "realtime_sjc_wmequality_report",
  "dimensionFieldSpecs": [
    { "name": "webexSiteName", "dataType": "STRING" },
    { "name": "webexConferenceId", "dataType": "LONG" },
    { "name": "webexSiteId", "dataType": "LONG" },
    { "name": "correlationId", "dataType": "STRING" },
    { "name": "metadataOsType", "dataType": "STRING" },
    { "name": "metadataOsVersion", "dataType": "STRING" },
    { "name": "metadataBrowserType", "dataType": "STRING" },
    { "name": "metadataClientType", "dataType": "STRING" },
    { "name": "metadataClientVersion", "dataType": "STRING" },
    { "name": "metadataHardwareType", "dataType": "STRING" },
    { "name": "metadataNetworkType", "dataType": "STRING" },
    { "name": "audioMainReportTransportType", "dataType": "STRING" },
    { "name": "videoMainReportTransportType", "dataType": "STRING" },
    { "name": "day", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "systemAverageCPU", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "processAverageCPU", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "osBitWidth", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "cpuBitWidth", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "audioMainReportRxE2eLostPercent", "dataType": "FLOAT", "defaultNullValue": 0 },
    { "name": "audioMainReportRxE2eJitter", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "audioMainReportTxHbhLostPercent", "dataType": "FLOAT", "defaultNullValue": 0 },
    { "name": "audioMainReportTxHbhJitter", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "audioMainReportRxHbhLostPercent", "dataType": "FLOAT", "defaultNullValue": 0 },
    { "name": "audioMainReportRoundTripTime", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "videoMainReportRxE2eLostPercent", "dataType": "FLOAT", "defaultNullValue": 0 },
    { "name": "videoMainReportRxE2eJitter", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "videoMainReportTxHbhLostPercent", "dataType": "FLOAT", "defaultNullValue": 0 },
    { "name": "videoMainReportTxHbhJitter", "dataType": "LONG", "defaultNullValue": 0 },
    { "name": "videoMainReportRxHbhLostPercent", "dataType": "FLOAT", "defaultNullValue": 0 },
    { "name": "videoMainReportRoundTripTime", "dataType": "LONG", "defaultNullValue": 0 }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamp",
      "dataType": "STRING",
      "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSS'Z'",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```
  @chxing: table:
```
{
  "tableName": "realtime_sjc_wmequality_report",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "DAYS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "realtime_sjc_wmequality_report",
    "replication": "2",
    "replicasPerPartition": "2"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "LowLevel",
      "stream.kafka.topic.name": "sj1_mqa_telemetry_wmequality_report",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "10.241.89.130:9092",
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.threshold.size": "300M",
      "stream.kafka.consumer.prop.auto.offset.reset": "largest"
    },
    "invertedIndexColumns": [
      "webexSiteName", "webexConferenceId", "webexSiteId", "correlationId",
      "metadataOsType", "metadataBrowserType", "metadataClientType",
      "metadataHardwareType", "metadataNetworkType", "audioMainReportTransportType",
      "videoMainReportTransportType", "day"
    ],
    "sortedColumn": [ "audioMainReportRxE2eLostPercent", "audioMainReportRxE2eJitter" ]
  },
  "metadata": { "customConfigs": {} }
}
```
  @jackie.jxt: Do you have time for a quick zoom? Need to try more queries to identify the issue
  @chxing: webSiteId is ok, also LONG type
  @jackie.jxt: How about `select * from table where websiteId = 8049967 limit 1000`? Want to see if the missing conferenceId is returned here
  @chxing: ok
  @mayanks: Yeah, that is what I also meant earlier
  @chxing:
  @chxing: No response?
  @chxing:
  @chxing: The status of the segments seems normal
  @jackie.jxt: Sorry, should be `webexSiteId = 8049967`
  @chxing: ok
  @chxing:
  @chxing: Got a response
  @jackie.jxt: Let's try `select * from realtime_sjc_wmequality_report where webexConferenceId = '189852985506937900' limit 1000` first to rule out the possibility of compilation problem
  @chxing:
  @chxing: No response
  @jackie.jxt: @chxing Can you join this zoom? We can try some queries together to track down the problem
  @chxing: Wait a minute, I need to ask my manager
  @jackie.jxt: Sure
  @chxing: ok
  @chxing: Joined
@chxing: Hi all, I have an issue. I ran `select *` on the table to make sure this item is in the db
@chxing: But I can't get it with `select * from realtime_sjc_wmequality_report where webexConferenceId=189852985506937900 limit 1000`
  @fx19880617: What's the error?
  @fx19880617: Use single quote for big int?
  @chxing: webexConferenceId is LONG type, I just used `where webexConferenceId=189852985506937900`
  @chxing: but no response
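Per the quoting suggestion above, a quick check is to run both forms of the literal and compare; the table and value are the ones from this thread:
```
-- Unquoted: the parser treats this as a numeric literal
select * from realtime_sjc_wmequality_report
where webexConferenceId = 189852985506937900 limit 1000;

-- Quoted: the value is passed as a string and converted against the LONG column
select * from realtime_sjc_wmequality_report
where webexConferenceId = '189852985506937900' limit 1000;
```
If only the quoted form returns rows, the unquoted literal is likely being mangled during parsing (e.g. read as a floating-point value, which loses precision at 18+ digits).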

#pinot-dev


@khushbu.agarwal: Hi, when a server is in a dead state (update of deployment in Kubernetes), Pinot doesn't rebalance the segments among existing/new servers. Even manual rebalance is not helping (result: already balanced). Tried deleting the instance; it fails with the error: "server is in ideal state of xyz table". How do I resolve this?
@oren: @oren has joined the channel
@npawar: You have to untag that server then rebalance:
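A sketch of that untag-then-rebalance flow via the controller REST API; the host, instance name, and table name are placeholders, and the exact endpoints may vary by Pinot version:
```
# Clear the server's tags so the table's tenant no longer resolves to it
curl -X PUT "http://localhost:9000/instances/Server_pinot-server-0_8098/updateTags?tags="

# Then rebalance the table onto the remaining tagged servers
curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=REALTIME"
```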
  @khushbu.agarwal: Thanks @npawar, this fixed the issue. Though I'm wondering why it didn't resolve automatically?
  @npawar: Rebalance is not designed to adjust automatically
  @khushbu.agarwal: On deployment update?
  @khushbu.agarwal: What about when a server is in dead state for a long period?
  @npawar: If the server is still in ZK and tagged with the same tag used by the table, then it will continue to be used. @fx19880617 is there a way to achieve automatic removal of the server, or a rebalance, when using a k8s deployment? I would guess not?
  @fx19880617: No, Pinot cannot figure out whether a server has been dead for a long time or should be recycled; it has to be human intervention. However, users can build a script or tooling to periodically check and perform the action
  @fx19880617: For k8s, usually the server will be restarted and come back to normal
  @khushbu.agarwal: This is the case where the pod IP changes after a deployment update
  @fx19880617: but service name doesn’t change right?
  @fx19880617: k8s should handle the dns
@npawar: @khushbu.agarwal ^^
@g.kishore: Let’s repost this in troubleshooting..
@amrish.k.lal: Question about the JSON functions described in . Are the following functions supported in SQL, or does the documentation need to be modified?
• TOJSONMAPSTR
• JSONFORMAT
• JSONPATHLONG
• JSONPATHDOUBLE
• JSONPATHSTRING
• JSONPATHARRAY
They seem to take Java Objects as inputs?
  @jackie.jxt: Seems it only works for data ingestion but not in SQL. Can you please submit an issue about this?
  @ken: @amrish.k.lal thanks for reporting this! Also note there's a #pinot-docs Slack channel that's good for doc-related issues like this.
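For reference, a sketch of the ingestion-time usage that does work per the reply above; the `payload` source column and JSON paths here are made up:
```
"ingestionConfig": {
  "transformConfigs": [
    { "columnName": "conferenceId", "transformFunction": "JSONPATHLONG(payload, '$.conference.id')" },
    { "columnName": "clientScore", "transformFunction": "JSONPATHDOUBLE(payload, '$.stats.score')" }
  ]
}
```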

#pinot-rack-awareness


@jaydesai.jd: Hey @g.kishore @rkanumul Thanks for reviewing the doc. Can we sign off on it today?
@dlavoie: Design is LGTM! Do we plan on providing an out-of-the-box Pinot property provider in addition to the Azure-specific provider?
  @rkanumul: Not part of the plan atm. But I thought we might need a Noop impl, probably. With your new suggestion, the out-of-the-box property-based option will just work, so leaning towards it
@g.kishore: we need a better name for plugin folder
@g.kishore: everything else looks good to me
@g.kishore: what is a good term to describe where Pinot is deployed
@g.kishore: on prem, k8s, gcp, azure, aws
@dlavoie: I like the zone awareness term
@g.kishore: i am looking for a more generic term
@g.kishore: zoneawareness is a subset of it
@dlavoie: Zone feels generic to me. Can be a rack, a datacenter room, a region, a cloud provider or a continent
@g.kishore: future proof it a bit
@g.kishore: this will be a pinot-plugin
@g.kishore: pinot-plugins/pinot-zone/pinot-azure ?
@g.kishore: i dont think that makes sense
@dlavoie: For the plugin name, I agree
@dlavoie: zone-discovery-provider ?
@dlavoie: Yeah, took a step back and actually I have design comments; I'll share them in the doc.
@jaydesai.jd: @dlavoie Updated the document with your suggestion. Can u review it again. Thanks :slightly_smiling_face: cc @g.kishore
@dlavoie: Looks good :slightly_smiling_face:
  @jaydesai.jd: Can u sign off at the bottom of the Document. I have added your name to the reviewers list. Thanks :slightly_smiling_face:

#minion-improvements


@laxman: @laxman has joined the channel
@laxman: @laxman set the channel description: Minion improvements
@g.kishore: @g.kishore has joined the channel
@npawar: @npawar has joined the channel
@jackie.jxt: @jackie.jxt has joined the channel
@fx19880617: @fx19880617 has joined the channel
@laxman: Hi team, I'm Laxman from Traceable. We use Pinot in our system, and we want to collaborate on and contribute to the Pinot Minion project. Our major product requirement is "data deletion for specific filter criteria"
@laxman: Created this channel and added you all, as suggested by Kishore.
@jackie.jxt: @fx19880617 How's the progress on the minion pluggable tasks? This can be modeled as a purge task
@fx19880617: it’s there
@fx19880617: you can add your own minion tasks in parallel to this minion-builtin-tasks:
@fx19880617: just follow examples of existing built-in tasks and create your shaded jars
@jackie.jxt: In order to use the existing `PurgeTaskExecutor`, the `RecordPurgerFactory` and `RecordModifierFactory` need to be registered into the `MinionContext`, which cannot be done via config yet
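A minimal Java sketch of that registration, assuming it runs in custom minion startup code before tasks execute; the purge predicate and the `deleted` flag column are made up, and class/package names (taken from the discussion) may differ by version:
```java
import org.apache.pinot.minion.MinionContext;

// Sketch: wire a RecordPurgerFactory into the singleton MinionContext.
// Since this isn't exposed via config yet, it has to happen in custom
// startup code. The factory returns a RecordPurger per raw table name;
// here every row with a hypothetical "deleted" flag set is purged.
MinionContext.getInstance().setRecordPurgerFactory(
    rawTableName -> row -> Boolean.TRUE.equals(row.getValue("deleted")));
```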
@fx19880617: you can also follow this PR to see what I touched in the pinot-distribution/assemble.xml file:
@laxman: I see a lot of work was done in the release. I'm trying to catch up by going through the release notes.
@laxman: Do we have any epic/parent jira where all this Minion work is tracked?