#general


@hasancan.volaka: @hasancan.volaka has joined the channel
@mohanpandiyan.ms: Hi folks, I am looking into the Pinot JDBC connector. It looks like the access control is ignored by the driver?
  @g.kishore: that might need some enhancement to the driver
  @mohanpandiyan.ms: thanks for confirmation, let me try to push a PR.
  @g.kishore: thanks
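For context, a minimal sketch of how one would connect through the Pinot JDBC driver today; the controller address, credentials, and table are placeholders, and per the report above the user/password arguments may currently be ignored rather than enforced:
```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PinotJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Register the Pinot JDBC driver (from the pinot-jdbc-client artifact).
    Class.forName("org.apache.pinot.client.PinotDriver");

    // Placeholder controller address; user/password are what the access-control discussion is about.
    String url = "jdbc:pinot://localhost:9000";
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM baseballStats")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```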
@pabraham.usa: Hello, I managed to deploy Pinot to one of our AWS production environments. It's been up for a few days and all looks well so far. Thanks to everyone in this group for their support, especially @g.kishore, @fx19880617, @mayanks, @npawar, @ssubrama, @dlavoie, @jackie.jxt and many others who replied to my queries. Also special thanks to @steotia, who helped me extensively to get the Text Index up and running, which is one of the main features I am using. There were a few issues initially; however, Sidd was willing to jump on Zoom and helped me get all of them resolved. Thanks again!
  @mayanks: Awesome :tada:

#random


@hasancan.volaka: @hasancan.volaka has joined the channel

#feat-text-search


@kuantian.zhang01: @kuantian.zhang01 has joined the channel
@selvakumarcts: @selvakumarcts has joined the channel
@btripathy: @btripathy has joined the channel

#feat-presto-connector


@kuantian.zhang01: @kuantian.zhang01 has joined the channel
@selvakumarcts: @selvakumarcts has joined the channel
@btripathy: @btripathy has joined the channel

#troubleshooting


@laxman: We want to try the Pinot 0.7.1 release, and we are using the artifacts from jitpack. The following shaded artifact, which we depend on, is not available for the 0.7.1 release: ```wget -qO temp.zip ``` However, I can see the shaded artifact is available under Maven Central here. Any suggestions on how to proceed? @g.kishore @fx19880617
  @laxman: Info: Shaded artifacts were available for old pinot releases in jitpack
  @fx19880617: Maven Central is used to host all Apache libs
  @fx19880617: we don’t publish to jitpack. I don’t know how the libs landed there
  @laxman: Got the root cause
  @laxman:
  @laxman: Build failed on jitpack
  @fx19880617: got it
  @fx19880617: then I need to patch the release
  @fx19880617: it’s an issue introduced recently
  @fx19880617: by npm
  @laxman: But then how did the 0.7.1 build pass and get uploaded to Maven Central?
  @fx19880617: yes, that was because the npm update that causes the issue happened after our Apache release :rolling_on_the_floor_laughing:
  @laxman: okay. So the npm version is not locked in our pinot build?
  @laxman: How do we fix this?
  @fx19880617: I cherry-picked the fix and pushed a new tag: release-0.7.1-ui-fix
  @fx19880617: once it got picked
  @fx19880617: you should see it from jitpack
  @laxman: cool. thanks a lot @fx19880617
  @laxman:
  @laxman: Jitpack started building this
  @fx19880617: ok.
  @fx19880617: does jitpack build branch ?
  @laxman:
  @fx19880617: seems no?
  @fx19880617: then I should push a new branch instead of a tag
  @laxman: I am also not fully aware how jitpack works. But I see your npm fix is picked up and it started building
  @laxman:
  @laxman: Build log :point_up_2:
  @fx19880617: ok
  @fx19880617: then will let it build
  @fx19880617: I could delete the tag later on :stuck_out_tongue:
  @laxman: Build complete. Now the distros are available in jitpack too
@jmeyer: Hello :slightly_smiling_face: I've evolved a realtime table schema (renaming a column) and am seeing the following error when querying ```[ { "errorCode": 500, "message": "MergeResponseError:\nData schema mismatch between merged block: [eventTimeString(STRING),communityId(STRING),eventType(STRING),ibcustomer(STRING),newsId(STRING),timeString(STRING),userId(STRING)] and block to merge: [eventTimeString(STRING),communityId(STRING),ibcustomer(STRING),newsId(STRING),timeString(STRING),type(STRING),userId(STRING)], drop block to merge" } ]``` I'm aware of the `pinot.server.instance.reload.consumingSegment` setting - can you confirm that setting it to `true` will solve this problem? If so, maybe the error message should contain this bit of information, or a reference to the docs?
  @fx19880617: I think Pinot only supports backward-compatible schema changes, which means deleting a column is not supported.
  @fx19880617: you can keep the old column; it will be filled with null values
  @jmeyer: Ah, that makes sense. Maybe we should prevent updating the schema in such a way then? (fail early)
  @fx19880617: agreed
  @fx19880617: We are adding this feature to avoid certain updates
  @jmeyer: Oh great. Mind sharing the issue / PR regarding this feature?
  @fx19880617: also discussing the support for schema versioning
  @fx19880617:
  @jmeyer: Thanks @fx19880617 :slightly_smiling_face:
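As a rough illustration of the workaround @fx19880617 describes above (keeping the old column so previously consumed segments still merge), the old name can simply stay in the schema next to the new one; the schema name, column names, and `defaultNullValue` below are made up for the example:
```
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    { "name": "type",      "dataType": "STRING", "defaultNullValue": "null" },
    { "name": "eventType", "dataType": "STRING" }
  ]
}
```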
@hasancan.volaka: @hasancan.volaka has joined the channel
@jmeyer: I've got 2 questions regarding realtime ingestion filtering:
• Is it possible / recommended to use a wildcard (ex: keep every `events.something.*`) ?
• Is it possible to filter on a column not part of the schema (ex: when filtering on `event.this_event_type.only`) ?
Thanks :smile:
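Not an authoritative answer, but for reference: record-level filtering in the table config is written as a `filterFunction` over the raw ingested record, records for which the function returns `true` are dropped, and the referenced field does not have to be a schema column. A rough sketch (the column name `eventType` is hypothetical), keeping only records whose type starts with `events.something.`:
```
"ingestionConfig": {
  "filterConfig": {
    "filterFunction": "Groovy({eventType == null || !eventType.startsWith('events.something.')}, eventType)"
  }
}
```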
@phuchdh: I'm using a Spark job with jobType: SegmentCreationAndUriPush. It seems there is a bug in the `copy` function?
  @phuchdh: Config.yaml
```
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
  # name: execution framework name
  name: 'spark'
  # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  # extraConfigs: extra configs for execution framework.
  extraConfigs:
    stagingDir: gs://{bucket_name}/tmp

# jobType: Pinot ingestion job type.
# Supported job types are:
#   'SegmentCreation'
#   'SegmentTarPush'
#   'SegmentUriPush'
#   'SegmentCreationAndTarPush'
#   'SegmentCreationAndUriPush'
jobType: SegmentCreationAndUriPush

# inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
inputDirURI: 'gs://{bucket_name}/rule_logs'

# includeFileNamePattern: include file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will include all the avro files under inputDirURI recursively.
includeFileNamePattern: 'glob:**/*.avro'

# excludeFileNamePattern: exclude file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will exclude all the avro files under inputDirURI recursively.
# _excludeFileNamePattern: ''

# outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
outputDirURI: 'gs://{bucket_name}/data'

# overwriteOutput: Overwrite output segments if existed.
overwriteOutput: true

pinotFSSpecs:
  - # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: gs
    className: org.apache.pinot.plugin.filesystem.GcsPinotFS
    configs:
      'projectId': 'xxxx'
      'gcpKey': 'xxx.json'

# recordReaderSpec: defines all record reader
recordReaderSpec:
  # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
  dataFormat: 'avro'
  # org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
  # org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  # org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
  # org.apache.pinot.plugin.inputformat.json.JSONRecordReader
  # org.apache.pinot.plugin.inputformat.orc.ORCRecordReader
  # org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'

# tableSpec: defines table name and where to fetch corresponding table config and table schema.
tableSpec:
  # tableName: Table name
  tableName: 'RuleLogsUAT'
  # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
  schemaURI: ''
  # Note that the API to read Pinot table config directly from pinot controller contains a JSON wrapper.
  # The real table config is the object under the field 'OFFLINE'.
  tableConfigURI: ''

# segmentNameGeneratorSpec: defines how to init a SegmentNameGenerator.
segmentNameGeneratorSpec:
  # type: Current supported type is 'simple' and 'normalizedDate'.
  type: normalizedDate
  # configs: Configs to init SegmentNameGenerator.
  configs:
    segment.name.prefix: 'rule_logs_uat'
    exclude.sequence.id: true

# pinotClusterSpecs: defines the Pinot Cluster Access Point.
pinotClusterSpecs:
  - # controllerURI: used to fetch table/schema information and data push.
    controllerURI: ''

# pushJobSpec: defines segment push job related configuration.
pushJobSpec:
  # pushParallelism: push job parallelism, default is 1.
  pushParallelism: 2
  # pushAttempts: number of attempts for push job, default is 1, which means no retry.
  pushAttempts: 2
  # pushRetryIntervalMillis: retry wait Ms, default to 1 second.
  pushRetryIntervalMillis: 1000
```
  @ken: Without seeing the actual `` path, or an obfuscated version of the same, it’s hard to know why the `normalizeToDirectoryUri` method thinks you have a relative path in your absolute URI.
  @fx19880617: my feeling is that the bucket name got dropped from the processing?
  @ken: Hmm, maybe it’s a Spark job issue that’s similar to the issue I fixed for Hadoop, with path normalization.
@jmeyer: Hello, seems I'm encountering a small bug with the Pinot UI.
Context:
• I've got 2 tables with 2 separate schemas
• I'm on the `Query Console` tab
Problem:
• I click on a table, and its schema appears below it - the schema is valid (the one corresponding to this table)
• I then click on the other table, but the schema of the previous table remains
• If I reload the page and click on the other table first, its schema likewise remains after I click on the first table
Info:
• The schemas really are different
• The API calls visible when inspecting the page actually return the correct schemas
• So it seems to be a UI-only issue
• I don't think I've seen this issue before
• Force-reloading the Firefox page (ignoring the cache) doesn't solve the issue
Ultimately it's only a UI inconsistency, so not a big deal, but it got me thinking about my schemas for a bit :smile:
  @jmeyer: If the issue remains, I'll create a Github issue :slightly_smiling_face:
  @sanket: Can you please create GitHub issue @jmeyer I'll try to reproduce and update the issue ticket.
  @mayanks: Thanks @sanket
  @jmeyer:
  @jmeyer: BTW, is there an Issue template or some guidelines ?
  @sanket: I don’t think there is any issue template yet
  @sanket: Thanks for reporting this issue @jmeyer. Fixed it and created the PR for this issue: cc: @mayanks @g.kishore @npawar
  @mayanks: :ship:
  @jmeyer: Fantastic, thanks @sanket !
@sanket: @sanket has joined the channel
@havjyan: Hello everyone! Excited to be part of this Slack channel. I am slowly learning Apache Pinot and Superset. Does anyone know how to resolve the following error? `Apache Pinot Error` `{'errorCode': 410, 'message': 'BrokerResourceMissingError'}` `This may be triggered by:` `Issue 1002 - The database returned an unexpected error` This is what I got from the Superset documentation: _Your query failed because of an error that occurred on the database. This may be due to a syntax error, a bug in your query, or some other internal failure within the database. This is usually not an issue within Superset, but instead a problem with the underlying database that serves your query._ What's the best fix for this?
  @mayanks: The error in Pinot means that it did not find the table. Can you try your query directly on Pinot's query console?
  @havjyan: The data that I am trying to query does not show up in the tables. When I run the select command in the SQL editor I get the same BrokerResourceMissingError.
  @mayanks: There are two possibilities: either there's a typo in the table name in your SQL query, or the table is not set up in Pinot correctly
  @havjyan: Checked for typos; everything looks good. The data is being pulled by a bootstrap script from GitHub.
  @mayanks: What table name do you see in the query console, and what happens when you click on that?
  @mayanks: You should see something like this in the query console:
  @havjyan: I have the same
  @mayanks: And you are querying `baseballStats` table and you get BrokerResourceMissing? That does not make sense. Just clicking the table name would fire a query and show you results.
  @havjyan: No, the table that I am trying to query is not showing in Pinot. I guess there is a problem with the Pinot ingestion job. It was supposed to download the raw data from the FTP server.
  @mayanks: Yeah, so if you don't see the table in the query console, then it does not exist in Pinot, and that would explain the error you are seeing.
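As a quick way to check outside of Superset, the controller's REST API lists the tables the cluster knows about and can run a query directly; a sketch assuming a local quickstart on the default controller port 9000 and a hypothetical table name:
```
# List the tables Pinot knows about (controller REST API)
curl -s localhost:9000/tables

# Run the same query directly against Pinot (the controller forwards it to a broker)
curl -s -H "Content-Type: application/json" -X POST \
  -d '{"sql": "SELECT COUNT(*) FROM myTable"}' \
  localhost:9000/sql
```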
  @havjyan: so there is a bug somewhere in the ingestion script I am assuming
  @mayanks: Not necessarily.
  @mayanks: Will need more info on what exactly you are running and how
  @havjyan: I have followed the steps from this repo and successfully got Pinot and Superset working on my localhost, except I was never able to see the data in Pinot, which resulted in the error in Superset.
  @mayanks: Hmm, this is not from official Pinot repo. @kennybastani seems like you built this. Could you please take a look?
@gabuglc: Hello, what is the best way to define a Pinot schema when connecting with Kafka via Schema Registry?
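Not sure about "best", but for reference: the Pinot schema itself is still defined as a normal schema JSON (the `AvroSchemaToPinotSchema` admin command can help bootstrap one from an Avro schema), and Schema Registry comes in through the realtime table's `streamConfigs` via the Confluent Avro decoder. A sketch with placeholder broker, registry, and topic names:
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "events",
  "stream.kafka.broker.list": "kafka:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
}
```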
@surendra: Hi, good evening. We are seeing the WARNs below in the controller logs, and segment start & end times are negative. What could be the root cause?
```
2021/04/16 16:59:54.827 WARN [TimeRetentionStrategy] [pool-10-thread-5] REALTIME segment: <>__0__982__20210312T0906Z of table: <>_REALTIME has invalid end time: -9223372036854775808 MILLISECONDS
2021/04/16 16:59:54.827 WARN [TimeRetentionStrategy] [pool-10-thread-5] REALTIME segment: <>__0__983__20210312T0936Z of table: <>_REALTIME has invalid end time: -9223372036854775808 MILLISECONDS
2021/04/16 16:59:54.827 WARN [TimeRetentionStrategy] [pool-10-thread-5] REALTIME segment: <>__0__984__20210312T1006Z of table: <>_REALTIME has invalid end time: -9223372036854775808 MILLISECONDS
```
  @mayanks: Hmm, your segment name seems to be `<>`?
  @mayanks: Because your table name is `<>`?
  @surendra: No, I replaced with `<>` sorry for that :slightly_smiling_face:
  @mayanks: Lol, ok
  @mayanks: Let me look at the code on why this might happen
  @ssubrama: The segment seems to have been created on March 12. Perhaps the segment has invalid data on time column? Did you by any chance change the schema since then? What is the retention on your table?
  @surendra: retention is 30 days
  @mayanks: Long.MIN_VALUE seems to suggest it is null in the incoming stream?
  @ssubrama: If you have access to the server, you can check the min and max values of the time column in the rows in the segment
  @surendra: Ok, will try that
  @ssubrama: You will find it in the metadata of the segment, in the directory where the segment files are stored, in a text file called `metadata.properties`
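For example, on a server one might check it roughly like this; the path below is illustrative, and the key names come from the segment's `metadata.properties`:
```
# On the server, inside the segment's directory (path is illustrative)
cd /path/to/server/dataDir/myTable_REALTIME/myTable__0__982__20210312T0906Z/v3

# Segment-level time range plus per-column min/max values
grep -E 'segment\.(start|end)\.time|(min|max)Value' metadata.properties
```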
  @mayanks: He did and it is Long.MIN_VALUE, suggesting that incoming values are likely null
@surendra:
```
{
  "id": "<>__0__1000__20210312T1413Z",
  "simpleFields": {
    "segment.crc": "1640078893",
    "segment.creation.time": "1615558396792",
    "segment.end.time": "-9223372036854775808",
    "segment.flush.threshold.size": "100000",
    "segment.flush.threshold.time": null,
    "segment.index.version": "v3",
    "segment.name": "<>__0__1000__20210312T1413Z",
    "segment.realtime.download.url": "s3://<>/pinot/<>/<>__0__1000__20210312T1413Z",
    "segment.realtime.endOffset": "62577410",
    "segment.realtime.numReplicas": "1",
    "segment.realtime.startOffset": "62565619",
    "segment.realtime.status": "DONE",
    "segment.start.time": "-9223372036854775808",
    "segment.table.name": "<>_REALTIME",
    "segment.time.unit": "MILLISECONDS",
    "segment.total.docs": "11791",
    "segment.type": "REALTIME"
  },
  "mapFields": {},
  "listFields": {}
}
```

#pinot-dev


@kuantian.zhang01: @kuantian.zhang01 has joined the channel
@phuchdh: @phuchdh has joined the channel
@phuchdh: Hi there. In my company we authenticate with GCS a different way. Any idea how to use the GcsPinotFS plugin without a GCP_KEY file?
  @mayanks: @fx19880617 ^^
  @fx19880617: right now you have to use it. Is there any issue preventing that?
  @fx19880617: A typical solution is to make the key file a secret and mount it into the container
  @phuchdh: SecOps hasn't approved exporting a key file :smile:. So I think my only option is to write a custom PinotFS plugin
  @fx19880617:
  @fx19880617: You can try to add a new option here to provide the key in your preferred way
  @fx19880617: Please modify it and submit a PR, we can help review the code
  @phuchdh: ok, Thanks for the help!
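A rough sketch of the kind of option being suggested (not the actual GcsPinotFS code): fall back to Google Application Default Credentials when no key file is configured, so the plugin could pick up workload identity or instance credentials instead of a `gcpKey` file:
```
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.io.FileInputStream;

public class GcsClientSketch {
  // 'gcpKeyPath' mirrors the existing 'gcpKey' config; null would mean "no key file provided".
  static Storage buildStorage(String projectId, String gcpKeyPath) throws Exception {
    GoogleCredentials credentials = (gcpKeyPath != null)
        // current behavior: read the service-account key file
        ? GoogleCredentials.fromStream(new FileInputStream(gcpKeyPath))
        // hypothetical new option: use Application Default Credentials (workload identity, GCE metadata, etc.)
        : GoogleCredentials.getApplicationDefault();

    return StorageOptions.newBuilder()
        .setProjectId(projectId)
        .setCredentials(credentials)
        .build()
        .getService();
  }
}
```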
@selvakumarcts: @selvakumarcts has joined the channel
@sleepythread: @sleepythread has joined the channel
@btripathy: @btripathy has joined the channel

#complex-type-support


@amrish.k.lal: This is a summary of how we are planning to move forward with JSON querying in Pinot: . Happy to discuss further, and I can set up a Zoom meeting either Monday or Tuesday afternoon to go over this in more detail.