#general
@onumlala: @onumlala has joined the channel
@onumlala: Hi, I'm looking into Minion's GDPR support. I read in the documentation that the Minion framework can be used to meet GDPR compliance requirements, but the detailed description is "coming soon." I'm confused. Is the ability to use the Minion framework to delete records under certain conditions in the background not yet available, or is it just that the documentation hasn't been written yet? I also have some questions about audit, authorization, and DR. 1. Audit at the query level: I need to know not only the table config and schema change log, but also who requested what queries (including target tables and conditions) and when. Does Pinot offer auditing, or is it possible to use Minion to monitor queries in the background and log them? 2. Is Pinot planning to provide authentication/authorization modules? Druid provides a built-in Kerberos authenticator and authorization through the Ranger extension. Does Pinot have similar plans? 3. I want to configure replication between two data centers (not using the cloud). Ideally, if data center 1 fails, we want to fail over to data center 2 and fail back when data center 1 is healthy again. Suppose I have configured deep storage (HDFS) and a Pinot cluster on k8s in each of the two data centers. Deep storage replication is possible, but what happens to real-time data? I understand that real-time data is kept in memory and segments are periodically flushed to disk. If a cluster goes down, will real-time data that has not yet been flushed be lost? I'm not sure how to configure DR on Pinot; is there an approach you would recommend? I'm in the process of getting to know Pinot. Thanks in advance for the help.
@fx19880617: I think LinkedIn is already using it for deleting records in the background for GDPR. @ssubrama @mayanks may have more information. For your questions: 1. For the query part, Pinot logs the query context in the broker logs; Pinot doesn't collect user-level info right now, that should come with AuthN/AuthZ. 2. Yes, it will be supported. Pinot currently has an ACL interface for users to plug in their own logic as well. 3. Pinot keeps the start offset of each segment to guarantee no data loss. When a server fails, the currently consuming data is in memory, so it will be gone. Once Pinot comes back online, it resets the Kafka consumer offset to the saved segment start offset and re-ingests the data.
@mayanks: Yes, there is a SegmentPurgeTask that can be used to purge records for GDPR.
@ssubrama: To clarify some more, Pinot does not replicate the data that it receives from the realtime stream. It is expected that (1) the stream is replicated underneath to a different data center, so that the other data center can serve during the disaster, and (2) all records in the stream are re-ingested into the data center that went down, so that Pinot there can reconsume them after the disaster. The second point can be relaxed a bit if you have a hybrid (as opposed to realtime-only) use case.
@ssubrama: As for minion purges, just clarifying that the task operates at a segment level, purging (or modifying) records as necessary. It is expected that the task executor has access to other databases that indicate which records need to be purged.
@snlee: Currently, there's no out-of-the-box task scheduler for purge tasks or record purger implementation in the open source code, but we do have all the building blocks. @laxman is working on the default purger scheduler implementation.
@laxman: Actually, @jackie.jxt is actively working on this. We had a discussion around pinot minion improvements last week. Please respond with your comments
@michael: How are text_match regexes performed? I'm looking for string-contains type queries (e.g. `TEXT_MATCH(column, '/.*partial_term.*/')`). Normally I would look for Lucene ngram tokenization for this, but I see Pinot isn't using it. How are the partial-term regexes handled? Is this essentially a raw regex evaluated against all tokens?
@pabraham.usa: @michael from my limited experience, to do a partial match you just put it in quotes: `TEXT_MATCH(column, '*partial_term*')`.
@michael: Without the regex it seems to only match on word tokens; with a wildcard regex it will match partial words, but I'm not sure how efficient the processing is.
@michael: A partial term errors out for me if I use a prefix wildcard
@michael: Without regex
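For reference, a minimal sketch of the two query forms being compared above (table and column names are hypothetical, and the exact matching behavior should be verified against your Pinot version):
```
-- Exact term match: matches whole tokens produced by the Lucene analyzer
SELECT * FROM myTable WHERE TEXT_MATCH(textCol, 'partial_term')

-- Lucene regex query: can match inside tokens, but is evaluated against
-- the terms in the text index, so it is typically more expensive
SELECT * FROM myTable WHERE TEXT_MATCH(textCol, '/.*partial_term.*/')
```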
@karinwolok1: Join us tomorrow for the last Pinot meetup for the year 2020! :wine_glass: The Pinot community has grown from 100 to 800 members this year :astonished:. We want to take this opportunity to thank the entire Pinot community and get your inputs on our 2021 roadmap. In this fireside chat, we will go over all the things we have accomplished together in 2020 and talk about all the fantastic indexing techniques available in Pinot. Afterward, we'll open up for questions and discussions about Pinot and its roadmap. Sign up here -
@karinwolok1: :tada: Welcome new Pinot :wine_glass: members this week!!! @onumlala @lnc.adoni @gergely.lendvai93 @chun.zhang @radoslav.nikolov @pparkar @michael @atoildw @andy108 @hua.michael.chen @leedohyun
@pradeepgv42: Hi everyone, about a quarter back we added Apache Pinot into our product, which has helped us build a real-time analytics feature. This blog post has some benchmarks and the reasons why we chose Apache Pinot over alternatives. Hopefully anyone who is new to Pinot will find it useful, and for others it might be a fun read. Also, I just want to say thanks to the Pinot community for being super helpful to us through this journey :)
@dharakkharod: Hi, while testing offline table ingestion from the Pinot GitHub repo, I found that the `overwrite` mode is now called `refresh`, and I got an error while using `overwrite` as the segment push type. Is the `overwrite` keyword no longer valid?
@npawar: afaik, it was called REFRESH from the start.
@npawar: Just to confirm, you're referring to the `segmentPushType` field in the table config, and not the `overwriteOutput` field in the ingestion spec?
@dharakkharod: yeah `segmentPushType`, was it always either `append` or `refresh`?
@npawar: yup
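For anyone following along, a minimal sketch of where that field lives in the table config (this is just the relevant fragment; the table name, time column, and values are hypothetical):
```
{
  "tableName": "myTable_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "daysSinceEpoch",
    "segmentPushType": "REFRESH",
    "replication": "1"
  }
}
```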
@chinmay.cerebro: Hi all! We have made a list of things we can work on in 2021 and would like to get valuable input from the community so we can prioritize accordingly. Please vote or add additional items at:
#random
@onumlala: @onumlala has joined the channel
#troubleshooting
@onumlala: @onumlala has joined the channel
@tanmay.movva: Hello, I’ve set replicas per partition to 1 for LLC streaming ingestion. Whenever Pinot fails to ingest records from Kafka (in our case it is schema registry restarts), it throws an error and sets the segment state to OFFLINE. Even after the issue is resolved, I don’t see consumption being resumed/retried. I tried triggering a reload of the offline segments, but it did not have any effect. What else can I do to resume consumption?
@tanmay.movva: Recreating the table solves the issue on non-prod environments. But I wonder how to tackle this issue in production environments.
@chinmay.cerebro: @tanmay.movva I believe this issue is tracking this problem:
@npawar: For consuming segments that get marked offline, there is a periodic task that will correct them. Once it runs, the consumption will be restored. That task runs every hour by default. You can increase the frequency via controller configs if you want it to run sooner
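A minimal sketch of the controller setting that governs how often that periodic task runs (the property name here is from memory and should be checked against the ControllerConf of your Pinot version; the value is in seconds):
```
controller.realtime.segment.validation.frequencyInSeconds=900
```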
@ken: Hey all, I’m now running a segment generation/push that’s using HDFS for input/output. The relevant bits in the job file for input/output dir are: ```inputDirURI: 'hdfs://<clustername>/user/hadoop/pinot-input/' includeFileNamePattern: 'glob:**/us_*.gz' outputDirURI: 'hdfs://<clustername>/user/hadoop/pinot-segments/'``` When I run the job, segments are generated, but then each segment fails with something like: ```Failed to generate Pinot segment for file - hdfs:/user/hadoop/pinot-input/us_2020-03_03.gz java.lang.IllegalStateException: Unable to extract out the relative path based on base input path: hdfs://<clustername>/user/hadoop/pinot-input/``` So it looks like the input file URI is getting the authority (`<clustername>`) stripped out, which is why the `baseInputDir.relativize(inputFile)` call fails to generate appropriate results in `SegmentGenerationUtils.getRelativeOutputPath`. Or is there something else I need to be doing here to get this to work properly? I’m able to read the files, so the `inputDirURI` is set up properly (along with HDFS jars).
@fx19880617: I think this is a bug, can you give an example of dir uri and file uri?
@fx19880617: Is the authority stripped out when the filesystem listing happens?
@ken: I’m guessing the authority is being stripped somewhere, as the inputDirURI is correct, the files are being read, and it’s only when trying to create a relativized path for writing the files that the input file URI no longer contains the authority bit
@ken: Dir URI is `'hdfs://<clustername>/user/hadoop/pinot-input/'` (in job yml file). But input file URI is `hdfs:/user/hadoop/pinot-input/us_2020-03_03.gz`
@ken: (no clustername)
@fx19880617: ic
@fx19880617: Does it work if you remove the cluster name from input dir uri?
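In case it helps later readers, a sketch of the workaround being suggested here, i.e. dropping the cluster name (authority) from the URIs so the base dir and the listed files relativize consistently (this assumes the default filesystem in the Hadoop config already points at the cluster):
```
inputDirURI: 'hdfs:///user/hadoop/pinot-input/'
includeFileNamePattern: 'glob:**/us_*.gz'
outputDirURI: 'hdfs:///user/hadoop/pinot-segments/'
```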
#pinot-docs
@amrish.k.lal: Hello, This is the PR for doc change for percentile functions:
@steotia: In the past I have been able to directly edit gitbook without having to raise a PR. Has that changed now?
#getting-started
@myeole: @tingchen @fx19880617 I see a lot of files written to S3 under the same timestamp, but I also see errors on both the controller and the server. On the cluster manager console the segment keeps showing CONSUMING. We are trying to use the split commit feature, but even after setting split commit to true for both controller and server, we still see `"isSplitCommitType":false` in the server error. Error in the server logs: ```[LLRealtimeSegmentDataManager_pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z] [pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z] CommitEnd failed with response {"isSplitCommitType":false,"streamPartitionMsgOffset":null,"buildTimeSec":-1,"status":"FAILED","offset":-1}``` Error in the controller logs: ```[SegmentCompletionFSM_pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z] [grizzly-http-server-1] Caught exception while committing segment file for segment: pullRequestMergedEventsAwsMskDemo__0__1__20201214T1851Z```
@fx19880617: Seems like the AWS credentials are missing. Did you set the environment variables for the access key and secret key?
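For reference, a sketch of the settings involved here (the split-commit property names are from the Pinot docs; the credential env vars are the standard AWS ones, and the values are placeholders):
```
# Controller config
controller.enable.split.commit=true

# Server config
pinot.server.enable.split.commit=true

# AWS credentials picked up by the S3 client
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
```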
