#general


@geoyogesh: @geoyogesh has joined the channel
@quietgolfer: Any design recommendations for Pinot setups that need to deal with data protection requirements of different locations, where certain personal data should remain within location boundaries (e.g. GDPR)? Do people try to set up global tables and use Server and Segment definitions to limit scope? Or do people create separate tables?
  @mayanks: I am unsure if there is a way for you to force (let alone guarantee) that a segment stays on a certain server (in a certain geo location).
  @mayanks: Which leaves option 2 I guess.
  @mayanks: For your case, would you end up with too many tables (one per geo location boundary)? Perhaps something that can be explored as a feature request (in which case, do you mind opening an issue?).
  @quietgolfer: Ah okay. Good to know. It’s fine if we have multiple tables. I’d imagine this number will grow with time. I’m curious to see what other companies do. I imagine this is a common issue.
  @mayanks: For us, the GDPR requirements have been more around purging PII data when needed, and for that we have set up a Minion job to perform the purging. I haven't personally run into the requirements you mentioned, although they make sense to me, which is why I was exploring whether more folks need this; perhaps we can open a request for the same.
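  For context, per-table Minion tasks are declared in the table config under `task.taskTypeConfigsMap`. A minimal sketch of the shape such a purge setup takes; the `PurgeTask` name and the `schedule` key are assumptions here, since the exact task type and scheduling options depend on your Pinot version and Minion deployment:
  ```
  {
    "task": {
      "taskTypeConfigsMap": {
        "PurgeTask": {
          "schedule": "0 0 2 * * ?"
        }
      }
    }
  }
  ```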
@ravi.maddi: Hi All, I have a question about deeply nested JSON (at least 5 to 7 levels of embedded objects). Which is the better way to design the schema for it? 1. Flatten the JSON -- but the schema becomes unscalable. 2. Store the embedded JSON (the JSON indexing concept) and use JSON extraction functions -- but that is showing very high latency. I saw a technical session on nested indexing where they said that with one million records, a JSON extraction function might take 10 to 15 seconds to return a result. Could you please tell me which is the better way? How should I design a schema for nested JSON?
  @fx19880617: For large JSON the decoding cost is always there, so depending on your query load and latency requirements you can decide whether to flatten it or not. Typically you can flatten all the fields you know will be frequently queried; for the others, @jackie.jxt has implemented JSON indexing, but I'm not sure whether that would work for 5-7 level JSON.
  @ravi.maddi: Well, got it. Can you tell me which one you suggest, please? Flattening is very difficult in my scenario: we have arrays of embedded objects, and the array size is not predictable.
  @ravi.maddi: Can we connect and discuss? Could you share your availability, please?
  @jackie.jxt: I'm available after 4PM PT today, does it work for you?
  @jackie.jxt: You can also just try out the json index and see how it performs
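  For anyone trying this: the JSON index is enabled per column in the table config, and the column itself stays a STRING holding the raw nested JSON. A minimal sketch, where `payload` is a hypothetical column name:
  ```
  {
    "tableIndexConfig": {
      "jsonIndexColumns": ["payload"]
    }
  }
  ```
  Queries can then filter on nested fields with `JSON_MATCH(payload, ...)` instead of decoding every record at query time.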
@klijeesh: @klijeesh has joined the channel
@gagandeep.singh: @gagandeep.singh has joined the channel
@karinwolok1: Welcome new :wine_glass: Pinot slack members!!! Curious who you are and how you found the Pinot community! Want to share what you're working on? @klijeesh @gagandeep.singh @allison.t.murphy22 @nhat @rahul.kabra.corp @geoyogesh @kis @gaythu.rajan @rahul @sameerasalameh95 @sunxiaohui.bj @savannahjenglish @lam010210 @amherman @virtualandy @osman @asif @brianolsen87 @ali @xd @jamesmills @simon.paradis @pranasblk @rkitay
@jamesmills: Hi @karinwolok1 - currently a production user of Apache Kafka, researching ways to provide analytical reporting as part of our architecture: a pub/sub model designed to scale on demand (message lag)
@klijeesh: Hi, I am Lijeesh. I am working at Walmart Labs. We are evaluating Pinot for real-time analytical store use cases.
@brianolsen87: Hey @karinwolok1 :wave:, I'm Brian, a dev advocate in the Trino community :rabbit2:. I hopped over here initially after we invited @elon.azoulay and @fx19880617 to speak about Pinot, and I really wanted to convey what use cases each system solves and why :rabbit2: :heavy_plus_sign: :wine_glass: is such a great combination. I look forward to getting to know everyone here. Feel free to reach out for any info about Trino. Commander Bun Bun says Cheers!

#random


@geoyogesh: @geoyogesh has joined the channel
@klijeesh: @klijeesh has joined the channel
@gagandeep.singh: @gagandeep.singh has joined the channel

#troubleshooting


@tanmay.movva: Hello, I am facing difficulties in flattening and transforming JSON records from Kafka. This is the structure of the JSON record:
```
{
  "event_name": "abcd",
  "event_type": "",
  "version": "v1",
  "write_key": "",
  "properties": {
    "status_code": "some_code",
    "status": "some_status",
    "mode": "some_mode"
  },
  "event_timestamp": 1616157914,
  "mode": "live"
}
```
And my schema looks like this:
```
"mode": "string",
"request_failure": "INT"
```
  @tanmay.movva: I want to define the `mode` column as `$.properties.mode`, which is just simple JSON flattening. From the docs, I was not able to understand the correct syntax for `jsonPathString` to use in the tableConfig.
  @tanmay.movva: And the `request_failure` column is a derived column based on `$.properties.status`. I learned from the docs that chaining transformations isn't supported in Pinot, so I can't define a column `status = $.properties.status_code` and then use it to define `request_failure = if(status == 'created', 1, 0)`. So I think I need to write a Groovy script that extracts the value from the nested JSON and applies the if/else logic. But to extract a value from nested JSON, I would have to import JsonSlurper in Groovy (not so familiar with Groovy, but this is what I found on SO/the internet for parsing JSON in Groovy). So my question here is: does Pinot support import statements in the Groovy script? If not, how can I achieve this transformation in Pinot?
  @tanmay.movva: To give context around the use case: this is currently onboarded to Druid, with alerts set up in ThirdEye on top of the Druid datasource. We were facing issues with query response time because of one column, so we are trying to migrate the datasource to Pinot and try out star-tree indexing, in the hope of seeing much better performance. As of now, we are facing issues migrating the datasource from Druid to Pinot (all the required transformations are supported in Druid).
  @tanmay.movva: Do let me know, if more info is required.
  @tanmay.movva: @npawar
  @tanmay.movva: We are running the image with `latest` tag.
  @npawar: Chaining is supported now
  @tanmay.movva: Great!! Then it should be straightforward, I guess. Can you help me with the syntax here?
  @tanmay.movva: And also I am trying to apply `filterConfig`, but it isn’t working. Here is my tableConfig:
  ```
  {
    "tableName": "x_fts_merchant_events_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "schemaName": "x_fts_merchant_events_dimensions",
      "timeColumnName": "event_timestamp",
      "timeType": "MILLISECONDS",
      "replicasPerPartition": "1",
      "retentionTimeValue": "1",
      "retentionTimeUnit": "DAYS",
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.consumer.type": "LowLevel",
        "stream.kafka.topic.name": "x-fts-events-kafka",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.broker.list": "kafka-kafka-bootstrap.kafka.svc.cluster.local:9092",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
      },
      "loadMode": "MMAP"
    },
    "metadata": {},
    "ingestionConfig": {
      "filterConfig": {
        "filterFunction": "Groovy({event_name == 'abc'}, event_name)"
      }
    }
  }
  ```
  @npawar: "transformFunction": "jsonPathString(properties, '$.mode')"
  @npawar: this demo also uses it:
  @npawar: filter config looks right, it should work
  @npawar: why do you think it’s not working? are you still seeing event_name ‘abc’ in the ingested data?
  @tanmay.movva: Yes. I created two tables, with and without filterConfig. In the table without filter config I am able to see the record. But not in the table with filter config.
  @tanmay.movva: > are you still seeing event_name ‘abc’ in the ingested data? Not able to see this. Ideally, I should be able to.
  @npawar: the way you’ve added the filter config, “abc” will be filtered out. That means it won’t be included in the data
  @tanmay.movva: Oh! I thought that if the condition is satisfied, the record would not be skipped (this was the case in Druid). My bad. Will update the filter and try it again.
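  Since the `filterFunction` drops records for which the function evaluates to true, keeping only the 'abc' events means negating the condition from the config above. A sketch of the corrected filter:
  ```
  {
    "ingestionConfig": {
      "filterConfig": {
        "filterFunction": "Groovy({event_name != 'abc'}, event_name)"
      }
    }
  }
  ```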
  @tanmay.movva: > Chaining is supported now Just to confirm. I first extract the field from json using inbuilt function, then I can use groovy to derive a field from the extracted column, correct?
  @npawar: Correct, should be supported in the latest. Was added about a month back
  @tanmay.movva: I think the image was pulled 5 days ago. Will try these and update the thread. Thanks Neha!
  @npawar: Cool :+1:
  @tanmay.movva: One more question, when chaining transformations, do the intermediate columns have to be present in the schema? Is it possible to just use them in other transformations and exclude them from schema?
  @npawar: Hmm, they will have to be in the schema.
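  Putting the answers in this thread together, a chained setup might look like the sketch below: the intermediate `status` column is extracted first (and, per the note above, must also be declared in the schema), then a Groovy transform derives `request_failure` from it. The JSON path and the Groovy expression follow the names used in this thread and are illustrative only:
  ```
  {
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "status",
          "transformFunction": "jsonPathString(properties, '$.status')"
        },
        {
          "columnName": "request_failure",
          "transformFunction": "Groovy({status == 'created' ? 1 : 0}, status)"
        }
      ]
    }
  }
  ```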
  @g.kishore: Chaining transform function is supported
  @tanmay.movva: All the suggestions worked. Thanks!!!
  @npawar: Great! :)
@geoyogesh: @geoyogesh has joined the channel
@harshvardhan.surolia: @harshvardhan.surolia has left the channel
@klijeesh: @klijeesh has joined the channel
@humengyuk18: Hi team, how do I quote a reserved word like “timestamp” when querying Pinot in Presto? I tried using double quotes but it’s not working. Any workarounds?
  @g.kishore: whats the error? @fx19880617 ^^
  @fx19880617: TIMESTAMP is a reserved keyword in Presto for the data type; try to avoid using it as a column name, for your own convenience
  @fx19880617: you can try whether using a backtick ` helps
@gagandeep.singh: @gagandeep.singh has joined the channel
@elon.azoulay: We scaled up our brokers and saw that for about 15 minutes some brokers were returning "RESOURCE_MISSING_ERROR". When I looked in the ZK browser at INSTANCES/broker-<x>/CURRENTSTATES, it appears the END_TIME values spanned that 15-minute interval. To avoid this, should users get their client by contacting ZooKeeper, i.e. looking up the brokers for the table?
  @elon.azoulay: Or should this not have happened?
  @elon.azoulay: Currently users just set the pinot broker hostname to the kubernetes service, and there is only 1 broker tenant.
  @mayanks: Typically, there should be a load-balancer/vip in front of the brokers that should listen to external view to ensure that brokers are ONLINE before it can route requests.
  @mayanks: But 15 minutes seems like a long time
  @elon.azoulay: But if the broker is up but not yet ready to serve requests how do we tell? We just use the readiness probe from the k8s chart.
  @mayanks: Broker is ready when it is ONLINE on external view.
  @elon.azoulay: Thanks - so maybe we are getting our clients incorrectly then? i.e. just using the k8s service (which only adds endpoints when they are online). Judging from the messages in the ZK browser under INSTANCES/<broker>/CURRENTSTATES, the earliest table ONLINE occurred 17 mins before the last one

#pinot-dev


@geoyogesh: @geoyogesh has joined the channel
@khushbu.agarwal: Hi, does pinot by default create sorted forward index on a column of type datetime?
  @mayanks: Hi, no this is not the case.
  @mayanks: Are you seeing any issues, or just curious? Trying to get the question behind the question.
  @khushbu.agarwal: Just curious to know
  @khushbu.agarwal: Also, the intent is to know whether adding such an index would improve query performance, considering these values will be unique across all rows. I was considering a sorted forward index on raw values.
  @jackie.jxt: Currently we treat datetime column the same as regular columns except for retention management. Also, sorted index is always dictionary encoded, and cannot be applied directly to raw values
  @jackie.jxt: If the values are sorted within the segment, Pinot will detect that and generate sorted index automatically
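  For a real-time table, the sort column can also be declared explicitly in the table config, and the server sorts each segment when committing it (sorted columns are dictionary-encoded, per the note above). A minimal sketch with a hypothetical column name:
  ```
  {
    "tableIndexConfig": {
      "sortedColumn": ["myTimeColumn"]
    }
  }
  ```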
@lam010210: @lam010210 has joined the channel

#getting-started


@geoyogesh: @geoyogesh has joined the channel
@gagandeep.singh: @gagandeep.singh has joined the channel

#releases


@geoyogesh: @geoyogesh has joined the channel

#pinot-startup


@ravi.maddi: Thanks Uday,
@ravi.maddi: @vallamsetty & @mayanks -- I have a question. Ours is a SaaS solution, and we have to maintain every client's data separately. For example, we maintain event count stats for each client, like Event-count-stats_clientID. My doubt is: i. do we have to write one schema per client, like Event-count-stats_1, Event-count-stats_2, ... (but this solution is unscalable), or ii. can we use one segment per client with the same schema? Is that possible? Could you please guide me on the best way to handle this scenario?
  @mayanks: Need to understand: ```1. When you say you need to maintain every client's data differently, is this about privacy, or some compliance requirement? 2. What's the issue with a schema like <client_id, event_count>?```
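  To make the second question concrete, a single shared table keyed by client would use an ordinary schema rather than one table per client. A minimal sketch of a `<client_id, event_count>` style schema; all names here are illustrative:
  ```
  {
    "schemaName": "event_count_stats",
    "dimensionFieldSpecs": [
      { "name": "client_id", "dataType": "STRING" },
      { "name": "event_name", "dataType": "STRING" }
    ],
    "metricFieldSpecs": [
      { "name": "event_count", "dataType": "LONG" }
    ],
    "dateTimeFieldSpecs": [
      {
        "name": "event_timestamp",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }
    ]
  }
  ```
  Queries then filter on `client_id`, so per-client separation is handled in the data rather than by multiplying schemas.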
@ravi.maddi: Hi Uday and Mayank, I have a question about deeply nested JSON (at least 5 to 7 levels of embedded objects). Which is the better way to design the schema for it? 1. Flatten the JSON -- but the schema becomes unscalable. 2. Store the embedded JSON (the JSON indexing concept) and use JSON extraction functions -- but that is showing very high latency. I saw a technical session on nested indexing where they said that with one million records, a JSON extraction function might take 10 to 15 seconds to return a result. Could you please tell me which is the better way? How should I design a schema for nested JSON?
  @mayanks: Do you need all fields in the JSON record? Or are you only interested in some fields? If so, you can use transform functions.
  @mayanks: You can find some examples of transforms here: