#general
@rchandel: @rchandel has joined the channel
@gamparohit: @gamparohit has joined the channel
@tangyonga: Hi team, a streaming app often does the following: 1. Read local files into Kafka using Flume. 2. Do ETL transformations from the Kafka topic using Flink. 3. Push data from Flink into LinkedIn's Pinot. So I am not doing a direct mapping from Kafka to a Pinot table just like…
@g.kishore: You can write the output of Flink to another Kafka topic for real-time ingest, or use the Pinot segment generation API to do periodic batch uploads
@tangyonga: Thanks for the reply, @g.kishore. Ideally, I would like to write the output of Flink into Pinot directly, in an in-memory way.
@g.kishore: Pinot does not have a write API as of now...
@g.kishore: writes happen via Kafka
@g.kishore: it's something we plan to add in 2021
@tangyonga: Sure, I am ready to write the output of Flink to another Kafka topic for real-time ingest. Looking forward to the write API :thumbsup:
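For context, a minimal sketch of what writing Flink output to a Kafka topic for Pinot real-time ingestion could look like. The topic name, broker address, source, and transformation below are illustrative assumptions, not details from this thread:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class EtlToPinotTopic {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Placeholder source and transformation standing in for the actual ETL logic.
    DataStream<String> transformed = env
        .socketTextStream("localhost", 9999)
        .map(String::toUpperCase);

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address

    // Write transformed records to the topic that the Pinot real-time table consumes.
    transformed.addSink(new FlinkKafkaProducer<>(
        "pinot-ingest-topic",        // assumed topic name
        new SimpleStringSchema(),
        props));

    env.execute("etl-to-pinot-topic");
  }
}
```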
@vinulam: @vinulam has joined the channel
@tangyonga: Hi team, Uber made a contribution around schema inference that saves a lot of manual effort. I think this capability is important when landing in production. So, is there any plan to add it to the 2021 roadmap, or has it already been implemented? Thanks! (
@yupeng: Hi Mark, at Uber we didn't build this feature in the Pinot layer
@yupeng: but in a workflow management system that integrates Pinot as a connector. I think it's a good idea to abstract this logic and make it part of util/tooling. @changliu
@tangyonga: I see, thanks @yupeng!
@sankalp.jain02: @sankalp.jain02 has joined the channel
#random
@rchandel: @rchandel has joined the channel
@gamparohit: @gamparohit has joined the channel
@vinulam: @vinulam has joined the channel
@sankalp.jain02: @sankalp.jain02 has joined the channel
#troubleshooting
@rchandel: @rchandel has joined the channel
@gamparohit: @gamparohit has joined the channel
@vinulam: @vinulam has joined the channel
@sankalp.jain02: @sankalp.jain02 has joined the channel
@ken: If I do a query with a `where mvfield in ('a', 'b') group by mvfield`, and `mvfield` is a multi-valued field, I get a result with groups for values from `mvfield` that aren’t in my where clause. I assume I’m getting groups for every value found in `mvfield` from rows where `mvfield` contains a match for my filter, but it seems wrong…am I missing something?
@mayanks: You need the `value_in` udf. Let me find an example
@mayanks: @jackie.jxt ^^
@ken: Thanks for the hint, found an example in the code. Should be `select VALUEIN(mvfield, 'a', 'b') as groups from table group by groups`
@mayanks: Yeah, there you go
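To make the pattern concrete, here are the two queries side by side; `mytable` is a hypothetical table name, while `mvfield`, `'a'`, and `'b'` come from the question above:

```sql
-- Groups appear for every value in mvfield on matching rows,
-- including values outside the filter:
SELECT mvfield, COUNT(*) FROM mytable
WHERE mvfield IN ('a', 'b')
GROUP BY mvfield

-- VALUEIN restricts the group-by values to the filtered set:
SELECT VALUEIN(mvfield, 'a', 'b') AS groups, COUNT(*) FROM mytable
WHERE mvfield IN ('a', 'b')
GROUP BY groups
```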
@ken: Where should I add documentation for this UDF?
@mayanks: Seems the doc exists:
@mayanks: Perhaps suggest where you would have looked.
@ken: OK - I was looking in documentation about grouping. So maybe a note there about when grouping on a multi-valued field that’s also been used for filtering?
@mayanks: yeah, makes sense
#pinot-dev
@gamparohit: @gamparohit has joined the channel
@sankalp.jain02: @sankalp.jain02 has joined the channel
#discuss-validation
@mayanks: Thanks @chinmay.cerebro. I left a general comment in the doc to request being a little forgiving for grey areas (listed examples). In our case, there is code on the client side that may auto-generate these configs, and hence a one-time cleanup does not always help.
@mayanks: Just to clarify, I do support adding these checks, just saying we could be a little forgiving and accept `"replication": 3` as well as `"replication": "3"`
@chinmay.cerebro: @mayanks yeah I had the same point. we definitely want to make sure we don't break those things
@chinmay.cerebro: it'll be too much work to go and fix all those existing table configs
@mayanks: Yeah, for our case, those table configs are code generated, so that makes it harder.
@chinmay.cerebro: +1
@chinmay.cerebro: I'll see if that is possible. @mohammedgalalen056 FYI
@ssubrama: I have added comments. Overall: (1) In the realtime case, please do not preclude someone from developing a new stream plugin by introducing Kafka-specific checks. Let the plugin check the things it is supposed to check. You can make the plugin provide a schema if need be. (2) Introduce a level of checking in the controller that we can crank up as we tighten things. At least a boolean flag of "strict" vs "loose", defaulting to the latter; we can turn it to "strict" over time (one release of Pinot should be sufficient) and then drop the config altogether
@chinmay.cerebro: that's a great point
@chinmay.cerebro: sounds good. Thanks for the review @ssubrama
@chinmay.cerebro: @mayanks we'll definitely look at all the existing concerns. But at a high level, is it easy to test such validation code before we check it in?
@chinmay.cerebro: within LinkedIn?
@mayanks: It is definitely possible, but it is somewhat of a moving target (due to code-generated configs). Also, we need to know when to do it, and we need to assign someone to do it.
@chinmay.cerebro: I see
@chinmay.cerebro: the good thing about having a schema is you can plug it in on the producer side as well
@chinmay.cerebro: for better validation
@mayanks: So this code is not ours; there are Pinot users who have written all kinds of code/scriptware etc. to generate these, and most of the time we aren't even aware of their existence.
@chinmay.cerebro: I see
@chinmay.cerebro: well all the more reason for early validation :slightly_smiling_face:
@chinmay.cerebro: I modified the schema to allow both 3 and "3" FYI
@chinmay.cerebro: this does mean we cannot enforce additional constraints (e.g. min/max)
@chinmay.cerebro: but I think it's OK for now
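As a sketch, the lenient rule could be expressed in json-schema along these lines (the field name comes from the discussion; the actual schema file isn't shown in the thread):

```json
{
  "type": "object",
  "properties": {
    "replication": {
      "oneOf": [
        { "type": "integer" },
        { "type": "string", "pattern": "^[0-9]+$" }
      ]
    }
  }
}
```

The trade-off mentioned above is visible here: once string values are admitted, numeric keywords such as `minimum`/`maximum` no longer apply to that branch.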
@mayanks: I agree, we do need validation for sure. As Subbu suggested, we can allow for level of checking (strict vs loose) and move from loose to strict over time for deployments where that is needed.
@chinmay.cerebro: upon further reading, it doesn't look like json-schema has tunable validation levels. We might have to skip validation checks for a lot of fields (e.g. inverted indexes) to account for custom code-generated table configs.
@chinmay.cerebro: at this point, I'm not sure if json-schema will work for us
@g.kishore: then just add rules; we can still have json-schema based validation, but that's just one of the rules
@chinmay.cerebro: you mean, be able to configure json-schema as one of the validation mechanisms ?
@chinmay.cerebro: we can do that
@g.kishore: yes
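A minimal sketch of the rule-based arrangement being suggested, with json-schema validation as just one pluggable rule; every name below is hypothetical, not an actual Pinot API:

```java
import java.util.List;

// Hypothetical abstraction: each rule inspects a raw table config
// and reports problems instead of hard-failing the whole validation.
interface TableConfigValidationRule {
  List<String> validate(String tableConfigJson);
}

// json-schema validation becomes just one rule among many...
class JsonSchemaRule implements TableConfigValidationRule {
  private final String schemaJson;

  JsonSchemaRule(String schemaJson) {
    this.schemaJson = schemaJson;
  }

  @Override
  public List<String> validate(String tableConfigJson) {
    // Delegate to a json-schema library of choice (omitted in this sketch).
    return List.of();
  }
}

// ...while grey areas get their own, more forgiving rules.
class ReplicationRule implements TableConfigValidationRule {
  @Override
  public List<String> validate(String tableConfigJson) {
    // Accept "replication": 3 as well as "replication": "3",
    // flagging only values that don't parse as a positive integer.
    return List.of();
  }
}
```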
#pinot-perf-tuning
@ken: I added two more servers to my cluster, and performance has dropped. One theory is that one or both of the new servers is slower than the existing ones, causing the drop in performance. How can I confirm or refute that theory? Are there Pinot metrics I should be examining?
@mayanks: You can check server side latency metric. That will tell you the latency from individual servers, and identify if the new ones are slow.
@mayanks: Alternatively, the broker logs latency it sees from individual servers too.
@ken: Broker logs are a good option, as I was hoping for something that wouldn't require me to rig up metrics just yet.
@ssubrama: If you scatter queries to N+2 servers instead of N, that may increase latency due to GC (assuming all servers are of the same capacity). The probability that any one server is delayed due to GC is now higher. Of course, it depends on N. I don't think this would be observable if N is 10, but it may well be if N is 2.
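To make that concrete: if each server independently has probability p of hitting a GC pause during a query, the chance that the scatter is delayed grows with N:

```
P(at least one of N servers pauses) = 1 - (1 - p)^N
e.g. with an illustrative p = 0.05: N = 3 gives ~14%, N = 5 gives ~23%
```

The p = 0.05 figure is an assumption for illustration, not a measurement from this cluster.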
@ken: N is 3 (went from 3 to 5). But seems unlikely to be GC related, as (a) it’s repeatable, and (b) time went from about 300ms to 900ms consistently.
@mayanks: Yeah, the broker log will tell you exactly which server took how long, and you can deduce broker side time (which will slightly increase due to more work for 'gather' phase).
@mayanks: Also, are you adding more nodes to reduce latency or improve throughput?
@mayanks: If the latter, adding more replica groups might be better than adding more servers to a single replica (or not using replica groups)
@ken: Adding nodes to reduce latency
@ken: @mayanks the pinotBroker.log file does have the info I’m looking for, but it seems to not be getting flushed right away. Is this something I can force flush, or change the flush interval?
@mayanks: Not sure, perhaps there's a log4j setting?
@ken: You’re right - there’s an `immediateFlush="false"` flag in the pinot-broker-log4j2.xml file.
@mayanks: Oh nice, I actually didn't know.
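For anyone else hitting this, a sketch of the relevant change in pinot-broker-log4j2.xml; only the `immediateFlush` attribute comes from this thread, and the appender name, layout pattern, and file paths are placeholders:

```xml
<!-- Flush each log event to disk immediately instead of buffering. -->
<RollingFile name="brokerLog"
             fileName="logs/pinotBroker.log"
             filePattern="logs/pinotBroker-%d{yyyy-MM-dd}-%i.log"
             immediateFlush="true">
  <PatternLayout pattern="%d %p [%c{1}] %m%n"/>
  <Policies>
    <SizeBasedTriggeringPolicy size="100MB"/>
  </Policies>
</RollingFile>
```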
@ken: And just FYI, the change in performance was due to a change in the star-tree index that happened to be made at the same time, where the net-net was that we wound up with a lot more nodes in the tree. `#toomanymovingparts` :slightly_smiling_face: Thanks again for the help.
@mayanks: I see. Glad you were able to find the root cause.
#getting-started
@gamparohit: @gamparohit has joined the channel
#pinot_website_improvement_suggestions
@karinwolok1: @karinwolok1 has joined the channel
@karinwolok1: @karinwolok1 set the channel purpose: How we can improve
@chethanu.tech: @chethanu.tech has joined the channel
@kennybastani: @kennybastani has joined the channel
@karinwolok1: Hi! Just going to throw some random suggestions in the bucket. Currently, we have an Apache Medium blog, but I don't think those posts are linked from the website...? I'm thinking maybe we should also create a "community" tab? There we could add a community calendar and links to Slack and the meetup group. What do you guys think?
@chethanu.tech: No, I was thinking of adding the blog to the website itself, or some of the posts could be linked from the website
