#general


@karinwolok1: Reminder - tomorrow, Monday, December 13 - Apache Pinot 2021 recap and future roadmap discussion! :call_me_hand: :pencil2: Vote on features and improvements you'd like to see in Apache Pinot 2022! :pencil2:
@diogo.baeder: One more question, folks: when it comes to segments of ~200M in size, what segment storage technology would you recommend using when running a cluster in AWS? HDFS? S3? EFS mounted?
  @g.kishore: EBS
  @mayanks: Yes, for local storage attached to serving nodes you can use EBS. For deep store you can use S3.
  @ken: @diogo.baeder - you can also use HDFS for deep store.
  @ken: @g.kishore do you know of any Pinot performance comparisons of EBS vs local SSDs?
  @g.kishore: Nothing in a presentable form.
  @diogo.baeder: Thanks, guys, but which of those options do you think gives the best performance, say, in a scenario with something like up to 10T of data?
  @g.kishore: What’s your QPS and latency expectation?
  @g.kishore: The only options are local SSD, EBS, or EFS.
  @g.kishore: S3 and HDFS are only applicable to the deep store, which is a backup segment store and is not accessed at query time.
  @diogo.baeder: QPS up to a few dozen at most; latency can be seconds, but preferably under 1 minute. Thanks for the info, man!
  @mayanks: Yeah, you definitely don’t need local SSD for this. As Kishore mentioned, any of the options for network-attached disk on the serving nodes will work.
  @diogo.baeder: Ah, awesome, thank you guys!
  @mayanks: Since the latency is not too tight, you might want to pack a lot of data per instance, so EBS for serving nodes seems good. For deep store, S3 or HDFS both work (S3 is more popular in my personal experience).
  @diogo.baeder: Got it. I'll take that into consideration, and also probably go for S3 for the deep store backups (since we already use it a lot for other things)
  @ashwinviswanath: If you want latency in seconds ideally, have you considered Hudi?
  @diogo.baeder: Not really; I'm not sure what role that would play when integrated with Pinot, though.
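For the S3 deep store setup settled on above, a minimal controller configuration sketch might look like the following. The bucket, region, and temp directory are placeholders, and the property names follow the Pinot S3 deep store documentation as commonly shown, so they should be verified against the Pinot version in use.

```
# Deep store on S3: segments are backed up here and are not read at query time.
controller.data.dir=s3://my-pinot-deepstore/segments
controller.local.temp.dir=/tmp/pinot-controller-tmp

# Register the S3 filesystem plugin and allow segment fetching over s3://.
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcherFactory
```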
@ashish: Is there any way to extract more than one field from a JSON column? jsonextractscalar only allows one field at a time. So, if I do select jsonextractscalar(jsonColumn, 'field1'), jsonextractscalar(jsonColumn, 'field2'), will it result in parsing the JSON document twice for each doc/row?
  @g.kishore: Parsing will probably happen twice but reading from disk will happen only once
  @ashish: Tried various things and figured one could do this: select jsonextractscalar(jsoncolumn, '$["f1", "f2"]')
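A sketch of both approaches, assuming a hypothetical table events with a JSON column jsonColumn. jsonExtractScalar also takes a results type (and an optional default value), so the multi-field JSONPath result is assumed here to come back as a single STRING that still needs parsing downstream.

```sql
-- Two separate extractions: per the thread, the column is read from disk once,
-- but the JSON document is likely parsed once per call.
SELECT
  jsonExtractScalar(jsonColumn, '$.field1', 'STRING', 'n/a') AS field1,
  jsonExtractScalar(jsonColumn, '$.field2', 'STRING', 'n/a') AS field2
FROM events;

-- Single call with a multi-field JSONPath: the matched fields come back
-- together as one STRING value.
SELECT
  jsonExtractScalar(jsonColumn, '$["field1","field2"]', 'STRING', '{}') AS field1And2
FROM events;
```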
@ashish: There does not seem to be a way to exclude properties in the JSON path expression used by jsonextractscalar. I guess the only way is to write my own jsonextractscalars that calls the json parser.delete(propertiesToDelete).read(propertiesToFetch). Is my understanding right? Any other suggestions?
  @g.kishore: Do you have an example of what you are trying to accomplish?
  @ashish: Basically, the JSON column is a flat map of string -> string, and I am trying to do a group by in two different ways: 1. group by key1, key2; 2. group by all other keys after excluding key1 and key2, where key1 and key2 are field key names in the flat-map-like JSON column. The key names depend on the filter being used, so I cannot convert key1 and key2 to static columns.
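For the first grouping, a sketch along these lines should work, assuming a hypothetical table events with a JSON column attributes holding the flat string -> string map, and literal key names key1 and key2. Per the thread, there is no built-in JSONPath way to express the second, exclusion-based grouping without custom code.

```sql
-- Group by two keys extracted from the flat string -> string JSON map.
-- The default value ('n/a') stands in for rows where the key is missing.
SELECT
  jsonExtractScalar(attributes, '$.key1', 'STRING', 'n/a') AS key1,
  jsonExtractScalar(attributes, '$.key2', 'STRING', 'n/a') AS key2,
  COUNT(*) AS cnt
FROM events
GROUP BY
  jsonExtractScalar(attributes, '$.key1', 'STRING', 'n/a'),
  jsonExtractScalar(attributes, '$.key2', 'STRING', 'n/a');
```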