Apache Pinot Daily Email Digest (2021-07-06)

Pinot Slack Email Digest Tue, 06 Jul 2021 19:00:35 -0700

#general

@benjamin.djidi: @benjamin.djidi has joined the channel
@trustokoroego: @trustokoroego has joined the channel
@trustokoroego: Hi Everyone :wave:
@alvaradojl1986: @alvaradojl1986 has joined the channel
@karinwolok1: Hey all! Join us for this meetup today! Starting in 1.5 hours! Presentations by @elon.azoulay and @jackie.jxt
@karinwolok1: In case you missed the meetup, you can watch it here! Slides from @elon.azoulay's presentation are also available in the description!
@ken: We generate OFFLINE segments via Hadoop, and sometimes these are updates to existing segments. In that case we want the segment names to match exactly (so that it’s an update). For most segments this is fine, as we partition by month. But there are cases where we also sub-partition by a non-date field. In this situation I don’t see a way to leverage the `SegmentNameGenerator` interface to give us a deterministic name. If we could key off of the input (CSV) file name then it would be easy, as we’ve got full control over that. Any ideas?
@mayanks: For REFRESH tables (which don't have a time column), the segment naming scheme is something like `<tableName>_idx`. Does that not work?
@mayanks: BTW, there's an issue opened recently about the exact same requirement as yours
@mayanks: Looking for contributions :wink:
@ken: No, because our segment names will be something like `<table name>_<country>_YYYY-MM` but for the US it’s `<table name>_us_YYYY-MM_idx`, e.g. `ads_us_2020-08_0`
@ken: For cases where we don’t have that final index (sub-partition) it’s easy to ensure exact name matching. But with the US data, we need to sub-partition by a field we use frequently in star tree indexes, so that we get maximum gain.
@ken: Thanks for the ref to the issue - yes, this is very similar to what we need.
@ken: Added some questions to the issue you referenced.
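A minimal sketch of the idea @ken describes above: deriving the segment name from the input CSV file name, so that re-processing the same file always yields the same name (and therefore an update). The helper `segment_name_from_input`, the file-naming convention, and the character-sanitizing rule are assumptions made for illustration; this is not Pinot's `SegmentNameGenerator` API.

```python
import re
from pathlib import PurePosixPath


def segment_name_from_input(table_name: str, input_path: str) -> str:
    """Derive a stable segment name from the input file name (hypothetical helper)."""
    # Drop the directory and the extension: ".../us_2020-08_0.csv" -> "us_2020-08_0"
    stem = PurePosixPath(input_path).stem
    # Assumption: keep only letters, digits, "_" and "-" so the name stays conservative.
    safe_stem = re.sub(r"[^A-Za-z0-9_-]", "_", stem)
    return f"{table_name}_{safe_stem}"


print(segment_name_from_input("ads", "/data/ads/us_2020-08_0.csv"))  # ads_us_2020-08_0
```

Because the name depends only on the table name and the file name, whoever controls the input file names also controls the segment names, which is the property asked for in the thread.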
@joshhighley: if a table exists for multiple tenants, is it possible to restrict query results to a single tenant?
@mayanks: What do you mean by tenant here?
@joshhighley: the Tenant component of Pinot
@joshhighley: we need to segregate client data
@joshhighley: well, looking at docs, can I specify multiple tenants when creating a table?
```
"tenants": {
  "broker": "myBrokerTenant",
  "server": "myServerTenant"
},
```
@mayanks: A table can only have one tenant for server and one for broker. A tenant can be shared across tables
@joshhighley: well, dang. So if we need to segregate data by client (tenant) then each table requires a unique name?
@mayanks: No it does not
@mayanks: You can have a single table on a single tenant and have all clients' data in the same table?
@joshhighley: no -- our clients don't like their data mixed.
@mayanks: Then have separate table per client?
@joshhighley: each client needs their own 'customer' table, as an example
@mayanks: Yeah so 1 client - 1 table - 1 tenant if you want complete separation
@mayanks: Not a scalable model perhaps
@mayanks: But seems like that is what your customers are asking for
@joshhighley: no, without multi-tenancy, each client would have their own environment. Each environment would have the same tables.
@mayanks: What is an environment? Helix cluster? If so, then two Helix clusters are completely air gapped and you are fine
@joshhighley: our hope was to have TenantA on BrokerA and ServerA with table 'Customers'. Then also, TenantB on BrokerB and ServerB with table 'Customers'...
@joshhighley: I was using 'environment' in a general sense: a set of servers.
@mayanks: To me it sounds like separate tables? If so, why does the name of the table need to be the same?
@mayanks: Because customers may end up having their own schema as well in future?
@mayanks: Note that you cannot have multiple tables with same name in one cluster
@joshhighley: because we have 100s of clients. Managing tables Customer_ClientA, Customer_ClientB, Customer_ClientC gets very cumbersome
@joshhighley: there's lots of tables for each customer also
@mayanks: I think you want same table across all customers but then no two customers can share the same set of brokers/servers?
@joshhighley: right. Their data needs to be kept separate
@mayanks: That is also not scalable if you have 100's of customers. For durability, you will end up having 3 brokers + 3 servers per customer, regardless of what amount of data they have.
@mayanks: One way is to partition the data on customerId. But that will segregate at partition level and not customer level.
@mayanks: Perhaps what customers really want is customer-level ACL?
@mayanks: If so, that can be built on a mid-tier layer on top of single table in Pinot?
@joshhighley: our customers are financial companies -- mixing data across those companies isn't an option
@mayanks: What you are trying to use the tenant concept in Pinot for is not what it is meant for, and it doesn't solve your problem.
@mayanks: A table in Pinot can only have one tenant for server and one for broker
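A rough sketch of the "mid-tier layer on top of a single table" idea @mayanks raises: every query passes through a helper that always adds a customer restriction. The names here are hypothetical (`scoped_query`, the `customerId` column, the `Customers` table, and the `balance`/`status` columns in the usage line), and the sketch assumes `extra_filter` comes from trusted mid-tier code, not from end users. It illustrates the pattern only and does not address the data-separation requirement @joshhighley describes.

```python
import re
from typing import Optional

# Assumption: customer IDs are simple tokens made of letters, digits, "_" and "-".
_CUSTOMER_ID = re.compile(r"[A-Za-z0-9_-]+")


def scoped_query(customer_id: str, select_list: str, table: str,
                 extra_filter: Optional[str] = None) -> str:
    """Build a query that is always restricted to one customer (hypothetical helper)."""
    # Reject IDs that could break out of the quoted literal.
    if not _CUSTOMER_ID.fullmatch(customer_id):
        raise ValueError(f"invalid customer id: {customer_id!r}")
    where = f"customerId = '{customer_id}'"
    if extra_filter:
        # Parenthesize the (trusted) filter so an OR inside it cannot widen the restriction.
        where = f"{where} AND ({extra_filter})"
    return f"SELECT {select_list} FROM {table} WHERE {where}"


print(scoped_query("clientA", "COUNT(*)", "Customers", "balance > 0 OR status = 'open'"))
```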
@pablomolnar: @pablomolnar has joined the channel
@karinwolok1: Don't miss these 3 awesome meetups next week: Presenters: @jackie.jxt @mayanks @kennybastani @tingchen and Gunnar Morling!
@yhao: @yhao has joined the channel
@b.gilbert: @b.gilbert has joined the channel

#random

@benjamin.djidi: @benjamin.djidi has joined the channel
@trustokoroego: @trustokoroego has joined the channel
@alvaradojl1986: @alvaradojl1986 has joined the channel
@pablomolnar: @pablomolnar has joined the channel
@yhao: @yhao has joined the channel
@b.gilbert: @b.gilbert has joined the channel

#feat-text-search

@b.gilbert: @b.gilbert has joined the channel

#troubleshooting

@benjamin.djidi: @benjamin.djidi has joined the channel
@trustokoroego: @trustokoroego has joined the channel
@prashant.pandey: Hi. We have a K8s Pinot deployment and some of our queries are taking > 10s. We found one conspicuous correlation during our investigation: latency spikes happen when there is also a spike in the YG GC count. In the following charts, spikes happened across the board at 15:28. Does this indicate a possible GC issue?
@mayanks: Need more info. Is this server side? What’s the read qps, and data size on server? What’s the heap size? What kind of queries
@mayanks: What version of Java
@alvaradojl1986: @alvaradojl1986 has joined the channel
@pablomolnar: @pablomolnar has joined the channel
@yhao: @yhao has joined the channel
@b.gilbert: @b.gilbert has joined the channel

#pinot-dev

@atri.sharma: @mayanks @g.kishore I am looking to support nulls in aggregates (a common use case for us). Is there a place where I can get prior thoughts and research, and potential starting ideas?
@mayanks: @atri.sharma there has been some work done with null support in the past, perhaps we can start from where that discussion ended cc @jackie.jxt @chinmay.cerebro
@jackie.jxt: @atri.sharma Does putting a null filter work for your use case? E.g. `SELECT SUM(col) FROM table WHERE col IS NOT NULL`?
@jackie.jxt: The main reason why we didn't directly support nulls in aggregates is because of the performance overhead of per-value null check, and forcing us to use `Object[]` instead of primitive array
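To make the workaround above concrete, here is a small sketch of "filter nulls first, then aggregate", which is what `SELECT SUM(col) FROM table WHERE col IS NOT NULL` expresses. The function `sum_not_null` is a hypothetical illustration in plain Python, not Pinot code; it only shows that the aggregation step itself never has to handle a null once the filter has run.

```python
from typing import Iterable, Optional


def sum_not_null(values: Iterable[Optional[float]]) -> float:
    """Sum only the non-null values, mirroring a `col IS NOT NULL` filter."""
    # Filter first, then aggregate: the summation never sees a None.
    # In this sketch an input with no non-null values yields 0.0.
    return sum((v for v in values if v is not None), 0.0)


print(sum_not_null([1.0, None, 2.5]))  # 3.5
```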
@madhu.sling: @madhu.sling has joined the channel

#community

@vaibhav.mital: @vaibhav.mital has joined the channel
@b.gilbert: @b.gilbert has joined the channel

#announcements

@b.gilbert: @b.gilbert has joined the channel

#multiple_streams

@b.gilbert: @b.gilbert has joined the channel

#presto-pinot-connector

@ojasmulay: @ojasmulay has joined the channel

#pinot-perf-tuning

@b.gilbert: @b.gilbert has joined the channel

#getting-started

@madhu.sling: @madhu.sling has joined the channel