#general


@ysuo: Hi team, is it a requirement to enable partitioning in Pinot to use upsert feature?
  @ysuo: I mean, if I had set "routing": { "instanceSelectorType": "strictReplicaGroup" } and "upsertConfig": { "mode": "FULL" } in the table config and set a primary key in the schema, will upsert take effect without setting segmentPartitionConfig?
  @mayanks: Yes partitioning is a requirement
  @ysuo: At the moment, does Pinot set a default segmentPartitionConfig if it’s missing in the table config for upsert feature?
  @mayanks: no
  @mayanks:
  @ysuo: I think there is no segmentPartitionConfig in this example?
  @mayanks: SegmentPartitionConfig is separate from upsert
  @mayanks: For upsert, the requirement is that the upstream is partitioned by the upsert primary key
  @mayanks: SegmentPartitionConfig is for specifying partitioning that was done upstream (what function was chosen, etc.). This is used during query execution to only query the partitions relevant to the key in the query. It is separate from upsert, and not needed for upsert.
  @ysuo: I see. Thanks.
  @ysuo: The following config is just for querying?
  @zaikhan: You need to push data to Kafka with a key that is also the primary key in the Pinot schema. For example, if there are two columns `a` and `b` in the primaryKeys of the Pinot schema, then in your Kafka producer (maybe a Flink, Spark, or any other job) you need to use both the `a` and `b` attributes of the Kafka message as the partitioning key, so that messages land on the same Kafka partition for specific values of `a` and `b`
  @ysuo: Yes, we’re working on it. Thanks. @zaikhan
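A minimal sketch of the keying approach @zaikhan describes above, assuming a plain Java Kafka producer; the broker address, topic name, serializers, and the column names `a` and `b` are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UpsertKeyedProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      String a = "user-42";    // first primary-key column
      String b = "2022-05-01"; // second primary-key column
      // Use the composite primary key as the Kafka message key so that every
      // record with the same (a, b) lands on the same partition.
      String key = a + "|" + b;
      String value = "{\"a\":\"" + a + "\",\"b\":\"" + b + "\",\"metric\":123}";
      producer.send(new ProducerRecord<>("my-upsert-topic", key, value));
    }
  }
}
```

With the default partitioner, records that share a key always hash to the same partition, which is what the upsert feature relies on.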
@cedric.barbin: @cedric.barbin has joined the channel
@jjoinme: @jjoinme has joined the channel
@sunmeet.singh1130: @sunmeet.singh1130 has joined the channel
@wajdi1077: @wajdi1077 has joined the channel
@jplane.tech: @jplane.tech has joined the channel
@jplane.tech: What are segments comprised of?
@jplane.tech: I’m interested in building a segment fetcher that builds virtual segments on the fly from an OLTP transaction log.
  @mayanks: You just need to implement record reader interface for your format and the rest of segment generation will happen using existing code
  @mayanks: Although I didn’t fully get what you mean by virtual segment
  @ken: @mayanks I think @jplane.tech is talking about creating on-demand segments that are backed by the transaction log. Though there’s the triggering of the build…e.g. you could have some kind of smart web service fronting this dynamic builder, so that when an HTTP request is made to get a segment out of deep storage, it handles triggering the segment build (if needed).
  @mayanks: I guess I am unfamiliar with the business use case. Why would we do segment builds on-demand at query time? That would be super slow
  @ken: I can’t speak to Joe’s use case, but I did something similar a while ago for a client. They had a lot of data stored in Parquet format, and needed to be able to arbitrarily query a sub-set of it for analytics. So rather than spend the up-front time to turn everything into the proper back-end format (Lucene, as they were using Elasticsearch for the query analytics), we wrapped the Parquet file with an interface that supported the Lucene calls used by ES. To do that we had to build some data structures (mostly bit sets) and cache those.
  @ken: So there was latency on the first query for a particular sub-set of data, but after that it was fast.
  @mayanks: Ah I see.
  @mayanks: Thanks for the context @ken
  @jplane.tech: Fantastic question. Let me back up. I think I'm conflating the noun “segments” with what it means in my world. We have many databases, each of which has a durable transaction log. We want to support ad hoc queries of aggregated metrics derived from this data. We would like to avoid duplicating this data somewhere else just to do the aggregations (e.g. Spark). I think the way to accomplish this integration is through some kind of SPI plugin, but am unsure. What would you recommend?
  @ken: If you want to use Pinot for this, the simplest approach would be to have some daemon process that converts transaction log entries into records that it pushes to a Kafka topic. Then set up Pinot to consume from that topic and update a realtime table.
  @ken: If you want to avoid having to use Kafka, then you could have a daemon that periodically creates segments, saves them someplace reachable from your Pinot cluster, and does a “metadata push” to tell Pinot you have a new segment for that table.
  @ken: For that second approach, you could generate CSV files that are then processed as-is by the Pinot admin tool to build segments. Or you could (as @mayanks noted) write a record reader that is then used by the Pinot admin tool to directly read from the transaction log files.
  @jplane.tech: Could you point to the API docs for the “metadata push”? I'll evaluate these approaches, thanks for the help!
  @ken: See
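A rough skeleton of the record-reader approach @mayanks and @ken describe, assuming the `RecordReader` SPI shape in recent Pinot releases (`org.apache.pinot.spi.data.readers.RecordReader`; check the interface in your version). The one-entry-per-line `column=value;column=value` log format here is invented purely for illustration:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;

import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.data.readers.RecordReader;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;

// Sketch: turn transaction-log entries into Pinot rows so the standard
// segment-generation code can build segments from them.
public class TxnLogRecordReader implements RecordReader {
  private File _dataFile;
  private BufferedReader _reader;
  private String _nextLine;

  @Override
  public void init(File dataFile, Set<String> fieldsToRead, RecordReaderConfig config)
      throws IOException {
    _dataFile = dataFile;
    _reader = new BufferedReader(new FileReader(dataFile));
    _nextLine = _reader.readLine();
  }

  @Override
  public boolean hasNext() {
    return _nextLine != null;
  }

  @Override
  public GenericRow next() throws IOException {
    return next(new GenericRow());
  }

  @Override
  public GenericRow next(GenericRow reuse) throws IOException {
    reuse.clear();
    // Hypothetical log format: "col1=v1;col2=v2;..."
    for (String kv : _nextLine.split(";")) {
      String[] parts = kv.split("=", 2);
      reuse.putValue(parts[0], parts[1]);
    }
    _nextLine = _reader.readLine();
    return reuse;
  }

  @Override
  public void rewind() throws IOException {
    _reader.close();
    _reader = new BufferedReader(new FileReader(_dataFile));
    _nextLine = _reader.readLine();
  }

  @Override
  public void close() throws IOException {
    _reader.close();
  }
}
```

A reader like this can then be handed to the existing segment-generation tooling, as @mayanks notes above.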
@mdai: @mdai has joined the channel

#random


@cedric.barbin: @cedric.barbin has joined the channel
@jjoinme: @jjoinme has joined the channel
@sunmeet.singh1130: @sunmeet.singh1130 has joined the channel
@wajdi1077: @wajdi1077 has joined the channel
@jplane.tech: @jplane.tech has joined the channel
@mdai: @mdai has joined the channel

#troubleshooting


@ysuo: Hi, what’s the best type choice for current in Pinot? Is it STRING?
  @richard892: can you paraphrase your question? What problem are you trying to solve?
  @ysuo: sorry, it’s a typo. I mean, currency or money.
  @ysuo: For mysql, it’s decimal type.
  @richard892: ok
  @richard892: do you have multiple currencies or only one?
  @richard892: there isn't a built in data type for this, but if you convert into cents you can store as a `LONG`
  @richard892: however, aggregation will be performed using floating point numbers, so it is inexact
  @kharekartik: We have support for `SUM_WITH_PRECISION` but it only works for big decimal columns, which are currently stored as byte[]. For better performance, we already have an effort in progress to add `DECIMAL` with fixed-precision data type support
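A tiny illustration of the cents-as-`LONG` convention @richard892 suggests, using `BigDecimal` so the conversion itself stays exact; only the final aggregate needs to be scaled back to a currency amount:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class CurrencyAsCents {
  // Convert a decimal currency amount (e.g. "19.99") into integer cents
  // before ingesting it into a LONG column.
  static long toCents(BigDecimal amount) {
    return amount.setScale(2, RoundingMode.HALF_UP)
                 .movePointRight(2)
                 .longValueExact();
  }

  // Convert an aggregated cents value (e.g. the result of a SUM) back to currency.
  static BigDecimal fromCents(long cents) {
    return BigDecimal.valueOf(cents).movePointLeft(2);
  }

  public static void main(String[] args) {
    long cents = toCents(new BigDecimal("19.99"));  // 1999
    System.out.println(cents);
    System.out.println(fromCents(cents + 1));       // 20.00
  }
}
```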
@cedric.barbin: @cedric.barbin has joined the channel
@saumya2700: Hi all, if segments are offline, how do we make them online? One of the tables had segments in OFFLINE state and recovered automatically, but the segments of the other three tables are still in OFFLINE state.
  @mayanks: Check the debug API in Swagger. Typically this implies an underlying error that the debug endpoint might surface. You can also look at the server logs for those segment names. Restarting the servers usually helps
@jjoinme: @jjoinme has joined the channel
@luisfernandez: hey all, we continue to have issues with ZooKeeper on GKE sadly. Our sandbox environment got its disk space filled up. Does anyone know how to recover from this scenario? Pretty much the entire system is sad at the moment
  @g.kishore: make sure that you are purging the transaction logs and snapshots on the ZooKeeper machines
  @diogo.baeder: I don't know how to recover, but I had the same problem and fixed it by actually configuring a proper deep store with S3, so that my segments don't end up being all in local disks.
  @g.kishore:
  @luisfernandez: we have deep storage retention for offline servers 2 years, retention for only 7 days
  @g.kishore: if you clean up the space and restart, everything should come back up
  @luisfernandez: we have that setup
  @luisfernandez: ```
clientPort=2181
dataDir=/data/snapshot
dataLogDir=/data/log
tickTime=2000
initLimit=10
syncLimit=10
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
4lw.commands.whitelist=*
server.1=pinot-zookeeper-0.pinot-zookeeper-headless.pinot.svc.cluster.local:2888:3888
server.2=pinot-zookeeper-1.pinot-zookeeper-headless.pinot.svc.cluster.local:2888:3888
server.3=pinot-zookeeper-2.pinot-zookeeper-headless.pinot.svc.cluster.local:2888:3888
```
  @luisfernandez: the autopurge
  @g.kishore: ```dataDir=/data/snapshot dataLogDir=/data/log```
  @g.kishore: what's the size of this and is it filled up?
@sunmeet.singh1130: @sunmeet.singh1130 has joined the channel
@wajdi1077: @wajdi1077 has joined the channel
@tonykim: Hi all, I'm trying to do batch ingestion with Spark according to this. It seems that the current Pinot version 0.10.0 doesn't include some dependencies, and the documentation recommends using `0.11.0-SNAPSHOT` instead (I got some runtime issues - ClassNotFoundException). Does anyone know how I can find the 0.11.0-SNAPSHOT binary?
  @tonykim: I found that the following slack troubleshooting discussion was similar to my issue.
  @mayanks: @kharekartik
@jplane.tech: @jplane.tech has joined the channel
@mdai: @mdai has joined the channel

#pinot-dev


@mayanks: Are we good with this PR functionally?
  @walterddr: looks good to me mostly, will do a sweep today
  @mayanks: Thanks

#getting-started


@cedric.barbin: @cedric.barbin has joined the channel
@jjoinme: @jjoinme has joined the channel
@sunmeet.singh1130: @sunmeet.singh1130 has joined the channel
@wajdi1077: @wajdi1077 has joined the channel
@jplane.tech: @jplane.tech has joined the channel
@mdai: @mdai has joined the channel

#introductions


@cedric.barbin: @cedric.barbin has joined the channel
@jjoinme: @jjoinme has joined the channel
@sunmeet.singh1130: @sunmeet.singh1130 has joined the channel
@wajdi1077: @wajdi1077 has joined the channel
@jplane.tech: @jplane.tech has joined the channel