#general
@gxm.monica: hey everyone, what size of data do you store in Pinot, how many machines do you use, and what are the machine configurations like? Our current business is about PB-scale, but we store it differently from Pinot: we use HBase to store each field's inverted index and write the row position into another HBase table, then fetch the filtered records from HDFS. We use some techniques to reduce random IO, like compression, encoding, storing data in batches, caching, etc. Because our data is stored in a row format, it's really bad when a query hits a large number of results. As far as I know, when a query needs to read large segments (if it can't prune data via partitioning, star-tree, ...), is that painful for Pinot, because Pinot may need to download lots of segments from the segment store and rebuild each segment's indexes in the servers' memory?
@mayanks: Hi, there is a wide variety of use cases that Pinot powers, across varying data and cluster sizes. Happy to help understand your use case better and provide suggestions
@mayanks: Today, Pinot serving nodes maintain a local copy of the segments for serving. So there is no download involved in the query path.
@gxm.monica: Currently we use Presto to execute queries on our storage: first we get the query's record ids after our inverted-index filtering, then fetch the full matching records into Presto and do the rest of the execution. Most of our use cases look like `select * from table where day between (2017,2022) and fieldA like *AAA*`; there are also some aggregation queries, but it's painful when the number of matching records is huge, because of random IO. As far as I know, Pinot uses techniques like rich indexes and segment assignment. I'm wondering, if our queries match lots of segments, do we need to hold lots of segments in server memory? And after a query, what is kept for a segment on the server side, e.g. the different indexes? I guess lots of scanned data will hurt caches too..
@mayanks: Servers do have a copy of segments on the local disk, and memory map them. So during query execution, whatever is needed is pulled into memory from local disk (not deep-store).
@mayanks: Pinot’s indexing techniques will avoid pulling/reading any data that is not relevant for executing queries.
@gxm.monica: I see, thanks a lot :)
@saumya2700: @saumya2700 has joined the channel
@weixiang.sun: When ingesting streaming data from Kafka, how do I concatenate an array of strings from one source column into a destination column as part of the ingestion configuration?
@mayanks: You can write a Groovy function for that?
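@mayanks: Roughly like this in the table config, as an untested sketch (`tags` and `tagsJoined` are made-up column names, and depending on your Pinot version Groovy transforms may need to be enabled):
```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "tagsJoined",
        "transformFunction": "Groovy({tags.join(',')}, tags)"
      }
    ]
  }
}
```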
@g.kishore: maybe add a Joiner UDF
@srishb: @srishb has joined the channel
@gxm.monica: Hey everyone, is there any configuration to persist the inverted index, bloom filter, etc. in a segment? If so, will a server use less memory when reading the inverted index, i.e. can the server hold only the inverted index or bloom filter of a segment in memory?
@mayanks: Indexes are part of the persisted segment and are memory mapped, which means they are paged in and out as needed.
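@mayanks: For reference, these indexes are declared in the table config, something along these lines (a sketch; `fieldA` is just a placeholder column):
```json
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["fieldA"],
    "bloomFilterColumns": ["fieldA"],
    "createInvertedIndexDuringSegmentGeneration": true
  }
}
```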
@pavel.stejskal650: Hello! What’s an efficient way to filter records in the case of multi-valued columns, e.g. List<String> with n=4, where we want to filter all documents by 1 value in the value set? Are the forward and inverted indexes efficient here? Or is it better to split the multi-value column into more columns? What’s the recommended design for: 1. fixed N, 2. variable N (e.g. 1 to 10)? Thank you
@mayanks: The inverted index works for MV columns as well. You want to make sure that the semantics of MV columns serve your purpose; if so, you can use it. For example, a row will match if at least one value in the MV column matches the filter.
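@mayanks: e.g. a filter like this (table/column names are just placeholders):
```sql
-- matches every row where at least one value in the MV column "tags" equals 'foo'
SELECT COUNT(*) FROM myTable WHERE tags = 'foo'
```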
@gxm.monica: Hey everyone, I found that the Pinot text index only supports the standard analyzer. Is there any plan to support custom analyzers, like Elasticsearch does? Or could you give me some advice on how best to support it, if we build this feature?
@mayanks: Not at the moment. Could you describe the use case where you would need this in Pinot?
@ken: @gxm.monica we worked around this limitation by doing the analysis in a Flink workflow, and using the resulting terms in a multi-valued string field that we used for queries (filtering). It doesn’t do true phrases, but we generate both one- and two-term strings, and we do the same analysis on the user query, so it (almost) eliminates any false positives.
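@ken: So at query time it ends up as a plain filter on the shingled MV column, roughly like this (illustrative only, table/column names are made up):
```sql
-- "terms" holds the one- and two-term strings produced by the Flink analysis
SELECT * FROM docs WHERE terms = 'machine learning'
```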
@mayanks: Thanks @ken, always appreciate your help.
@g.kishore: Thanks Ken. For my own understanding, what’s the use case for custom analyzers? It should be easy to make the analyzer pluggable.
@jkinzel: @jkinzel has joined the channel
#random
@saumya2700: @saumya2700 has joined the channel
@srishb: @srishb has joined the channel
@jkinzel: @jkinzel has joined the channel
#troubleshooting
@saumya2700: @saumya2700 has joined the channel
@srishb: @srishb has joined the channel
@yash.agarwal: I have 20 data nodes in my cluster, but all the queries are only using 19 nodes. All 20 nodes are enabled and have segments assigned. This is happening for all the tables. What can I do to troubleshoot?
@mayanks: Is that server online in the external view? Do the queries need the data on that server to generate correct results? Are the results correct?
@awadesh.kumar: Hi all, I deleted all the segments from a Pinot table using the endpoint below: `http://{base_url}/segments/trips?type=REALTIME&retention=0d` Now the Pinot table has stopped receiving data from the Kafka topic. We didn't change anything in the table configuration. Any possible reason for this?
@mayanks: What was the reason for deleting all data? And where do you want Pinot to start consuming from? As a quick fix you can try to delete and recreate the table.
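@mayanks: Sketch of what that looks like against the controller API (substitute your own controller host and table config file):
```bash
# drop the realtime table (its segments are already deleted)
curl -X DELETE "http://{base_url}/tables/trips?type=realtime"

# recreate it with the same table config so consumption restarts
curl -X POST "http://{base_url}/tables" \
  -H "Content-Type: application/json" \
  -d @trips_realtime_table_config.json
```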
@mayanks: Cc: @npawar @navina perhaps we should not allow deleting of consuming segments by default?
@awadesh.kumar: Hi Mayank, we set `singleValueField` to `false` for a field of the table. There were negative values (the default long value) for this field in the table, which are invalid in our use case. So for this clean-up we deleted all the segments, as it was test data. After the segment deletion, data is not getting consumed into the table. I know it wouldn't be a recommended approach for a live system. Is there any better way to update/delete specific records from the query console?
@mayanks: What you are describing is a re-bootstrap case. For offline tables you can simply push data and overwrite. For real-time tables, there is no push support yet (coming soon), so for such backward-incompatible changes this is the only way I can think of
@awadesh.kumar: ok thanks for the update. Any idea why data is not getting consumed into the table? Any fix other than recreating the table?
@npawar: The last segment of each partition is in consuming state. If the consuming segment is deleted, consumption stops. We have an open issue to add a recovery mechanism for this.
@npawar:
@luisfernandez: bumping this for your thoughts
@jkinzel: @jkinzel has joined the channel
