#general
@srini: @srini has joined the channel
@karinwolok1: We have a lot of new :wine_glass: community members in 2021! :wave: Welcome!!! We're curious to know what brought you here! :smiley: Please introduce yourself in this thread! Also, if you have any technical questions, you can ask in <#C01H1S9J5BJ|getting-started> or <#C011C9JHN7R|troubleshooting> @romualdo.gobbo @tsajay101 @lvs.pjx @sri @valentin @aliouamardev @pandey.mayuresh367 @sankalp.jain02 @vinulam @rchandel @gamparohit @john @egala @apandhi @sandun.wed @zxcware @ntyrewalla @kizilirmak.mustafamer @srini
@srini: Welcome Pinot community! :pinot: I come from the Apache Superset community, invited by @kenny I’m a huge viz nerd but also have a stats / ML background. Happy to answer any Pinot <> Superset questions! I also spent the last ~5 years building Dataquest (online learning platform for learning data science) and am always open to discussing careers in data science!
@ranemihir45: @ranemihir45 has joined the channel
#random
@srini: @srini has joined the channel
@ranemihir45: @ranemihir45 has joined the channel
#feat-text-search
@pabraham.usa: Hello, my text index somehow stopped working; it is now giving intermittent results. For example, the following works: `select * from mytable where regexp_like(log, '0D82F520-62C8-9914-14B8-4C2331E54075')`
@pabraham.usa: But this one does not: `select * from mytable where text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')`
@pabraham.usa: Any pointers on how to debug?
@g.kishore: Can you post this on <#C011C9JHN7R|troubleshooting>
@pabraham.usa: ok will do
#troubleshooting
@pabraham.usa: Hello, my text index somehow stopped working; it gives results for some search data but not for all. For example, the following works: `select * from mytable where regexp_like(log, '0D82F520-62C8-9914-14B8-4C2331E54075')` but this one does not: `select * from mytable where text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')`. Any pointers on how to debug?
@g.kishore: There was some bug with stop words @steotia ^^
@g.kishore: Is it tokenizing on `-`?
@pabraham.usa: How can I find out whether it is tokenizing? It seems like some data is not going into the text index
@g.kishore: Can you try text match without -
@pabraham.usa: That's also not working; text match with `-` will work for some, though
@g.kishore: `select * from mytable where text_match(log, '0D82F520')`
@g.kishore: I see
@g.kishore: Does this work?
@pabraham.usa: no this will not work for that particular data
@pabraham.usa: but will work for some other data
@pabraham.usa: In my testing all of these were working before on an old index, or it could be that I just started a bit more extensive testing
@pabraham.usa: I deleted the entire cluster and recreated again, but still no luck
@g.kishore: I don’t think that will fix it, we will try a quick test of a fix and get back... can you file an issue?
@pabraham.usa: sure will do that now
@g.kishore: Looks like a bug to me... is this the latest version?
@pabraham.usa: yes I upgraded to latest because of this
@g.kishore: Okay...
@steotia: Hi @pabraham.usa, the other day this was the query cache issue
@steotia: Which you had accidentally enabled on your text index
@steotia: And that was leading to incorrect results. I had suggested to disable it
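(For reference: the Lucene query result cache mentioned above is controlled per column in the table's `fieldConfigList`. A minimal sketch of a text-index field config with the cache left disabled, assuming the `enableQueryCacheForTextIndex` property name and using the `log` column from the queries above; verify the property name against your Pinot version.)
```
{
  "fieldConfigList": [
    {
      "name": "log",
      "encodingType": "RAW",
      "indexType": "TEXT",
      "properties": {
        "enableQueryCacheForTextIndex": "false"
      }
    }
  ]
}
```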
@steotia: Also you need to enclose the search string as a phrase
@steotia: This was another issue with your queries as they were matching incorrect documents.
@steotia: If you don't use a phrase, the terms will get tokenized around the hyphen
@steotia: and it will become an OR-based term query
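(To illustrate the difference described above, a sketch against the `mytable`/`log` column from the earlier queries: with the phrase form the whole ID must appear as one sequence, while the unquoted form is tokenized around the hyphens and matched as an OR of the individual terms.)
```
-- phrase query: the whole ID must appear as a single sequence
select * from mytable where text_match(log, '"0D82F520-62C8-9914-14B8-4C2331E54075"')

-- term query: tokenized around the hyphens, becomes an OR of the terms
select * from mytable where text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')
```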
@pabraham.usa: That's correct, this issue is different
@pabraham.usa: I have cache disabled and also searching with quotes
@pabraham.usa: like
@pabraham.usa: `select * from mytable where regexp_like(log, '\"0D82F520-62C8-9914-14B8-4C2331E54075\"')`
@pabraham.usa: The issue is that for some IDs nothing is returned; it seems like they are not in the text index at all
@pabraham.usa: After a bit more analysis it looks like the query is fine, however for the text index the results only start to appear after a while. And it seems the text index is skipping segments with status CONSUMING/IN-PROGRESS.
@pabraham.usa: Wondering whether this is a bug or I am missing some setting to enable near-real-time searches
@g.kishore: That’s a bug
@contact: Hey, quick question: we wrote our own plugin for realtime ingestion with Google Pub/Sub, and in our test we always get one realtime segment per server, even though we configured 1 replica per partition (the stream is high level). Does anyone have an idea? Our ideal setup would be to only have one (so no replica)
@contact: If that helps, we have open-sourced the plugin here:
@mayanks: Why not contribute this to the main Pinot project?
@contact: We still haven't put anything in our prod env, so I believe it's a little bit early
@mayanks: If you open a PR against the main repo, you might get early feedback as well.
@contact: We are still not sure if we want to commit to GCP Pub/Sub either; is it still worth upstreaming it if we don't use it ourselves?
@mayanks: Yeah, as long as the impl is good, I am sure someone else might find it useful.
@g.kishore: That’s expected with high level stream consumer
@contact: Not sure I understand why?
@g.kishore: high level stream combines all partitions of a stream into one stream. splitting it into multiple segments will result in inconsistency and data duplication
@g.kishore:
@g.kishore: this video explains the problems with high level stream consumer and why we chose to implement partition level consumer
@contact: > splitting it into multiple segments will result in inconsistency and data duplication
Well, I agree on this one, that's why I don't get why we have multiple segments
@g.kishore: Multiple parallel segments are required for scaling
@g.kishore: If the event rate is in the 100’s, splitting is not needed
@g.kishore: But once you reach thousands it helps
@g.kishore: Also, it’s the unit of parallelism at query time
@g.kishore: It just gives you more options as you scale on ingestion or on query side
@contact: Most of our segments will not be getting more than 500 events/s (and if so, that would last only a few minutes)
@contact: I don't see where I can force it to have only one segment for the realtime table
@ssubrama: @contact we also had multiple operational issues with high level streams. Consider the case when you have 4 replicas and one of the hosts goes down. You will need to bring up a new host and wait until it catches up with the latest offset before you can send queries to it. We also had operational issues when hosts were mistakenly tagged with the same tag, thus splitting the stream between the two.
@ssubrama: I don't know what you mean by "force to have only one segment". For high level stream consumption, each consumer builds its own segments and keeps them locally, since it can never be guaranteed that the rows consumed by one replica are the same as the rows consumed by any other
@contact: I meant to only have one consumer
@contact: From my understanding I have 3 segments (one on each of my servers), so I get 3 different consumers
@ssubrama: If you are using high level consumers, as I understand you do, then you should have one segment in progress and the others completed. The older segments will be removed when the retention time is over
@contact: I do use high level consumers; I got 3 realtime segments (for the same realtime table), all of them in progress
@contact: Is there any other place that I can check to verify I have only one consuming segment?
@g.kishore: Whatever you are seeing is the expected behavior... can you paste your table config?
@contact:
```
{
  tableName: XXXXXX,
  tableType: 'REALTIME',
  quota: {},
  routing: {},
  segmentsConfig: {
    schemaName: YYYYY,
    timeColumnName: ZZZZZ,
    timeType: ZZZZZ,
    replication: 1,
    replicasPerPartition: 1,
    segmentPushType: 'APPEND',
    segmentPushFrequency: 'HOURLY'
  },
  tableIndexConfig: {
    streamConfigs: {
      'streamType': 'pubsub',
      'stream.pubsub.consumer.type': 'highlevel',
      'stream.pubsub.decoder.class.name': 'com.reelevant.pinot.plugins.stream.pubsub.PubSubMessageDecoder',
      'stream.pubsub.consumer.factory.class.name': 'com.reelevant.pinot.plugins.stream.pubsub.PubSubConsumerFactory',
      'stream.pubsub.project.id': XXXXXX,
      'stream.pubsub.topic.name': 'unused', // unused but required because the plugin extends the kafka one
      'stream.pubsub.subscription.id': ZZZZZ,
      'realtime.segment.flush.threshold.time': '15d',
      'realtime.segment.flush.threshold.rows': '390000' // 390k rows ~ 200MB (513 bytes / row)
      // 'realtime.segment.flush.threshold.segment.size': '200M' (this option needs `realtime.segment.flush.threshold.rows` to be 0 and doesn't work in 0.6.0: `Illegal memory allocation 0 for segment ...`)
    },
    nullHandlingEnabled: true,
    invertedIndexColumns: [],
    sortedColumn: [],
    loadMode: 'mmap'
  },
  tenants: {},
  metadata: {}
}
```
@contact: From the docs: ```Depending on the configured number of replicas, multiple stream-level consumers are created, taking care that no two replicas exist on the same server host. Therefore you need to provision exactly as many hosts as the number of replicas configured.```
@contact: However, in our setup we have one replica with one partition, so I expect to only have one segment (and thus one consumer).
@srini: @srini has joined the channel
@mohammedgalalen056: @mohammedgalalen056 has joined the channel
@ranemihir45: @ranemihir45 has joined the channel
#discuss-validation
@mohammedgalalen056: I've updated the schema in the docs
@chinmay.cerebro: I'll review it today
@chinmay.cerebro: thanks Mohammed !
#getting-started
@mohammedgalalen056: @mohammedgalalen056 has joined the channel
@ranemihir45: @ranemihir45 has joined the channel
