#general
@balci: @balci has joined the channel
@neer.shay: @neer.shay has joined the channel
@mike: @mike has joined the channel
@pabraham.usa: Hi, I have a text index up and running and it appears to be working. However, I noticed that some results are incorrect. For example, when I search for `40F916FD-F2A7-2255-FEFB-B43050D8A5EE`, I get results for `81753586-72E1-8DC1-FEFB-08DB16E6A793` and `40F916FD-F2A7-2255-FEFB-B43050D8A5E`. Trying to understand why that is. Also, if I try to search for XML tags like `</ns1:requestControlID>`, it throws an error. Is there any setting I can enable to make these searches work?
@mayanks: @steotia ^^
@pabraham.usa: I tried adding an escape character and the XML tag search now works. However, I still face the first issue, where the alphanumeric-with-hyphens search returns wrong results!
@steotia: Hi @pabraham.usa can you point me to your queries?
@pabraham.usa: @steotia Thanks, please see the query: ```select DATETIMECONVERT((timemillis/1000), '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss tz(America/New_York)', '1:SECONDS'),log from local.mytable where TEXT_MATCH(log,'40F916FD-F2A7-2255-FEFB-B43050D8A5EE')```
@pabraham.usa: I tried without the date formatting and still got the same result: ```select timemillis,log from local.mytable where TEXT_MATCH(log,'40F916FD-F2A7-2255-FEFB-B43050D8A5EE')```
@pabraham.usa: another example is `75BC2F92-68A6-9237-6101-B6A778F602B8` matching `65FC22F3-68A6-B63C-31F4-AFE236D5DCFF` and `75BC2F92-68A6-9237-6101-B6A778F602B8`
@steotia: "-" is used as a word/term separator during index generation. So essentially the original string is tokenized and FEFB is one of the 5 terms that gets indexed Now on the query side, the same text parser and analyzer is used to tokenize the search query and it essentially becomes a multi term query with OR operator `(40F916FD OR F2A7 OR 2255 OR FEFB OR B43050D8A5EE)` This is the reason why both strings are matching since both of them have the term FEFB You should change the query for Lucene to take it as a phrase query `WHERE TEXT_MATCH(log, '\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')` this will take the search string as one single phrase and will only match documents containing this as is.
@steotia: @pabraham.usa ^^
@steotia: I have verified this locally
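The tokenization described above can be reproduced directly with Lucene's `StandardAnalyzer` (to my knowledge, the default analyzer for Pinot's text index). A minimal sketch; the field name "log" is arbitrary here:
```
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    try (StandardAnalyzer analyzer = new StandardAnalyzer()) {
      TokenStream stream = analyzer.tokenStream("log",
          new StringReader("40F916FD-F2A7-2255-FEFB-B43050D8A5EE"));
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // Prints five lowercased terms: 40f916fd, f2a7, 2255, fefb, b43050d8a5ee
        System.out.println(term);
      }
      stream.end();
      stream.close();
    }
  }
}
```
Since FEFB is indexed as its own term, any row containing FEFB matches the unquoted query; quoting the string turns it into a phrase query that requires all five terms in sequence.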
@pabraham.usa: @steotia, I tried it, but the phrase query gives me the same results
@steotia: It worked for me. How are you using the phrase query?
@pabraham.usa: When I changed it to this, it started working: `select timemillis,log from local.mytable where TEXT_MATCH(log,'40F916FD AND F2A7 AND 2255 AND FEFB AND B43050D8A5EE')`
@steotia: Yes, this works too, and the phrase query works as well: `WHERE TEXT_MATCH(log, '\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')`
@steotia: enclosing the search string in double quotes
@pabraham.usa: @steotia I am trying it now as ```select timemillis,log from local.mytable where TEXT_MATCH(log,'\"40F916FD-F2A7-2255-FEFB-B43050D8A5EE\"')```
@pabraham.usa: However, I still get the same results
@pabraham.usa: I mean it gives me multiple matches
@steotia: can you list a few matches?
@pabraham.usa: sure
@pabraham.usa: so when I try ```select timemillis,log from local.mytable where TEXT_MATCH(log,'\"DAE49404-3788-B121-B37A-B901D4891622\"')```
@pabraham.usa: I get results matching `2F440BB9-B37A-C808-9DF3-A4BD90954C25`, `DE5A2A09-3788-A426-F194-A72C82F65C33`, `03D81725-280E-B37A-C22C-BDDAF65E3525`, and `DAE49404-3788-B121-B37A-B901D4891622`
@steotia: let me try to create a unit test out of this data and see what happens
@steotia: So I tried a test using the exact same data:
• Using the phrase query matches the document exactly once, as expected
• Using the regular query matches multiple documents due to the splitting of terms, as explained above
Wondering why it is different for you. I had recently fixed a text index metadata bug that can lead to incorrect results. It is not related to the query type; maybe you just happened to hit that bug. This can typically happen with a certain structure of the text index directory. Maybe we can hop on a call and you can show me the index directory. Also, what are the exact contents of the log column in the rows for 2F440BB9-B37A-C808-9DF3-A4BD90954C25 etc.? Is it just this or some other text as well?
@pabraham.usa: @steotia Thanks for testing it. I am using the Docker image from apachepinot/pinot; I will try with the latest one. I assume I might have to re-index?
@pabraham.usa: The logs contain lots of data
@pabraham.usa: eg:- ```21:44:01,542 INFO filters.ServiceBoundryInitFilter - [2F440BB9-B37A-C808-9DF3-A4BD90954C25] Begin Transaction @ 12/2/20 4:44 PM, associated with session [76FE276697C037D04FEFA621C942BBB1]. 21:44:01,544 DEBUG authentication.AnonymousAuthenticationFilter - [2F440BB9-B37A-C808-9DF3-A4BD90954C25] Populated SecurityContextHolder with anonymous token: 02:05:04,574 DEBUG intercept.FilterSecurityInterceptor - [DE5A2A09-3788-A426-F194-A72C82F65C33] RunAsManager did not change Authentication object```
@steotia: So each log line represents a row for the log column?
@pabraham.usa: that's correct
@steotia: Ok, my guess is you are probably hitting that bug. So try with latest. Re-indexing is not needed.
@pabraham.usa: great let me try that now...
#random
@balci: @balci has joined the channel
@neer.shay: @neer.shay has joined the channel
@mike: @mike has joined the channel
#feat-text-search
@dungnt: @dungnt has joined the channel
#feat-presto-connector
@dungnt: @dungnt has joined the channel
#troubleshooting
@dungnt: @dungnt has joined the channel
@tanmay.movva: Hello, I am getting this error on one of the realtime tables: ```ERROR [LLRealtimeSegmentDataManager_spanEventView__0__15__20201201T0448Z] [spanEventView__0__15__20201201T0448Z] Could not build segment java.lang.IllegalStateException: Cannot create output dir: /var/pinot/server/data/index/spanEventView_REALTIME/_tmp/tmp-spanEventView__0__15__20201201T0448Z-160688315943``` Because of this, Pinot is not able to build segments/ingest data. How do I debug this?
@tanmay.movva: Even when Pinot is not able to ingest and the lag is increasing, the table and segment status on the UI is `GOOD`.
@tanmay.movva: btw, the state of this segment is `CONSUMING`
@g.kishore: Anything else in the log?
@tanmay.movva: No, info logs were not enabled; the only error was the one above. But I checked the path where it is trying to create a directory, and there already exists a directory at that `_tmp` path.
@g.kishore: Are any other segments getting created?
@tanmay.movva: Yes. Only this table was affected.
@g.kishore: I am guessing you have enough disk
@tanmay.movva: Yes.
@fx19880617: is this directory local or remote?
@tanmay.movva: This directory is on the volume attached to the pod. So local.
@fx19880617: is your volume a local disk or remote, like EBS?
@tanmay.movva: ebs volume.
@fx19880617: are all segment persists failing on that volume?
@fx19880617: can you try to create a file on that volume
@fx19880617: try to access it through the Pinot server container?
@tanmay.movva: Was able to do that. Remaining realtime segments are still able to ingest and serve.
@fx19880617: hmm
@fx19880617: is there any log inside the pinot server container
@fx19880617: there is a file pinotServer.log
@tanmay.movva: I deleted and recreated the table and now everything is running fine. Not sure what the issue was. @fx19880617 I will check the pinotServer.log and share if I find anything.
@tanmay.movva: One more thought: when the freshness metric for a table is increasing for any reason (server unresponsive, error connecting to Kafka, etc.), shouldn't the status of the table be "not GOOD"? When I checked the table and segment status on the UI while facing this issue, all statuses were `GOOD`
@fx19880617: agreed
@fx19880617: if you can provide more info, then we can fix this
@fx19880617: also need to distinguish this with the scenario like kafka upstream has no data coming
@fx19880617: in certain cases, we need to define what is “good” status
@tanmay.movva: > also need to distinguish this with the scenario like kafka upstream has no data coming In that scenario, comparing last consumed offset in pinot and latest available offset in kafka should help, yes? Also I am assuming the current freshness metric is measured from current time. If the data is not available in kafka, then this would keep on increasing.
@srsurya122: I tried to execute the pinot start controller command and it got the following error. Could you please help me with this?
@srsurya122: my requirement is to run it on a Windows EC2 instance
@taranrishit1234: @taranrishit1234 has joined the channel
@neer.shay: @neer.shay has joined the channel
@neer.shay: Hi! I am interested in creating an infrastructure for monitoring machine learning models running in production and am very intrigued by what Pinot has to offer. I had some questions regarding my use case, and it would be great to hear some feedback before I get started.
1. I want to monitor input features and output predictions. Here I am essentially interested in anomaly detection, and it appears ThirdEye answers this.
2. I am interested in calculating business KPIs (precision, recall, accuracy, etc.) once labels are available. Is it possible to do this during ingestion? Is it possible to run custom scripts (Python?) to calculate KPIs during ingest?
3. Visualization: I would like the ability to see data in dashboards as well as slice and dice. What tools are available for this?
Thanks in advance for the assistance!
@g.kishore: • Yes, ThirdEye will work for anomaly detection and RCA
• Custom scripts: Python is not supported; you can run Groovy scripts
• Visualization: Superset
@neer.shay: Thanks @g.kishore. Regarding custom scripts, do you have any insights as to how
@g.kishore: you can add additional logic as long as it does not depend on the previous record for the same key
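For concreteness, a sketch of how a Groovy transform is declared in the table config for ingestion-time logic; the column and field names here are made up:
```
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "fullName",
        "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
      }
    ]
  }
}
```
Each record is transformed independently, which is why the stateless constraint mentioned above applies.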
@dovydas: @dovydas has joined the channel
@mike: @mike has joined the channel
@zjinwei: Hi, I'm creating a table using the Pinot UI. I followed the example
@npawar: Were all the previous steps successful (cluster setup, Kafka setup, and schema creation)?
@npawar: Do you have access to the pinotController.log? There should be an exception message in there.
@zjinwei: Yes, I was deploying Pinot on K8s and creating the table using the REST API provided by Pinot. The error is ```{ "code": 500, "error": "org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata" }```
@npawar: that means kafka is not reachable
@npawar: you need to change `"stream.kafka.broker.list": "localhost:9876"` according to your Kafka cluster
@zjinwei: which file should I change? I just followed the steps in
@npawar: @fx19880617 ^^
@npawar: what is the output of `kubectl get all` ?
@zjinwei: ```
NAME                     READY   STATUS    RESTARTS   AGE
pod/kafka-0              1/1     Running   0          46m
pod/kafka-zookeeper-0    1/1     Running   0          46m
pod/kafka-zookeeper-1    1/1     Running   0          46m
pod/kafka-zookeeper-2    1/1     Running   0          45m
pod/pinot-broker-0       1/1     Running   1          50m
pod/pinot-controller-0   1/1     Running   0          50m
pod/pinot-server-0       1/1     Running   1          50m
pod/pinot-server-1       1/1     Running   1          50m
pod/pinot-zookeeper-0    1/1     Running   0          50m

NAME                                TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/kafka                       ClusterIP      10.102.3.210     <none>        9092/TCP                     46m
service/kafka-headless              ClusterIP      None             <none>        9092/TCP                     46m
service/kafka-zookeeper             ClusterIP      10.108.54.111    <none>        2181/TCP                     46m
service/kafka-zookeeper-headless    ClusterIP      None             <none>        2181/TCP,3888/TCP,2888/TCP   46m
service/pinot-broker                ClusterIP      10.107.199.233   <none>        8099/TCP                     50m
service/pinot-broker-external       LoadBalancer   10.96.55.75      <pending>     8099:30241/TCP               50m
service/pinot-broker-headless       ClusterIP      None             <none>        8099/TCP                     50m
service/pinot-controller            ClusterIP      10.105.81.75     <none>        9000/TCP                     50m
service/pinot-controller-external   LoadBalancer   10.107.192.144   <pending>     9000:32385/TCP               50m
service/pinot-controller-headless   ClusterIP      None             <none>        9000/TCP                     50m
service/pinot-server                ClusterIP      10.109.0.248     <none>        8098/TCP                     50m
service/pinot-server-headless       ClusterIP      None             <none>        8098/TCP                     50m
service/pinot-zookeeper             ClusterIP      10.108.40.150    <none>        2181/TCP                     50m
service/pinot-zookeeper-headless    ClusterIP      None             <none>        2181/TCP,3888/TCP,2888/TCP   50m

NAME                               READY   AGE
statefulset.apps/kafka             1/1     46m
statefulset.apps/kafka-zookeeper   3/3     46m
statefulset.apps/pinot-broker      1/1     50m
statefulset.apps/pinot-controller  1/1     50m
statefulset.apps/pinot-server      2/2     50m
statefulset.apps/pinot-zookeeper   1/1     50m
```
@zjinwei: seems like all the pods are running well
@npawar: in your table config json, change the Kafka broker address from "localhost:9876" to "kafka:9092"
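The relevant fragment of the table config would then look something like this (abridged sketch; the other stream settings from the example you are following stay as they are):
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.broker.list": "kafka:9092"
}
```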
@zjinwei: Yes! that works! thank you.
@zjinwei: One more question: after creating the schema and table, when I publish data into Kafka with `kubectl -n pinot-quickstart exec kafka-0 -- kafka-console-producer --broker-list kafka:9092 --topic transcript-topic < /tmp/pinot-quick-start/rawData/transcript.json` I cannot query it in the UI; it seems like there is no data. Do you have any ideas?
@npawar: can you query the kafka topic using kafka-console-consumer to confirm if the topic has data?
@zjinwei: Seems like no. Sorry, I'm new to Pinot and Kafka. I used `kubectl -n pinot-quickstart exec kafka-0 -- kafka-console-consumer --bootstrap-server kafka:9092 --topic transcript-topic --from-beginning` and the response shows `WARN [Consumer clientId=consumer-1, groupId=console-consumer-65502] Connection to node -1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)`
@fx19880617: is this file `/tmp/pinot-quick-start/rawData/transcript.json` on your local machine or in that container?
@fx19880617: you may need to copy that file into the container, then run that command
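Something along these lines should work (a sketch; the in-container path is arbitrary, and the redirect has to run inside the container, hence the `sh -c`):
```
kubectl -n pinot-quickstart cp /tmp/pinot-quick-start/rawData/transcript.json kafka-0:/tmp/transcript.json
kubectl -n pinot-quickstart exec kafka-0 -- sh -c "kafka-console-producer --broker-list kafka:9092 --topic transcript-topic < /tmp/transcript.json"
```
Note that the original command's `< ...` redirect is evaluated on the machine where kubectl runs, which is why it only works if the file exists locally.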
#discuss-validation
@mohammedgalalen056: @mohammedgalalen056 has joined the channel
@chinmay.cerebro: @mohammedgalalen056 has pasted the json schema template here:
@chinmay.cerebro: please review this whenever you get time
@chinmay.cerebro: @mohammedgalalen056 I realized it's going to be difficult to discuss individual fields on the issue. Maybe we can paste it in a Google doc?
@mohammedgalalen056: OK, also I just updated the schema.
@mohammedgalalen056: Updated the above doc:
#config-tuner
@chinmay.cerebro: please review this when you get a chance:
@steotia: I will review today
@chinmay.cerebro: Thanks a lot
