#general


@sajjan: Hey Karin, I was looking for near real-time reporting tech for our system and Pinot seems great. We may need other tools along with Pinot, so it's good to be here, guys.
  @mayanks: Great to have you, feel free to ask questions as you try out Pinot
@nicolas.richard: @nicolas.richard has joined the channel
@frederic.vanharen: @frederic.vanharen has joined the channel

#random


@nicolas.richard: @nicolas.richard has joined the channel
@frederic.vanharen: @frederic.vanharen has joined the channel

#troubleshooting


@jmeyer: Hello, I'm seeing odd behavior with filtering (WHERE predicate). A column contains LONG values (single-valued), and filtering only works for some of the values in that column. Here's the data:
  @jmeyer: Here's a query that works:
  @mayanks: IIRC, there was a UI issue that was causing this. Do you see the same issue when making curl calls to Pinot from shell?
  @mayanks: The UI issue was resolved, but may not have made it to 0.7.1
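  For reference, a minimal sketch of querying the broker directly from the shell, in the same style as the commands later in this thread (assuming the broker listens on `localhost:8099`):
  ```
  curl localhost:8099/query/sql -d '{"sql": "select * from test_communities limit 5"}'
  ```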
  @jmeyer: Here's one that doesn't (the id is a copy-paste from the data, so no doubt it's valid)
  @jmeyer: @mayanks Thanks for the pointer, I'll check the API output
  @jmeyer: Output really is empty from the API
  @jmeyer:
  @mayanks: Hmm, the UI calls the api internally
  @jmeyer: Yes, and the API really returns no results for some of the ids
  @mayanks: I see, are you saying that the API and the UI both fail in the same way?
  @mayanks: what's the query?
  @jmeyer: They do, I'm saying it doesn't look like a UI bug
  @jmeyer: (FYI I'm on version `0.8.0-SNAPSHOT-46009e152b8f56c244e415beefa81dbc626de7cb`)
  @jmeyer: > what's the query?
  NOK: `select * from test_communities where entityId = 47923728534576490`
  OK: `select * from test_communities where entityId = 47923728366804330`
  @jmeyer: Really odd...
  @jmeyer: @mayanks If you don't mind, it may be easier if I can do a short demo whenever you've got time
  @ken: What happens with `select * from test_communities where entityId = '47923728534576490'`?
  @jmeyer: @ken Same result (as in, no change in behavior for either query)
  @mayanks: Where did you get the id to query from? If the id was from UI, it might have been incorrect to begin with due to UI issue
  @jmeyer: *More context:*
  • Table contains 17 documents
  • Table was created (definition, schema) using the API; data was fed via the `ingestFromFile` API endpoint (unit testing)
  @jmeyer: > Where did you get the id to query from? If the id was from UI, it might have been incorrect to begin with due to UI issue
  Indeed, I copy-pasted it from the UI, which works for one ID but not the others. Can you recommend another way?
  @mayanks: _JavaScript_ truncates values in weird ways
  @mayanks: Can you get the id from a `select *` curl command and then try to repro?
  @jmeyer: `curl localhost:8099/query/sql -d '{"sql": "select * from test_communities"}' | jq .`
  ```
  {
    "resultTable": {
      "dataSchema": {
        "columnNames": ["communityId", "entityId", "entityType", "parentDepartmentId"],
        "columnDataTypes": ["STRING", "LONG", "STRING", "STRING"]
      },
      "rows": [
        ["604795e710fc330f65e9100e", 47923728366804330, "user", "null"],   <<< TEST ID.1
        ["604795e710fc330f65e9100e", 47923728534576490, "user", "null"],   <<< TEST ID.2
        ["604795e710fc330f65e9100c", 62535436887531140, "department", "de2ba60543be995f958e1e80"],
        ["604795e710fc330f65e9100c", 62535436887531080, "department", "de2ba60543be995f958e1e80"],
        ["604795e710fc330f65e9100c", 62535436954639944, "department", "de2ba60543be995f958e1e80"],
        ["604795e710fc330f65e9100c", 47923728685571440, "user", "de2ba60543be995f958e1e80"],
        ["604795e710fc330f65e9100c", 47923728803011950, "user", "de2ba60543be995f958e1e80"],
        ["604795e710fc330f65e9100c", 62535436887531080, "department", "de2ba60543be995f958e1e46"],
        ["604795e710fc330f65e9100c", 47923728685571440, "user", "de2ba60543be995f958e1e46"],
        ["604795e710fc330f65e9100c", 62535436954639944, "department", "de2ba60943be995f958e1e47"]
      ]
    },
    "exceptions": [],
    "numServersQueried": 1,
    "numServersResponded": 1,
    "numSegmentsQueried": 1,
    "numSegmentsProcessed": 1,
    "numSegmentsMatched": 1,
    "numConsumingSegmentsQueried": 0,
    "numDocsScanned": 10,
    "numEntriesScannedInFilter": 0,
    "numEntriesScannedPostFilter": 40,
    "numGroupsLimitReached": false,
    "totalDocs": 17,
    "timeUsedMs": 4,
    "offlineThreadCpuTimeNs": 398035,
    "realtimeThreadCpuTimeNs": 0,
    "segmentStatistics": [],
    "traceInfo": {},
    "minConsumingFreshnessTimeMs": 0,
    "numRowsResultSet": 10
  }
  ```
  Now take the first returned ID (`entityId`) -> `47923728366804330`
  *Query:* `curl localhost:8099/query/sql -d '{"sql": "select * from test_communities WHERE entityId = 47923728366804330"}' | jq .`
  ```
  {
    "resultTable": {
      "dataSchema": {
        "columnNames": ["communityId", "entityId", "entityType", "parentDepartmentId"],
        "columnDataTypes": ["STRING", "LONG", "STRING", "STRING"]
      },
      "rows": [
        ["604795e710fc330f65e9100e", 47923728366804330, "user", "null"]
      ]
    },
    "exceptions": [],
    "numServersQueried": 1,
    "numServersResponded": 1,
    "numSegmentsQueried": 1,
    "numSegmentsProcessed": 1,
    "numSegmentsMatched": 1,
    "numConsumingSegmentsQueried": 0,
    "numDocsScanned": 1,
    "numEntriesScannedInFilter": 17,
    "numEntriesScannedPostFilter": 4,
    "numGroupsLimitReached": false,
    "totalDocs": 17,
    "timeUsedMs": 4,
    "offlineThreadCpuTimeNs": 342519,
    "realtimeThreadCpuTimeNs": 0,
    "segmentStatistics": [],
    "traceInfo": {},
    "minConsumingFreshnessTimeMs": 0,
    "numRowsResultSet": 1
  }
  ```
  -> OK
  Now take the second ID -> `47923728534576490`
  *Query:* `curl localhost:8099/query/sql -d '{"sql": "select * from test_communities WHERE entityId = 47923728534576490"}' | jq .`
  ```
  {
    "resultTable": {
      "dataSchema": {
        "columnNames": ["communityId", "entityId", "entityType", "parentDepartmentId"],
        "columnDataTypes": ["STRING", "LONG", "STRING", "STRING"]
      },
      "rows": []
    },
    "exceptions": [],
    "numServersQueried": 1,
    "numServersResponded": 1,
    "numSegmentsQueried": 1,
    "numSegmentsProcessed": 1,
    "numSegmentsMatched": 0,
    "numConsumingSegmentsQueried": 0,
    "numDocsScanned": 0,
    "numEntriesScannedInFilter": 0,
    "numEntriesScannedPostFilter": 0,
    "numGroupsLimitReached": false,
    "totalDocs": 17,
    "timeUsedMs": 10,
    "offlineThreadCpuTimeNs": 383400,
    "realtimeThreadCpuTimeNs": 0,
    "segmentStatistics": [],
    "traceInfo": {},
    "minConsumingFreshnessTimeMs": 0,
    "numRowsResultSet": 0
  }
  ```
  -> NOK
  @jmeyer: Pretty surprising :open_mouth:
  @jmeyer:
  @jmeyer: Logs don't show much but here they are
  @jmeyer:
  @mayanks: I missed it: where did you get `47923728366804330` and `47923728534576490` from?
  @jmeyer: From the results of the first query, column `entityId` (so, the 2nd element of each row array)
  @jmeyer: Here ->
  ```
  "rows": [
    ["604795e710fc330f65e9100e", >>>>> 47923728366804330 <<<<<, "user", "null"],
  ```
  @jmeyer: I can repro the issue after dropping the table, schema & data and setting things back up. I'll try restarting the cluster and starting it from scratch, using the latest Docker image. *=> Same behavior* :confused:
  @mayanks: Can you file an issue with steps to reproduce? I can get to it later today/tomorrow
  @jmeyer: @mayanks Will do, okay
  @mayanks: Cc @jackie.jxt
  @jackie.jxt: @jmeyer Where did you get the json response? If it is from the query console "showing json", then the value is already truncated (_JavaScript_ only preserves 52 mantissa bits of precision)
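  A quick way to see this limit on the IDs from this thread (a sketch assuming `node` is installed; any _JavaScript_ runtime behaves the same):
  ```
  # Both literals parse to the same IEEE-754 double, so the value shown in the UI can be off
  node -e 'console.log(47923728534576490 === 47923728534576488)'   # true
  node -e 'console.log(Number.isSafeInteger(47923728534576490))'   # false: above 2^53 - 1
  ```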
  @jackie.jxt: Here is the issue: It is not fixed yet
  @jackie.jxt: Can you try querying the broker directly using curl:
  @jackie.jxt: It should give the accurate result
  @jmeyer: @jackie.jxt That's what I've tried too
  @jmeyer: See
  @jackie.jxt: Hmm, how about without `jq`?
  @jmeyer: It's only for formatting the result. Not sure I follow then; which query do you want me to try without it?
  @jackie.jxt:
  ```
  jackie@MacBook-Pro ~ % echo '12345543211234554321' | jq .
  12345543211234554000
  ```
  @jackie.jxt: I think it also truncates the value
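  A possible workaround sketch: pretty-print with `python3 -m json.tool` instead of `jq` (assuming python3 is on the PATH); Python's json module parses integers with arbitrary precision, so LONG values survive intact:
  ```
  curl -s localhost:8099/query/sql -d '{"sql": "select * from test_communities"}' | python3 -m json.tool
  ```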
  @jmeyer: Ooooh interesting !
  @jmeyer: Let me try, nice catch
  @jmeyer: @jackie.jxt @mayanks @ken Sorry for the trouble, @jackie.jxt is right: `jq` was having the same issue as Pinot's UI, altering the LONG values...
  @jmeyer: So, no issue, case solved :slightly_smiling_face:
  @jmeyer: Thanks again @jackie.jxt
@elon.azoulay: Hi, we are getting a 500 error (with no logs) when running queries with Integer.MAX_VALUE as the limit on Pinot 0.7.1. I see from the code that this should be the max limit. Is there something else we are running into? (There are no table quotas.) Here is the full error, with hosts and table names redacted to maintain super secrecy :) :
```
Unexpected response status: 500 for request {"sql" : "select AAA, BBB from CCC limit 2147483647" } to url :XXXX/query/sql, with headers {Accept=[application/json], Content-Type=[application/json]}, full response null
```
  @elon.azoulay: Should I create a github issue? I can also help and work on this, lmk.
  @elon.azoulay: The controller has a stack trace but not the broker, and there are no logs other than that the request was run (in pinotBroker.log)
  @elon.azoulay: The stack trace is basically that the post failed (from the controller). Also the server does not have any logs (maybe the request never makes it that far)
  @mayanks: Is there an OOM?
  @elon.azoulay: I don't see an OOM or anything in the logs. The main consumer of memory (per jmap) was the async logger. Also, the table only has 4K rows. When I lower the limit the query runs; then I raise it back and the query fails. No crashes or OOMs.
  @elon.azoulay: Have you ever seen this? I am retrying in local docker with debug logging (on a test table though) - maybe I can find debug messages…
  @elon.azoulay: Ok, more debug info, I see that it made it to the server:
  @elon.azoulay:
  @elon.azoulay: And broker parsed the query just before that (about 10ms before):
  @elon.azoulay:
  @elon.azoulay: So it looks like something happened after the broker received the response from the server...
  @elon.azoulay: Server completed the response: ``` Processed requestId=2,table=alltypes_REALTIME,segments(queried/processed/matched/consuming)=1/1/1/1,schedulerWaitMs=5,reqDeserMs=0,totalExecMs=4,resSerMs=0,totalTimeMs=12,minConsumingFreshnessMs=1625069569193,broker=Broker_f068e4137900_8099,numDocsScanned=11,scanInFilter=0,scanPostFilter=176,sched=fcfs```
  @jackie.jxt: I can reproduce the issue, let me look further into it
  @elon.azoulay: Thanks!
  @elon.azoulay: @jackie.jxt looks like it occurs after the `reduceOnDataTable` call in `SingleConnectionBrokerRequestHandler::processBrokerRequest`
  @elon.azoulay: Should I create a github issue for it?
  @jackie.jxt: @elon.azoulay Yes, please create an issue and we can post the findings there
  @elon.azoulay: will do
@nicolas.richard: @nicolas.richard has joined the channel
@frederic.vanharen: @frederic.vanharen has joined the channel

#aggregators


@leon.liu: @leon.liu has joined the channel
@leon.liu: Hey, good morning. I read some articles about Pinot and feel it could be a great tool for our real-time analytics platform; we currently use Snowflake and Redshift. I tried it with a simple use case (63 million records with percentileEst and avg aggregations) on a single EC2 instance and the performance is amazing. I want to pursue it further and have a few questions about building star-tree indexes for the aggregators. Mainly, we want to make sure building the star-tree indexes takes much less time than full cubing. Hope you can help me out:
1. For our percentile aggregation, we only care about the values at the 10th, 25th, 50th, 75th, and 90th percentiles. Is there any way to do the aggregation only for those percentiles?
2. How do I know if a star-tree index has been built? On the UI "Reload Status" screen, I don't see anything related to the star-tree index.
3. Currently we do very intensive monthly cubing to support real-time analytics (percentile on 12 columns, avg on 12 columns, approx_count_distinct on 5 columns). At the end of each month, we batch-feed about 70 million records. Is it possible to build the star-tree index in a couple of hours? If so, what are the recommended ways to speed up the index-building process?
Some context for our table:
1. 40 dimension columns, 1 time column, and 15 metric columns
2. We have a monthly feed of about 70 million records
3. We need monthly, quarterly, and yearly analytics
Thanks in advance

#pinot-dev


@leon.liu: @leon.liu has joined the channel

#pinot-docs


@amrish.k.lal: Hello, I am wondering how I can update the Pinot documentation? I would like to make doc changes corresponding to the recent JSON querying functionality () and maybe create a tutorial dealing with JSON queries ().
@mayanks: @amrish.k.lal please join <#C023BNDT0N8|pinot-docsrus> and check the header of the channel
@mayanks: @mayanks set the channel topic: This channel is moved to <#C023BNDT0N8|pinot-docsrus>

#pinot-perf-tuning


@leon.liu: @leon.liu has joined the channel
@leon.liu: Hey, good morning. I read some articles about Pinot and feel it could be a great tool for our real-time analytics platform; we currently use Snowflake and Redshift. I tried it with a simple use case (63 million records with percentileEst and avg aggregations) on a single EC2 instance and the performance is amazing. I want to pursue it further and have a few questions about building star-tree indexes for the aggregators. Mainly, we want to make sure building the star-tree indexes takes much less time than full cubing. Hope you can help me out:
1. For our percentile aggregation, we only care about the values at the 10th, 25th, 50th, 75th, and 90th percentiles. Is there any way to do the aggregation only for those percentiles?
2. How do I know if a star-tree index has been built? On the UI "Reload Status" screen, I don't see anything related to the star-tree index.
3. Currently we do very intensive monthly cubing to support real-time analytics (percentile on 12 columns, avg on 12 columns, approx_count_distinct on 5 columns). At the end of each month, we batch-feed about 70 million records. Is it possible to build the star-tree index in a couple of hours? If so, what are the recommended ways to speed up the index-building process?
Some context for our table:
1. 40 dimension columns, 1 time column, and 15 metric columns
2. We have a monthly feed of about 70 million records
3. We need monthly, quarterly, and yearly analytics
Thanks in advance
  @mayanks: Hello:
  ```
  1. The TDigest-based percentile size remains the same regardless of which percentiles you want to query (unless there's a hidden feature that I am unaware of).
  2. If you have access to the segment dir on the server, you can check the segment folder; there should be a star-tree index. But if you file an issue, we can expose it in some fashion.
  3. Build time depends on data size and the configuration you specify, but it could be possible.
  ```
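  For reference, a minimal sketch of how a star-tree index is declared in the table config (column names here are illustrative, and the exact `functionColumnPairs` spellings, e.g. `PERCENTILE_EST__...`, should be verified against the docs for your Pinot version):
  ```
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["dim1", "dim2", "dim3"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["SUM__metric1", "AVG__metric2", "PERCENTILE_EST__metric3"],
        "maxLeafRecords": 10000
      }
    ]
  }
  ```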
  @mayanks: What's your latency requirement? And have you tried without the star-tree index to see if the requirement can be met?
  @ken: For Q3 - You can use the Spark or Hadoop (MapReduce) job runner to build segments in parallel. If you configure your table to have indexes generated when the segment is being built, you avoid some potential CPU/memory bottlenecks when pushing these segments to the cluster. This all works well if your Pinot cluster has access to HDFS (or some other shared filesystem), which you can configure as your Pinot cluster’s deep store.
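  A minimal sketch of a Spark ingestion job spec along those lines (bucket names, URIs, and the Parquet input format are illustrative; assumes the `pinot-batch-ingestion-spark` and S3 filesystem plugins are available):
  ```
  executionFrameworkSpec:
    name: 'spark'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  jobType: SegmentCreationAndTarPush
  inputDirURI: 's3://my-bucket/monthly-feed/'
  includeFileNamePattern: 'glob:**/*.parquet'
  outputDirURI: 's3://my-bucket/pinot-segments/'
  pinotFSSpecs:
    - scheme: s3
      className: org.apache.pinot.plugin.filesystem.S3PinotFS
      configs:
        region: 'us-east-1'
  recordReaderSpec:
    dataFormat: 'parquet'
    className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
  tableSpec:
    tableName: 'myTable'
    schemaURI: 'http://CONTROLLER_HOST:9000/tables/myTable/schema'
    tableConfigURI: 'http://CONTROLLER_HOST:9000/tables/myTable'
  pinotClusterSpecs:
    - controllerURI: 'http://CONTROLLER_HOST:9000'
  ```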
  @mayanks: +1 to what @ken said ^^
  @leon.liu: Sub-second is our requirement. We are also client-facing with high concurrency
  @mayanks: what's the read qps?
  @leon.liu: 200 qps is good enough for us
  @leon.liu: If we use a star-tree index, query performance will be better than what we are looking for; the only concern is how long it takes to build the index. Right now it takes 12 hours in AWS to do the full cubing with Spark.
  @leon.liu: We use a lot of percentile aggregation in our queries. I tried it without any index and one query returned in about 6s. I'm relatively new here; not sure if there is any other way to make it much faster and avoid heavy indexing.
  @leon.liu: @ken Our data is in AWS S3. For the Q3 suggestion, is there an example I can look at for reference? Some more detailed documentation would also help greatly.
  @mayanks: How many docs did the 6s query scan? If too many (>100k), then star-tree is the right index
  @leon.liu: For the 6s query, it scans all of the docs I loaded (63 million)
  @leon.liu: If I load all of the data (36 months), the total doc count will be above 1 billion

#getting-started


@leon.liu: @leon.liu has joined the channel

#pinot-docsrus


@mark.needham: I made a PR that puts a link to the multi-valued columns docs next to where all the column types are described -
@mark.needham: I thought there was gonna be some sort of list or array type, I didn't realise it was a property that you could toggle
@atri.sharma: @atri.sharma has joined the channel
@amrish.k.lal: @amrish.k.lal has joined the channel
@ken: @ken has joined the channel
@xiangfu0: @xiangfu0 has joined the channel