#general


@atri.sharma: Are there examples of Pinot client running multiple concurrent queries?
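  A minimal sketch of one way to do this with the standard `pinot-java-client`, fanning queries out from a small thread pool; the Zookeeper address, cluster path, table, and queries are placeholders:
```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class ConcurrentQueries {
  public static void main(String[] args) throws Exception {
    // Placeholder Zookeeper address and cluster path.
    Connection connection = ConnectionFactory.fromZookeeper("localhost:2181/QuickStartCluster");
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<String> queries = List.of(
        "SELECT COUNT(*) FROM myTable",
        "SELECT MAX(createdDateInEpoch) FROM myTable");
    List<Future<ResultSetGroup>> futures = new ArrayList<>();
    for (String q : queries) {
      // Submit each query on its own thread; the same connection is reused.
      futures.add(pool.submit(() -> connection.execute(q)));
    }
    for (Future<ResultSetGroup> f : futures) {
      // Each future resolves to the result of one query.
      System.out.println(f.get().getResultSet(0).getString(0, 0));
    }
    pool.shutdown();
  }
}
```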
@david.cyze: @david.cyze has joined the channel
@simone.franzini: @simone.franzini has joined the channel
@qianbo.wang: Hi Pinot experts, I'm new to this analytics realm with Pinot and I have a general question: does Pinot support something like a "view", which is common in OLTP? What I'm looking for is a way to optimize frequently used queries that require aggregation over data entries, e.g., the sum of total sales for the past 30, 60, or 90 days, which aggregates on a designated time column. Another option I'm thinking of is to create a separate table for this aggregation, derived from the fact table, and use a scheduled job to update it. Any ideas? Thanks in advance!
  @ken: The standard Pinot approach would be to define a star tree index with the time column as the dimension and the sales column as the aggregate. That should get you very fast results for pretty much any date range.
  @qianbo.wang: That is interesting. I will take a look on that. Thanks!
  @mayanks: @qianbo.wang I'd first suggest evaluating the out-of-the-box performance for your queries. Only if the performance needs to be improved further should you explore the star-tree index for partial pre-materialization, as @ken suggested.
  @qianbo.wang: Thanks. We will benchmark and see whether the star-tree index helps
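  A minimal sketch of the star-tree index @ken describes, added under `tableIndexConfig` in the table config; the column names `daysSinceEpoch` and `salesAmount` are hypothetical:
```
"tableIndexConfig": {
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["daysSinceEpoch"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__salesAmount"],
      "maxLeafRecords": 10000
    }
  ]
}
```
  With this in place, a query like `SELECT SUM(salesAmount) FROM sales WHERE daysSinceEpoch >= 18870` can be served from pre-aggregated star-tree documents instead of scanning raw rows.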

#random


@david.cyze: @david.cyze has joined the channel
@simone.franzini: @simone.franzini has joined the channel

#troubleshooting


@gonzalo: Hi, I am trying to run the latest version of Pinot with Docker (Mac) and the container suddenly stops. I don't see any errors in the log, nor are there any other containers running at that time.
```
docker run \
  --network=pinot-demo \
  --name pinot-quickstart \
  -p 9000:9000 \
  apachepinot/pinot:latest QuickStart \
  -type batch
```
Does anyone have any idea what might be going on? Please find attached logs
  @david.cyze: Not sure (and I'm a very novice Pinot user myself). The logs stop after attempting to start the Swagger server. Maybe Swagger is trying to start on a port that is unavailable, and the exception handling just crashes with no further logs
  @gonzalo: thanks @david.cyze, but I think I got it. It was a memory issue; increasing the memory solved it
  @david.cyze: Sure thing :slightly_smiling_face: that was my second guess if you believe it :stuck_out_tongue:
  @gonzalo: haha, I do
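  For reference, on Docker for Mac the memory cap is raised under Docker Desktop → Preferences → Resources; the container and JVM can also be sized explicitly. A sketch, assuming the Pinot image honors the `JAVA_OPTS` environment variable:
```
docker run \
  --network=pinot-demo \
  --name pinot-quickstart \
  --memory=8g \
  -e JAVA_OPTS="-Xms1G -Xmx4G" \
  -p 9000:9000 \
  apachepinot/pinot:latest QuickStart \
  -type batch
```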
@david.cyze: @david.cyze has joined the channel
@david.cyze: I'm tasked with doing a Pinot POC for my organization, as we're considering switching to it as our primary data store for reporting data. I followed the guide and was able to create a realtime table ingesting streaming GitHub events. I'm now trying to set up my own realtime table ingesting dummy data with a JSON column and upserts enabled (this will be required for our use case). I have successfully uploaded both a table config and a schema to the Pinot controller, and I also created a little app to push dummy data into a Kafka topic. *I confirmed that the data is successfully being added to the topic; however, my table is not ingesting any records.* Can someone help me troubleshoot why that may be happening? I will post the table config and schema in this message's thread
  @david.cyze:
  @mayanks: Any errors in the controller or server logs?
  @david.cyze:
  @david.cyze: Just a moment @mayanks, I will give it a look. (Thought I was already tailing them, but it turned out I was looking at the kafka server)
  @mayanks: Also, what release of Pinot are you using? You can try the debug table API in Swagger with the latest 0.8.0
  @david.cyze: I am on 0.8.0. I was unaware of that API. I'll give that a look too
  @david.cyze: ```
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
java.lang.RuntimeException: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
```
  @david.cyze: It appears my payloads to the kafka topic are malformed as well. I will debug that and report back
  @david.cyze: ```
2021/08/31 21:54:12.977 ERROR [JSONMessageDecoder] [simplejson__0__1__20210831T2011Z] Caught exception while decoding row, discarding row. Payload is {"uid":"ad23a2ea-1fac-4a57-8d47-597d3b77a52a","attr_json": {"A": "{"type": "numTickets", "val": 83}","B": "{"type": "numTickets", "val": 51}","C": "{"type": "numTickets", "val": 61}"},"createdDateInEpoch":1570000000247}
shaded.com.fasterxml.jackson.core.JsonParseException: Unexpected character ('t' (code 116)): was expecting comma to separate Object entries
 at [Source: (ByteArrayInputStream); line: 1, column: 70]
```
  @npawar: You're missing ingestion config in your table config
  @npawar: You need to set a transform function on attr_json
  @npawar: `"columnName":"attr_json_str", "transformFunction":"jsonFormat(attr_json)"` and change the column name in schema to attr_json_str
  @david.cyze: Thank you both. After fixing my data seeding app and adding an `ingestionConfig`, I'm now able to ingest data into the table with a JSON column. I'm seeing some behavior I don't quite understand, however. Prior to adding the `ingestionConfig`, I ingested some rows where `attr_json` was null. After adding the config, I saw new rows where `attr_json` was populated. In my schema, I have defined `uid` as the primary key column. I am seeding 1,000 rows at a time, so I would expect to see `(number of runs prior to ingestionConfig * 1,000) + (number of runs after config * 1,000)` rows. However, after adding the `ingestionConfig` and seeding 1,000 more rows, my table now has 1,002 rows. My understanding of upserts is that a new record only overwrites an existing record with the same primary key. *This being the case, how is it that so many of my rows were overwritten / deleted? It is of course exceedingly unlikely that I managed to generate 998 of the same UIDs during my second round of ingestion.* I'm aware that Pinot does not support deletes. I'm using "delete" here because I'm not sure how else to explain my doc count going from 2,000 (prior to fixing the ingestion config) to 1,002
  @npawar: @jackie.jxt @yupeng
  @jackie.jxt: @david.cyze Pinot overwrites records based on the primary key only, and the record with the newer timestamp is preserved
  @jackie.jxt: So the expected behavior should be one record for each different `uid`
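  For reference, a minimal sketch of the pieces a full-upsert table needs; the `uid` and `createdDateInEpoch` columns follow the payloads above:
```
// In the schema:
"primaryKeyColumns": ["uid"]

// In the table config:
"upsertConfig": {
  "mode": "FULL"
},
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}
```
  For each `uid`, the record with the latest value in the time column (`createdDateInEpoch` here) is the one queries see.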
  @david.cyze: So there is no explanation for why so many records disappeared? I had run two iterations of my faulty ingestion application (i.e., before adding the config, thus generating null `attr_json` values). There were 2,000 records before I ran ingestion with the fixed application. That means the minimum number of records that should have been present is 2,000, and that only under the exceedingly unlikely possibility that every randomly generated UID was a duplicate of a previously generated one
  @david.cyze: Note too that if there were an error with the UID generating logic in my application (doubtful -- I used java's `UUID.randomUUID()`) such that each run of my app produced identical `uid` values, the total # of records should never have exceeded 1,000
  @david.cyze: When adding a `transformConfig`, does Pinot re-process all records with the updated config? This could explain the record loss:
  • 2k records exist where the JSON is malformed
  • update the `transformConfig`
  • Pinot re-processes these records; they fail the `transformFunction`; Pinot writes a new segment with them excluded
  • 0 records now
  • ingest records with the fixed application
  • 1k well-formed records are ingested (actually 1,001, as I had an off-by-one "error" in my app and actually generate 1,001 records each run. This doesn't explain why I saw 1,00*2* records, however)
  @jackie.jxt: No, pinot won't re-process the already consumed data
  @jackie.jxt: Since there is not much data, you may re-create the table to get a fresh start
  @david.cyze: Thanks for the suggestion. As I mentioned, I'm doing a POC. Unexplained data loss has me a bit worried, and I will continue to explore to see if anything else pops up
  @jackie.jxt: Understood. Once the table is correctly configured, there should be no data loss
  @david.cyze: Thank you all for your time and help. It is much appreciated :slightly_smiling_face:
@vibhor.jain: Issue: Multiple issues seen with Pinot 0.8 integration with PrestoSQL 350 (Trino).
*1. Selecting a BOOLEAN column in the projection list is a problem for both real-time and offline tables. The query throws:*
```
select hasVideo from table1 limit 10;
Query 20210901_071343_00190_p5w66 failed: Unable to create class org.apache.pinot.common.response.broker.BrokerResponseNative from JSON response: [{"resultTable":{"dataSchema":{"columnNames":["hasVideo"],"columnDataTypes":["BOOLEAN"]},"rows":[[false],[false],[false],[false],[false],[false],[false],[false],[true],[false]]},"exceptions":[],"numServersQueried":7,"numServersResponded":7,"numSegmentsQueried":7,"numSegmentsProcessed":7,"numSegmentsMatched":7,"numConsumingSegmentsQueried":0,"numDocsScanned":70,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":70,"numGroupsLimitReached":false,"totalDocs":70000,"timeUsedMs":5,"offlineThreadCpuTimeNs":3468272,"realtimeThreadCpuTimeNs":0,"segmentStatistics":[],"traceInfo":{},"numRowsResultSet":10,"minConsumingFreshnessTimeMs":0}]
```
*2. Queries not working as expected for a DateTime column.* Pinot does not have a direct DATETIME data type and supports STRING, LONG, and INT via dateTimeFieldSpecs. We have a STRING column in the dateTimeFieldSpecs section, but when using this column to query via PrestoSQL, it is not working as expected.
*3. The alias feature is not working.* Executed `count(*) AS total_calls`, but the result set shows the column name as `count(*)` only; the alias is not taking effect.
P.S.: We will be raising these concerns with the Trino community but thought of sharing them here too.
@mayanks: @elon.azoulay @xiangfu0 ^^
@g.kishore: Thanks for sharing, Vibhor. Some of these might be related to the connector as well.
@elon.azoulay: This is fixed in the new version of the connector, which will support Pinot 0.8.0, aliases, boolean types, and more function calls as well.
@elon.azoulay: Already have it working locally; it should be out soon.
@vibhor.jain: Hi @elon.azoulay, can you point me to the link where I could try it? I'm assuming it's not officially out.
@simone.franzini: @simone.franzini has joined the channel
@elon.azoulay: Right, still working on it and will push it soon; I'll keep you updated.

#pinot-dev


@steve.reed: @steve.reed has joined the channel

#getting-started


@luisfernandez: hey friends, in my current project I need to do stats for ads (impressions, click_count, click_spent, etc.). My client has many dimensions they may want to slice by (locale, user_id, search query, device, etc.). We currently track all of this data through Kafka, and I was thinking about using Pinot to make it queryable. The user-facing dashboard looks at this data by set time ranges and also custom time ranges, so I was wondering if Pinot is a good candidate for this problem. Right now I'm working on a POC with Pinot, so I would appreciate any insights :slightly_smiling_face: thank you!
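  A minimal sketch of what a Pinot schema for these metrics and dimensions could look like; all names here are hypothetical:
```
{
  "schemaName": "adStats",
  "dimensionFieldSpecs": [
    {"name": "locale", "dataType": "STRING"},
    {"name": "user_id", "dataType": "STRING"},
    {"name": "search_query", "dataType": "STRING"},
    {"name": "device", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "impressions", "dataType": "LONG"},
    {"name": "click_count", "dataType": "LONG"},
    {"name": "click_spent", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "eventTimeMillis", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
```
  Fixed and custom time ranges then become plain WHERE clauses on `eventTimeMillis`, and a star-tree index can pre-aggregate the heaviest dimension combinations.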
@steve.reed: @steve.reed has joined the channel
@tiger: Is there a way to specify the SegmentPush job to only push a single segment instead of a directory?
  @npawar: One way I can think of is setting "includePattern" in the YAML file. You can find that config in the docs
  @tiger: includePattern seems to only work for ingest during segment creation. For push, is it correct to set outputDirURI to exactly the segment to push?
  @npawar: Ah, you're talking about push only. Yes, looking at the code, it should work if you directly give the segment path.
  @npawar: Are you seeing different behaviour?
  @tiger: I just tried it by setting outputDirURI to that and it seems to work. Just wanted to confirm that is a valid use case. Thanks!
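  A sketch of a standalone metadata-push job spec pointed directly at one segment tarball; the URIs and table name are placeholders (the pattern key in the job-spec docs is `includeFileNamePattern`, which applies at segment-creation time, matching the behaviour noted above):
```
executionFrameworkSpec:
  name: 'standalone'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentMetadataPush
outputDirURI: '/path/to/segments/myTable_OFFLINE_0.tar.gz'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
tableSpec:
  tableName: 'myTable'
```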
  @tiger: On another note, I have a question about how the push works. I'm currently using the metadata push method. If I split up the creation and push steps, I believe the push job has to download the segment and then generate the metadata, right? If I use SegmentCreationAndMetadataPush, is it more efficient in that it can directly create the segment and generate the metadata in one go, saving an extra download of the segment?
  @npawar: Yes, that is correct; separating the two phases will incur an extra download