#general


@alec: @alec has joined the channel
@pascal.jerome: @pascal.jerome has joined the channel
@karinwolok1: :mortar_board: I don't know about all of you, but I'm still riding the Apache Pinot graduation wave :heart_eyes: :wine_glass: :mortar_board: Just wanted to take a moment to shout out our amazing Apache Pinot contributors. Thank you for all that you do and for being such a big part of this movement!!! :heart: As Pinot continues to grow, you'll be touching the lives of many (developers and end-users everywhere :smiley: ) Here's a blog post by my colleague @allison that highlights many (definitely not all) of these awesome folks who helped bring this project to where it is today @jackie.jxt @xiangfu0 @npawar @g.kishore @mayanks @ssubrama @slack1 @snlee @jlli @steotia @yupeng @miliang @kharekartik @tingchen @sanket @chinmay.cerebro @amrish.k.lal @hbwang89 @moradi.sajjad @jiapengtao0 @shgandhi @elon.azoulay @jamesyf.shao @chethanu.tech @guo.980 @yash.agarwal @pradeepgv42 @ken @joey @vananth22 @s.aditya.1811 @atri.sharma For the full list of contributors, check out the Apache Pinot github (and give it a :star: if you haven't already!)
@ebuth: @ebuth has joined the channel

#random


@alec: @alec has joined the channel
@pascal.jerome: @pascal.jerome has joined the channel
@ebuth: @ebuth has joined the channel

#troubleshooting


@kangren.chia: has anyone seen the client become the bottleneck for their queries before? i’m seeing `timeMs=~1000ms` on my broker, but it takes 2-3 seconds on the python/java clients on the single line that executes the query, e.g.
```
# python
curs.execute(sql_query)

// java
ResultSetGroup pinotResultSetGroup = pinotConnection.execute(pinotClientRequest);
```
  @mayanks: What's the size of the response?
  @kangren.chia: this is for a query that fetches 400k ints
  @kangren.chia: is the response size in the broker logs?
  @kangren.chia: oh yeah it is, `ResponseSize=6000433`
  @mayanks: I am guessing there's some JSON rendering/conversion happening that might be taking time cc: @jackie.jxt
  @kangren.chia: I can’t use idset and in_subquery directly to have all the work happen in the cluster so I need to split it into 2 queries on the client side
  @kangren.chia: Would be nice if idset could work with queries that have a GROUP BY + HAVING clause
  @kangren.chia: for the python client, the slowness definitely comes from the client:
  ```
  def execute_query(query):
      start = timer()
      curs.execute(query)
      end = timer()
      print(f"query took {end-start}")
      return curs.fetchall()
  ```
  ```
  python3 -m cProfile query.py

  Ordered by: cumulative time

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      208/1    0.015    0.000   48.651   48.651 {built-in method builtins.exec}
   811465/5    0.907    0.000   47.301    9.460 db.py:41(g)
   405731/2    0.519    0.000   42.400   21.200 db.py:52(g)
          2    0.368    0.184   42.400   21.200 db.py:369(fetchall)
     405729    0.396    0.000   41.531    0.000 db.py:393(__next__)
     405729    0.348    0.000   40.211    0.000 db.py:345(fetchone)
     405731   39.863    0.000   39.863    0.000 {method 'pop' of 'list' objects}
          2    0.000    0.000    4.900    2.450 db.py:263(execute)
  ```
  @xiangfu0
  @xiangfu0: hmm, it may come from the results JSON parsing
  @kangren.chia: does broker return JSON only? is there a more efficient way to retrieve data?
  @mayanks: The broker currently returns JSON, and I have also seen JSON parsing being slow for really large response sizes
  @ken: We also noticed JSON parsing being the bottleneck for large responses. It would be useful to have a binary/serialized response format, similar to how Solr supports either xml or binary.
  @ken: Though this can definitely create compatibility issues when you want to extend the response format. At the time this was implemented for Solr, Avro didn’t exist, but it would have been a better option than their own custom format.
  @mayanks: Yeah, it should be possible to send back a serialized response that the client-side library can provide abstractions to iterate over, etc. And yeah, I can see how it opens the door for compatibility issues. Historically speaking though, the Pinot response format has not changed that frequently
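A quick way to confirm that the time is going to client-side parsing rather than the broker itself is to time the raw HTTP call to the broker's SQL endpoint and the JSON decode separately, then compare with the pinotdb cursor. A minimal sketch, assuming a local broker and a placeholder query (host, port, table and column are not from the thread):
```
import time
import requests

BROKER = "http://localhost:8099"  # placeholder broker address
sql = "SELECT some_int_col FROM myTable LIMIT 400000"  # hypothetical query

start = time.perf_counter()
resp = requests.post(f"{BROKER}/query/sql", json={"sql": sql})
http_done = time.perf_counter()

payload = resp.json()  # decode the full JSON response body
rows = payload["resultTable"]["rows"]  # raw rows, no per-row Python objects yet
decode_done = time.perf_counter()

print(f"http: {http_done - start:.2f}s, json decode: {decode_done - http_done:.2f}s, rows: {len(rows)}")
```
If the HTTP and JSON-decode times are small but the pinotdb `fetchall()` is still slow, the overhead is in how the client library materializes rows (which is what the `list.pop` hotspot in the profile above suggests).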
@mrpringle: Do we have any way to control skipUpsert when using presto on top of pinot?
@alec: @alec has joined the channel
@pascal.jerome: @pascal.jerome has joined the channel
@luisfernandez: hey friends, we are trying to set up pre-aggregation for one of the realtime tables in our cluster, for this schema
```
{
  "schemaName": "etsyads_metrics",
  "dimensionFieldSpecs": [
    { "name": "shop_id", "dataType": "LONG" },
    { "name": "listing_id", "dataType": "LONG" }
  ],
  "metricFieldSpecs": [
    { "name": "cost", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "impression_count", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "click_count", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "order_count", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "revenue", "dataType": "INT", "defaultNullValue": 0 }
  ],
  "dateTimeFieldSpecs": [
    { "name": "serve_time", "dataType": "LONG", "format": "1:SECONDS:EPOCH", "granularity": "1:HOURS" }
  ]
}
```
we were reading this and we set the metrics as noDictionaryColumns
```
"noDictionaryColumns": [
  "click_count",
  "order_count",
  "impression_count",
  "cost",
  "revenue"
],
"aggregateMetrics": true
```
but we still see multiple rows for the same dimension values in the same hour, does anyone have some guidance as to why this may be happening?
  @jackie.jxt: Is this a realtime table?
  @luisfernandez: that is right
  @jackie.jxt: Within each consuming segment there will be unique dimension values
  @jackie.jxt: If you have multiple stream partitions, then there will be multiple consuming segments
  @jackie.jxt: Also when the previous consuming segment gets flushed, the next one will have the same dimension values again
  @ebuth: After we truncated our time values to the hour (before sending them to pinot), we did see aggregated values as expected. It almost seemed like pinot was including our time value as a dimension for the purposes of aggregation.
  @ebuth: And ignoring the “granularity” setting.
  @jackie.jxt: The `granularity` field within the `dateTimeFieldSpecs` is just for description purposes as of now, and pinot won't automatically truncate the time value
  @jackie.jxt: You can use the ingestion transform to truncate the time value
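For reference, the ingestion-transform approach might look roughly like the snippet below in the realtime table config. This is a sketch, not the actual config from this thread: it writes the truncated value into a hypothetical extra column `serve_time_hour` (which would also need a `dateTimeFieldSpec` in the schema), since a transform generally cannot write back into a column it reads from.
```
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "serve_time_hour",
      "transformFunction": "dateTrunc('hour', serve_time, 'SECONDS')"
    }
  ]
}
```
With the time column truncated to the hour at ingestion, rows sharing the same `shop_id`/`listing_id`/hour can be rolled up by `aggregateMetrics` as intended.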
@ebuth: @ebuth has joined the channel
@dadelcas: Hey, I'm having an issue with hybrid tables. It looks like RealtimeToOfflineSegmentsTask was not moving all the records in the realtime segments, so for example I've got 4 million records in the realtime table but I end up with only 600 in the offline table after the realtime segments are deleted. Surely this is a configuration issue in my table. I was wondering whether this could be a problem between the table retention periods and the offline flow buffer configuration.
  @dadelcas: In fact the realtime segments in S3 are a few MB each, whereas the offline segments are only a few KB
  @npawar: can you share both table configs?
  @npawar: the realtime to offline job will move data from realtime to offline, periodically, 1 window at a time. So if your realtime table has several time windows, it can take several runs before all data is moved over. If the realtime table’s retention kicks in before that, it will get deleted. The 2 processes aren’t aware of each other
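As background for the window/buffer vs. retention interaction described above, the relevant settings live in two separate places in the realtime table config. A sketch with purely illustrative values (not the actual config from this thread):
```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "1d"
    }
  }
},
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "7"
}
```
Each task run moves roughly one `bucketTimePeriod` of data, so the realtime retention has to be long enough for the periodic task to work through all pending windows before segments age out.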

#getting-started


@sina.tamizi: Hi team, I am working on a project for realtime speed test calculation. I get the speed test data from devices via Kafka ingestion. Once the data is in Pinot, the following calculations need to be performed:
- Select the peak hour data (7pm-11pm).
- The data arrives with different timestamps; the average speed needs to be calculated for every hour between 7pm and 11pm each day, e.g. 7-8pm average, 8-9pm average, 9-10pm average and 10-11pm average. (The average for each hour should be available as soon as the 1-hour window is completed.)
- The 4 hourly averages need to be stored in another table, so we would have 4 data points per day.
- From that second table, the past 14 days of data need to be selected, and the 3rd worst speed should be reported and stored in another table.
Both of these tables would be my reports. The question is whether Pinot is a suitable platform for this sort of calculation, and what the best way would be to run ETL jobs or tasks that execute the queries. I have already done this with InfluxDB, but I would like to design/implement it with Pinot. Note that I also have other use cases with the same data where I need the data reported in realtime. Thank you in advance for your help.
  @g.kishore: You can achieve this with Pinot, but the ETL task needs to be scheduled and maintained outside of Pinot. If you have Kafka, create a topic for each table and schedule the tasks using a scheduler like Airflow. In each task, query Pinot and write the output to another Kafka topic.
  @sina.tamizi: @g.kishore thanks for your help. Unfortunately, the Kafka server is in another environment and I simply consume the data from it. I am a bit hesitant to introduce a new Kafka cluster just for writing data from one table to another. Is there any other way? It would be great if Pinot could support INSERT INTO
  @g.kishore: we don't have that today, you can do it via presto/trino cc @elon.azoulay
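A rough sketch of the kind of scheduled task described above, using the pinotdb client and kafka-python. The broker address, Kafka bootstrap servers, topic, table, and query are all placeholders, and the scheduling itself (e.g. an hourly Airflow DAG or cron entry) is left out:
```
import json
import time

from kafka import KafkaProducer   # pip install kafka-python
from pinotdb import connect       # pip install pinotdb

# Placeholders: adjust for your environment.
PINOT_BROKER_HOST = "pinot-broker"
PINOT_BROKER_PORT = 8099
KAFKA_BOOTSTRAP = "kafka:9092"
OUTPUT_TOPIC = "hourly_speed_averages"

def run_hourly_rollup():
    # Query Pinot for the last hour's averages (hypothetical table/columns).
    one_hour_ago = int(time.time()) - 3600
    conn = connect(host=PINOT_BROKER_HOST, port=PINOT_BROKER_PORT, path="/query/sql", scheme="http")
    curs = conn.cursor()
    curs.execute(f"""
        SELECT device_id, AVG(speed) AS avg_speed
        FROM speed_tests
        WHERE test_time >= {one_hour_ago}
        GROUP BY device_id
    """)

    # Publish each result row to the topic that feeds the second (rollup) table.
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BOOTSTRAP,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for device_id, avg_speed in curs:
        producer.send(OUTPUT_TOPIC, {"device_id": device_id, "avg_speed": avg_speed})
    producer.flush()

if __name__ == "__main__":
    run_hourly_rollup()
```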
@elon.azoulay: @elon.azoulay has joined the channel
@alec: @alec has joined the channel
@tiger: Does Pinot automatically delete the generated indices from the servers after deleting a segment? I'm running into an issue where I would delete the segment through the REST API, but it leaves the index files under PinotServer/index behind. The indices built up over time from my test tables and now the servers are out of disk space
  @jackie.jxt: The indices should be in the same folder as the segment, and removed along with the segment. Can you please check what files are left over on the servers?
  @tiger: It seems that all the segments are still remaining in PinotServer/index. I think my cluster somehow ended up in a weird state where it shows something like 1000/0 segments in the controller UI for that table.
  @tiger: It doesn't show the segments anymore through the UI but still returns the segments when viewing them through the REST API. Trying to delete through the REST API again seems to do nothing now.
  @jackie.jxt: Can you please check the idealstate and external view of the table?
  @tiger: The ideal state is empty. The external view actually shows all of the segments as OFFLINE
  @jackie.jxt: The external view still exists?
  @jackie.jxt: Can you check if there is any ERROR in the server log?
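For anyone following along, the ideal state and external view being compared here can be pulled from the controller's REST API. A small sketch, with a placeholder controller address and table name:
```
import requests

CONTROLLER = "http://localhost:9000"  # placeholder controller address
TABLE = "myTable"                     # placeholder table name

ideal_state = requests.get(f"{CONTROLLER}/tables/{TABLE}/idealstate").json()
external_view = requests.get(f"{CONTROLLER}/tables/{TABLE}/externalview").json()

# Segments that appear in the external view but not in the ideal state point to
# leftover state like the one described above.
print("ideal state:", ideal_state)
print("external view:", external_view)
```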