#general


@jcole: @jcole has joined the channel

#random


@jcole: @jcole has joined the channel

#troubleshooting


@pedro.cls93: Hello, I have a Pinot deployment done through K8s wherein I'm progressively adding fields to a realtime table. This deployment is very basic (1 server instance only, with 6GB for the Java heap; the pod has a memory limit of 7GB, 100GB persistent storage, and deep storage for segments has been enabled), but I'm getting multiple server restarts because the pod keeps getting killed with OutOfMemory errors while ingesting data and creating segments. It seems the cause is not the JVM itself but off-heap memory maps; please see the following image for more details.
  @pedro.cls93: My question is how can I size the deployment of the server adequately, particularly how can I manage this off-heap usage?
  @dlavoie: Server naturally uses off-heap
  @dlavoie: My rule of thumb is to leave 50% of the container memory to off-heap
  @dlavoie: then adjust based on your metrics. The metrics will show you whether you have room for more heap and whether it would be beneficial given the GC patterns
  @dlavoie: Table configs will have an impact on your off-heap usage, so I would start from the 50% rule of thumb and then optimize based on the specifics of the tables running on the system.
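  For illustration, a minimal sketch of that 50% split for the 7GB server pod described above (helm-style values; the exact keys and numbers are assumptions for this example, not official sizing guidance):
  ```
  server:
    jvmOpts: "-Xms3G -Xmx3G"    # heap capped at roughly half the container memory
    resources:
      requests:
        memory: 7Gi             # the remaining ~4Gi stays free for off-heap:
      limits:                   # direct memory, mmap'ed segments, metaspace, thread stacks
        memory: 7Gi
  ```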
  @pedro.cls93: Thank you for the feedback Daniel. What metrics in particular should I take a look at?
  @dlavoie: standard container and jvm metrics
  @dlavoie: Used memory from the k8s metrics with the JVM heap usage from JMX exporter
  @dlavoie: These two will tell you how much is available for non-heap
  @pedro.cls93: From the grafana chart I put above, it seems Pinot is using >2x as much memory for off-heap as for the JVM, rather than ~100%. With such a high amount of memory-mapped usage, does it make sense to add more servers, following the 50% for off-heap rule?
  @dlavoie: This is not the metric I meant
  @dlavoie: This represents the size of the data mapped on disk.
  @dlavoie: What we want is the size of the in-memory pointers
  @dlavoie: A server can have TB of memory mapped data.
  @dlavoie: Also
  @dlavoie: What usually causes OOM from k8s is that your JVM settings are too high for the actual k8s resource request.
  @dlavoie: Having too many elements off-heap will not cause OOM, but disk swapping.
  @dlavoie: OOM happens because the heap is using memory the JVM thinks has been made available to the pod, but hasn't, hence the OOM from k8s.
  @dlavoie: Tuning down the heap size usually fixes that issue
  @dlavoie: You could confirm that by monitoring K8s resource request vs physical usage
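  For example (assuming metrics-server is installed and the server pod is named `pinot-server-0`):
  ```
  # physical memory usage right now
  kubectl top pod pinot-server-0
  # what the pod actually requested and is limited to
  kubectl get pod pinot-server-0 -o jsonpath='{.spec.containers[0].resources}'
  ```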
  @pedro.cls93: Ok, so if I understand correctly: the processes running in my server pod are trying to use more memory than the pod has available (limit: 7GB)
  @dlavoie: Exactly
  @pedro.cls93: The JVM usage is as follows:
  @pedro.cls93: The memory of the pod (reddish line):
  @pedro.cls93: Goes to 175%. Meaning that my pod's limit should be at ~12.25GB (7 × 1.75)
  @dlavoie: Just reduce the heap
  @pedro.cls93: Does that sound about right Daniel?
  @pedro.cls93: To half of the limit of the pod?
  @dlavoie: Yes
  @pedro.cls93: Isn't there a concern that such a small heap won't be enough to hold the segments in memory for fast querying?
  @dlavoie: offheap is fast
  @pedro.cls93: Thank you for the assistance, I will try out this config.
  @pedro.cls93: If I may ask one more question: in what scenarios do you want to increase the pod memory? When does it make sense to scale the servers vertically (more memory) vs horizontally (more servers)?
  @dlavoie: Having more servers means you reduce the impact of redistributing or redownloading segments
  @mayanks: What is the event rate from your event stream, and how many partitions do you have?
  @mayanks: Parts of consuming segments are in direct memory, which can OOM if there isn't enough (unlike mmap).
  @pedro.cls93: 16 partitions. It is a scheduled cron, outputting ~50M entries daily. We are in the process of moving from the cron job to an event-based stream.
  @dlavoie: > Parts of consuming segments are in direct memory, which can OOM if there isn't enough (unlike mmap). That will cause a JVM OOM, not a K8S OOM
  @mayanks: Oh sorry, long thread, I assumed it was JVM OOM.
  @pedro.cls93: How can I distinguish them, in K8s? Via the pod logs?
  @dlavoie: Yes,
  @dlavoie: Heap OOM will be observed in the log as the usual dreaded heap exception
  @dlavoie: the K8S OOM will just kill your container with no other mention than an `OOM` message in the pod's events
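  A quick way to tell the two apart (pod name assumed for the example):
  ```
  # K8s OOM: the kernel kills the container; exit code 137, reason "OOMKilled"
  kubectl describe pod pinot-server-0 | grep -A 3 'Last State'

  # JVM OOM: the process aborts itself; the heap exception shows up in the logs
  kubectl logs pinot-server-0 --previous | grep -i 'OutOfMemoryError'
  ```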
  @mayanks: Could we confirm if it is JVM or pod OOM? Also if it is JVM, is it heap or direct memory OOM?
  @pedro.cls93: Got this with a 3GB Java heap and a pod memory request of 6GB (limit 7GB):
  ```
  # A fatal error has been detected by the Java Runtime Environment:
  #
  # SIGBUS (0x7) at pc=0x00007f8c70a013c2, pid=9, tid=0x00007f8bd167c700
  #
  # JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-b10)
  # Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
  # Problematic frame:
  # v  ~StubRoutines::jbyte_disjoint_arraycopy
  #
  # Core dump written. Default location: /opt/pinot/core or core.9
  #
  # An error report file with more information is saved as:
  # /opt/pinot/hs_err_pid9.log
  #
  # If you would like to submit a bug report, please visit:
  #
  [thread 140238581155584 also had an error]
  [thread 140238499997440 also had an error]
  [thread 140237039417088 also had an error]
  [thread 140238498944768 also had an error]
  [thread 140238586418944 also had an error]
  [thread 140238396491520 also had an error]
  [thread 140238582208256 also had an error]
  [thread 140238587471616 also had an error]
  Aborted (core dumped)
  ```
  @dlavoie: Oh!
  @dlavoie: That’s a JVM OOM
  @dlavoie: Good call Mayank
  @pedro.cls93: Exit code 134, yeah, no logs though
  @mayanks: What does the hs_err.log say?
  @mayanks: Likely it is because all 16 partitions are consuming at burst and trying to allocate memory at the same time. If it is direct memory, then it is during consumption. If it is heap, then it is segment generation happening in parallel
  @mayanks: For direct memory, you could reduce partitions and increase the footprint of the JVM. For heap, you can limit the number of segments generated in parallel
  @mayanks: You can also throttle the event rate instead of pumping 50M records per burst
  @pedro.cls93: Do you mean reducing the number of partitions in kafka?
  @pedro.cls93: Or the `segmentPartitionConfig` config of the pinot table?
  @mayanks: Kafka. But let’s first find out if heap or direct memory
  @pedro.cls93: I'm trying to access the log file but the particular fs path where it is stored is not in a pvc
  @mayanks: Anything in server log?
  @pedro.cls93: `Exception in thread "HitExecutionView__12__49__20210524T1141Z" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code` Off-heap memory, I assume?
  @pedro.cls93: Full trace:
  ```
  Exception in thread "HitExecutionView__12__49__20210524T1141Z" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
      at org.apache.pinot.segment.local.function.InbuiltFunctionEvaluator$FunctionExecutionNode.execute(InbuiltFunctionEvaluator.java:116)
      at org.apache.pinot.segment.local.function.InbuiltFunctionEvaluator.evaluate(InbuiltFunctionEvaluator.java:87)
      at org.apache.pinot.segment.local.recordtransformer.ExpressionTransformer.transform(ExpressionTransformer.java:95)
      at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:82)
      at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:509)
      at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:416)
      at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:556)
      at java.lang.Thread.run(Thread.java:748)

  # A fatal error has been detected by the Java Runtime Environment:
  #
  # SIGBUS (0x7) at pc=0x00007f51c5052416, pid=9, tid=0x00007f50a5553700
  # JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-b10)
  # Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
  # Problematic frame:
  # v  ~StubRoutines::jbyte_disjoint_arraycopy
  #
  # Core dump written. Default location: /opt/pinot/core or core.9
  #
  # An error report file with more information is saved as:
  # /opt/pinot/hs_err_pid9.log
  #
  # If you would like to submit a bug report, please visit:
  #
  [thread 139984278378240 also had an error]
  [thread 139984280483584 also had an error]
  Aborted (core dumped)
  ```
  @pedro.cls93: ```LLRealtimeSegmentDataManager$PartitionConsumer.run``` Appears to be in the partition consumer
  @mayanks: Are you using any groovy functions?
  @pedro.cls93: Yes
  @mayanks: What does it do
  @pedro.cls93: Parses a time string in a weird format from a JSON field to the Java standard and returns the milliseconds since epoch of that time.
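  Presumably a table config snippet along these lines (the column names and input format are invented for illustration; `Groovy({...}, args)` is Pinot's documented transform syntax):
  ```
  "ingestionConfig": {
    "transformConfigs": [{
      "columnName": "eventTimeMillis",
      "transformFunction": "Groovy({new java.text.SimpleDateFormat('yyyy/MM/dd HH.mm.ss').parse(rawTime).getTime()}, rawTime)"
    }]
  }
  ```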
  @mayanks: Ok. My guess is the heap OOM
  @mayanks: Do you have logs from server?
  @mayanks: In the server config there is a way to limit the number of segments being flushed in parallel.
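  Likely referring to a server property along these lines (name as best recalled, worth verifying against the docs for your Pinot version):
  ```
  # cap how many realtime segments a server builds in parallel (0 = unlimited)
  pinot.server.instance.realtime.max.parallel.segment.builds=2
  ```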
  @pedro.cls93: That is the last line in the log from the server.
  @mayanks: Yeah but can you check if there was segment generation happening around that time
  @mayanks: At a high level, 3GB for 16 partitions getting 50M events in a burst is low for heap as well as direct memory
  @pedro.cls93: I can't see anything referring to segment generation:
  @dlavoie: I think the relevant logs are only available inside the pod (stdout only shows WARN). The INFO file is lost on restart with the default configs.
  @mayanks: If you want to optimize cost, then you can throttle the burst so the consumption event rate is lower, reduce the number of partitions in Kafka once the max event rate is low, and limit the number of parallel segment generations
  @pedro.cls93: I see some INFO level logs in stdout
  @mayanks: Or else just add more vms :grinning:
  @pedro.cls93: At this point, simply having a formula that lets me know the memory requirements and lets me size up the servers would be good enough
  @mayanks: Yeah, there is a realtime provisioning tool in the docs
  @mayanks: Have you tried it?
  @pedro.cls93: ?
  @mayanks: Yes
  @mayanks: Also there is a doc about it
  @mayanks: It might not take care of the bursty nature of your events, but let's see what it proposes
  @pedro.cls93: I'll take a look, get back to you soon. Thank you both so much for the help
  @pedro.cls93: Is this tool available as an image, or within an image?
  @dlavoie: I think you should find it within the pinot image as a standalone script in `bin`
  @pedro.cls93: Is the tool meant to take a long time? It's been running for 5m.
  ```
  docker run --rm -v /home/pedro/dev/Pinot:/tmp/volume apachepinot/pinot:release-0.7.1 RealtimeProvisioningHelper \
    -ingestionRate 1000 \
    -numPartitions 16 \
    -retentionHours 720 \
    -numRows 50000000 \
    -tableConfigFile /tmp/volume/specs/tables/HitExecutionView_REALTIME.json \
    -schemaWithMetadataFile /tmp/volume/specs/schemas/HitExecutionView.json

  Executing command: RealtimeProvisioningHelper -tableConfigFile /tmp/volume/specs/tables/HitExecutionView_REALTIME.json -numPartitions 16 -pushFrequency null -numHosts 2,4,6,8,10,12,14,16 -numHours 2,3,4,5,6,7,8,9,10,11,12 -schemaWithMetadataFile /tmp/volume/specs/schemas/HitExecutionView.json -numRows 50000000 -ingestionRate 1000 -maxUsableHostMemory 48G -retentionHours 720
  ```
@shaileshjha061: Hi, Prometheus/Grafana and Pinot are deployed in two different namespaces. Can we integrate the two and monitor? We are not getting Pinot metrics in Prometheus. @fx19880617 @mayanks
  @dlavoie: That should work out of the box.
  @dlavoie: What is your scrape configuration in Prometheus?
  @dlavoie: Are the pinot endpoints registered?
  @shaileshjha061: Pinot is not there in the scrape configuration.
  @dlavoie: Is your prometheus configured to listen to k8s annotations to discover scraping configs?
  @shaileshjha061: I have added
  ```
  service:
    annotations:
      "prometheus.io/scrape": "true"
      "prometheus.io/port": "8008"
  ```
  @shaileshjha061: according to docs
  @dlavoie: That is for Pinot
  @dlavoie: But is Prometheus aware it needs to listen to k8s annotations?
  @dlavoie: Also
  @dlavoie: `service.annotations` is not the spot for your annotations
  @shaileshjha061: > But is Prometheus aware it needs to listen to k8s annotations? Can you helm me with this?
  @dlavoie: I can’t helm you :slightly_smiling_face:
  @dlavoie: Only help :rolling_on_the_floor_laughing:
  @shaileshjha061: yeah:smiley:
  @dlavoie: Are the annotations present on your pinot pods?
  @shaileshjha061:
  ```
  podAnnotations:
    "prometheus.io/scrape": "true"
    "prometheus.io/port": "8008"
  ```
  @dlavoie: Please show the output of `kubectl get pod <pod-name> -o yaml`
  @shaileshjha061:
  @dlavoie: labels are fine
  @dlavoie: Can you share your prometheus configuration file?
  @shaileshjha061: let me get that
  @shaileshjha061: Does describe pod work?
  @shaileshjha061: Prometheus is deployed by someone else. It will take time to get the files.
  @dlavoie: That prometheus needs to be configured to look up the pod annotations
  @dlavoie: The configuration is not part of the pod definition.
  @dlavoie: your prometheus SRE should know how to enable that
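  For reference, annotation-based pod discovery on the Prometheus side is typically configured roughly like this (a minimal sketch of the standard community pattern; the job name is arbitrary):
  ```
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # only keep pods annotated with prometheus.io/scrape: "true"
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: "true"
        # scrape on the port given by the prometheus.io/port annotation (e.g. 8008)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: '([^:]+)(?::\d+)?;(\d+)'
          replacement: '$1:$2'
          target_label: __address__
  ```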
  @shaileshjha061: let me check
  @shaileshjha061: Thanks Daniel
@jmeyer: Hello :slightly_smiling_face: When no data is found (e.g. after filtering), an `AVG` aggregation returns `'-Infinity'`. Does this behavior change once we enable null value support (`nullHandlingEnabled`)? Either way, should the Python client parse the string values (since JSON doesn't have support for inf) as `[-]np.inf`?
@mayanks: I think this may not be related to nullValue support. What does query console return?
  @jmeyer: I was thinking there may be a link, since, at least in my opinion, returning `NaN` is better than any representation of `Infinity` in this case
  @jmeyer: Though I agree that it's not quite related, indeed
  @mayanks: Is the column on which you are taking the avg defined as a metric?
  @jmeyer: It is
  @jmeyer: Float, metric
@jmeyer: The API returns `"-Infinity"` and the Web UI shows the same
  @mayanks: Can you paste the returned json?
  @jmeyer: Sure
  @mayanks: If no rows are selected, why is there even a value being returned?
  @jmeyer:
  ```
  {'exceptions': [],
   'minConsumingFreshnessTimeMs': 0,
   'numConsumingSegmentsQueried': 0,
   'numDocsScanned': 0,
   'numEntriesScannedInFilter': 2,
   'numEntriesScannedPostFilter': 0,
   'numGroupsLimitReached': False,
   'numSegmentsMatched': 0,
   'numSegmentsProcessed': 2,
   'numSegmentsQueried': 2,
   'numServersQueried': 1,
   'numServersResponded': 1,
   'resultTable': {'dataSchema': {'columnDataTypes': ['DOUBLE'],
                                  'columnNames': ['aggregated_value']},
                   'rows': [['-Infinity']]},
   'segmentStatistics': [],
   'timeUsedMs': 6,
   'totalDocs': 894,
   'traceInfo': {}}
  ```
  @mayanks: Hmm, I would have expected `rows` to be empty?
  @mayanks: Given `numDocsScanned` = 0
  @mayanks: @jackie.jxt?
  @jmeyer: Yeah, sounds more logical to me too. I can confirm that querying without the aggregation really yields 0 documents
  @jackie.jxt: I assume this is a `max()` aggregation?
  @jmeyer: @jackie.jxt Nope, it is `AVG`
  @jackie.jxt: For aggregation, when there is no record selected, Pinot returns the default aggregation result
  @mayanks: Why not empty?
  @jmeyer: (And `SUM` returns `0.0` )
  @jackie.jxt: Because even if there is no row selected, aggregation still makes sense. `SUM` of 0 records is `0.0`
  @jackie.jxt: For `MAX`, it is `-Infinity`; For `MIN`, it is `Infinity` etc.
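  Concretely (table, column, and filter here are placeholders; the filter just has to match no rows):
  ```
  SELECT MAX(metric_col) FROM myTable WHERE dim_col = 'no-such-value'  -- -Infinity
  SELECT MIN(metric_col) FROM myTable WHERE dim_col = 'no-such-value'  --  Infinity
  SELECT SUM(metric_col) FROM myTable WHERE dim_col = 'no-such-value'  --  0.0
  SELECT AVG(metric_col) FROM myTable WHERE dim_col = 'no-such-value'  -- -Infinity (as reported above)
  ```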
  @mayanks: In group-by though, would we have empty results?
  @jackie.jxt: Yes, because no group exists
  @mayanks: If you think of the output as a table, then we should return an empty row for aggregation-only queries too
  @jmeyer: @jackie.jxt Is there any way to influence this default aggregation result?
  @jackie.jxt: No, the default result is fixed for each aggregation
  @jackie.jxt: It seems the standard SQL behavior is to return `null` when no record is selected
  @jmeyer: Would this change be desirable for Pinot?
  @jackie.jxt: Pinot does not support real `null` yet, and `null` is always represented as a default value as of now
  @jackie.jxt: But we are trying to match the standard SQL behavior
  @jmeyer: Even with ? I know (according to the docs) that it's still "partial" support, as in there's no "interaction" with aggregations etc., but it's possible to efficiently filter null values (and not mix them with real values [e.g. 0.0]), right? *Edit:* Updated link
  @jackie.jxt: True, currently we only support explicit filtering on null values
  @jackie.jxt: We don't support null values in the query responses yet
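  So with null handling enabled in the table config, an explicit filter like this should work (names are placeholders):
  ```
  -- requires "nullHandlingEnabled": true in tableIndexConfig
  SELECT AVG(metric_col) FROM myTable WHERE metric_col IS NOT NULL
  ```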
  @jmeyer: Yep, okay. So AVG with null values (without explicit filtering) will just result in the values being treated as their default (e.g. 0.0) in the aggregation?
  @jackie.jxt: Yes
  @jmeyer: Thanks @jackie.jxt :slightly_smiling_face:
  @jmeyer: And @mayanks
  @jackie.jxt: The default null values are documented here:
  @jmeyer: @jackie.jxt And empty group aggregation results?
@jcole: @jcole has joined the channel

#pinot-dev


@santosh.reddy: @santosh.reddy has joined the channel
@moradi.sajjad: We at LinkedIn saw production issues with a recent commit in the upload segment endpoint. I described it in this , but wanted to share it here as well. (cc @tingchen @jackie.jxt @ssubrama @yupeng @fx19880617 - ppl involved in the PR)
  @tingchen: Thanks for reporting the issue @moradi.sajjad. I will take a look today and comment on the issue about the solution and fixes.

#feat-partial-upsert


@qiaochu: @yupeng @jackie.jxt let me try to reproduce this issue and see if it still exists

#minion-improvements


@laxman: Status
=====
• Added some raw unit tests to reproduce the issue with schema changes.
• Cherry-picked and merged record reader changes from the master branch.
• Added a null value transformer to SegmentMapper alone.
• Fixed and deployed in multiple test environments.
The above fixes together look to be working. Will observe and update the status again. Please take a look at the changes here. Unit tests here.
  @laxman: > Added null value transformer to SegmentMapper alone @jackie.jxt / @fx19880617: Do you see any issues with this approach? Please take a look at the unit test and the fix once. I felt there is no need to preserve the null vector in the generic row while persisting the segment. The changes as of now are not ready to merge into master. Your review and thoughts around this, please.
  @laxman: I still think this fix doesn't work in at least one scenario. Consider the case of adding a nullable new field to the schema whose value is truly null and which doesn't have a default null value. In that case the above fix still fails. Eventually, I think we still need the Avro-union-based fix along with the null value transformer fix to handle all the cases.
  @jackie.jxt: In Pinot, a default null value is always needed. We use a null value vector to store the docIds for all null values.
  @jackie.jxt: Currently in the reducer `Collector`, the null values info is missing, and we only have the default value preserved
  @jackie.jxt: I'll try to fix this behavior early this week
  @laxman: @jackie.jxt: does that mean the fix I did (null value transformer in mapper alone) causes some data issues while converting an old segment created before schema changes?
  @jackie.jxt: It can solve half of the problem, but not the whole problem because the reducer will lose the null value information
@laxman: cc: @jackie.jxt