#general
@bowlesns: So based off of the docs, since Pinot doesn’t have a specific date time format, and dates are converted to either strings, longs, or ints, does this hinder performance in any way? If it does, are there plans to add support for a datetime format?
@g.kishore: That’s right.. it can be stored in any format.. Strings are not great for performance but int/long primitive are better
@g.kishore: Can you explain “support for datetime”
@g.kishore: Pinot does support datetime but it allows users to dictate how datetime is stored which can be good and bad
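For context, a minimal sketch of what "dictating how datetime is stored" looks like in a Pinot schema: a `dateTimeFieldSpecs` entry can store the value as a primitive long (the field name below is just an example).
```json
{
  "dateTimeFieldSpecs": [
    {
      "name": "eventTime",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```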
@bowlesns: That makes sense, thank you!
@gonzalesteb: @gonzalesteb has joined the channel
@accounts_slack: @accounts_slack has joined the channel
@harbingeryellow: @harbingeryellow has joined the channel
@nick.saggese: @nick.saggese has joined the channel
#random
@gonzalesteb: @gonzalesteb has joined the channel
@accounts_slack: @accounts_slack has joined the channel
@harbingeryellow: @harbingeryellow has joined the channel
@nick.saggese: @nick.saggese has joined the channel
#group-by-refactor
@gonzalesteb: @gonzalesteb has joined the channel
#troubleshooting
@gonzalesteb: @gonzalesteb has joined the channel
@accounts_slack: @accounts_slack has joined the channel
@harbingeryellow: @harbingeryellow has joined the channel
@aaron: Is anybody using Pinot with an on-prem S3-like filesystem rather than AWS' S3? I am doing this and trying to run a batch ingest, and I get this error:
```
Got exception to kick off standalone data ingestion job - java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
    at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:144) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:164) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:184) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
Caused by: java.io.IOException: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: 0306422796023ADB, Extended Request ID: njXFdh82iDAWK78LUjRq1SCfJDgSD0Dcr9EhworrYh4CT7X0ZsPFVmHl2TUSmLK9eP/EyAwhAm8=)
    at org.apache.pinot.plugin.filesystem.S3PinotFS.mkdir(S3PinotFS.java:308) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:127) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
    ... 4 more
```
@aaron: Ok so -- looks like the batch ingest job was loading my credentials from `~/.aws/credentials` which 1) were not for this filer and 2) don't have the ability to specify my endpoint.
@aaron: I've configured the controller and server with the right credentials and endpoint as documented here:
@aaron: i.e. I'm setting:
```
pinot.controller.storage.factory.s3.region=ap-southeast-1
pinot.controller.storage.factory.s3.accessKey=foo
pinot.controller.storage.factory.s3.secretKey=foo
pinot.controller.storage.factory.s3.endpoint=
```
@aaron: (and s/controller/server as well for the server conf)
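For reference, the server-side equivalent would look roughly like this (the endpoint value here is just a placeholder):
```
pinot.server.storage.factory.s3.region=ap-southeast-1
pinot.server.storage.factory.s3.accessKey=foo
pinot.server.storage.factory.s3.secretKey=foo
pinot.server.storage.factory.s3.endpoint=http://my-onprem-s3:9000
```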
@aaron: How can I pick up these settings for the batch ingest job? After deleting .aws/credentials I get this error on batch ingest:
@aaron: ```Caused by: java.io.IOException: software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [
  SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId).,
  EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId).,
  WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set.,
  ProfileCredentialsProvider(): Profile file contained no credentials for profile 'default': ProfileFile(profiles=[]),
  ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set.,
  InstanceProfileCredentialsProvider(): Unable to load credentials from service endpoint.]
```
@aaron: Is there any way to set my own endpoint for batch ingestion?
@bowlesns: When you say S3-like, can you give more detail? I don’t know the low-level details of the S3 plugin, but I’m guessing you won’t want to use that unless it’s actually S3 you’re grabbing from.
@aaron: Oh sure -- it's literally API-compatible with S3, just I need to set the endpoint to something on-prem rather than AWS' servers
@aaron: In other words, from the Pinot docs, if I set `pinot.controller.storage.factory.s3.endpoint` and the server equivalent I should be good -- but somehow this doesn't seem to be working for the batch ingest?
@aaron: I think the S3 plugin should work. I already do this with Trino and Trino's built-in S3 support using the aws sdk works
@aaron: Ok, I think I figured this out -- in addition to the S3PinotFS config options in the controller and server configuration files, I needed to set them in the job spec
@bowlesns: I had to do the same for GCP. Not sure if you’ve seen this but this doc has an example job file
@aaron: Thanks! I wasn't aware that I could put more under `configs` than region, this seems to work!
@fx19880617:
@fx19880617: you can put endpoint and more configs
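As a sketch of what's being discussed, the `pinotFSSpecs` section of a standalone ingestion job spec can carry the S3 settings under `configs` (the endpoint value below is a placeholder):
```yaml
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'ap-southeast-1'
      accessKey: 'foo'
      secretKey: 'foo'
      endpoint: 'http://my-onprem-s3:9000'
```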
@nick.saggese: @nick.saggese has joined the channel
@kha.nguyen: Hi there, I'm currently importing extremely large CSVs in batches into Pinot. Does Pinot have functionality that tells you the CSV row number if there are errors with the CSV when it's imported as a batch file into Pinot?
@g.kishore: There is a hidden variable $rowId when you query Pinot
@g.kishore: You can use that to debug
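As a rough sketch of that debugging approach (table and column names here are made up, and the exact virtual column name should be checked against the docs):
```sql
SELECT $rowId, someColumn
FROM myTable
WHERE someColumn = 'suspect-value'
LIMIT 10
```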
@bowlesns: Ok so an update on using the minions to ingest, after small changes I see this in the logs. The tar.gz file exists in the bucket, but it looks like it tries to push anyways to the path `/segments/blah.tar.gz`. Not sure if this is a path on the controller, or if it’s supposed to be the bucket. Any ideas?
@fx19880617: looks like this is the push job type issue. Default push is segment tar push
@fx19880617: ```Trying to push Pinot segment with push mode tar from```
@fx19880617: we should change it to metadata push or uri push, let me double check
@bowlesns: Where is it trying to push to? I don’t see any changes to what’s in the deep storage bucket configured on the controller so not sure.
@bowlesns: and thanks!
@fx19880617: can you add a config to ingestion job: ```push.mode=metadata```
@fx19880617: in parallel to input.uri, etc
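Roughly, the minion batchConfigMap being discussed could look something like this; all keys other than `push.mode` are illustrative and should be checked against the docs:
```json
{
  "batchConfigMaps": [
    {
      "inputDirURI": "s3://my-bucket/raw-data/",
      "inputFormat": "csv",
      "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
      "push.mode": "metadata"
    }
  ]
}
```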
@bowlesns: When it’s set to metadata, what is the behavior vs the default of segment tar push?
@bowlesns: I would think the default would push the tar to the output bucket I have set on the controller
@fx19880617: metadata/uri mode will not try to save segment in controller
@fx19880617: segment tar push expects a controller deep store to save all the tar files
@fx19880617: I still need to investigate why tar push mode failed
@bowlesns: Do I maybe need to define that deep store somewhere in the batchConfigMap or the minion?
@bowlesns: Thanks for clarifying
@fx19880617: I think if your minion is started with deepstore credential then should be fine
@fx19880617: otherwise, the batchConfig should contain the credentials
@fx19880617: uri push is lightweight on the client (minion) side; the controller will download the segment and extract segment metadata
@fx19880617: metadata push is lightweight on the controller side; the client (minion) will download the segment based on the uri and extract the metadata, then upload the metadata to the controller, with no controller deep store download involved
@bowlesns: This is great info. Thank you so much I really appreciate it.
@fx19880617: You are welcome! I’m also going to add this into pinot docs
@bowlesns: I can try and add that this weekend or at least get some of the writing out of the way.
@fx19880617: sounds good! Many thanks!
@bowlesns: I have the mode for the batch ingest set to `APPEND`. I believe I tried `REPLACE` as well before, but I think it didn’t like that
@ken: I ran a query designed to cause problems for the cluster (`select distinctcount(<super-high cardinality column>) from table`), and it did. The request timed out, even though I gave it a 100,000ms timeout, and now all queries (e.g. select * from crawldata limit 20) time out. I’ve looked at the controller/broker/sample of server logs, and don’t see any errors. In the broker log it looks like it’s getting no responses from servers: ```2021/02/19 22:21:53.860 INFO [BaseBrokerRequestHandler] [jersey-server-managed-async-executor-59] requestId=41163,table=crawldata_OFFLINE,timeMs=10000,docs=0/0,entries=0/0,segments(queried/processed/matched/consuming/unavailable):0/0/0/0/0,consumingFreshnessTimeMs=0,servers=0/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);116.202.83.208_O=0,-1,0,0;168.119.147.123_O=0,-1,0,0;168.119.147.125_O=1,-1,0,0;168.119.147.124_O=1,-1,0,0;116.202.52.154_O=1,-1,0,0,query=select * from crawldata limit 20``` But an example server log has this: ```2021/02/19 22:21:43.864 INFO [QueryScheduler] [pqr-11] Processed requestId=41163,table=crawldata_OFFLINE,segments(queried/processed/matched/consuming)=213/1/1/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=2,resSerMs=1,totalTimeMs=3,minConsumingFreshnessMs=-1,broker=Broker_168.119.147.124_8099,numDocsScanned=20,scanInFilter=0,scanPostFilter=620,sched=fcfs``` Trying to figure out which process or processes are borked because of the query, and why. Any ideas? Thanks!
@fx19880617: I feel the server cpu is still on the heavy query @jackie.jxt might have better understanding on the query lifecycle when query timed out
@ken: Could be, but I’d expect the pinotServer.log file to contain some kind of timeout for the subsequent query, versus what looks like a reasonable entry (e.g. `numDocsScanned=20` , etc)
@fx19880617: there should be a timeout in the log unless the query didn’t stop as expected
@jackie.jxt: I feel it might be the transport layer (netty) problem where servers are trying to serialize back too much data
@ken: I do see that there’s one missing requestId in the logs, which I think was the problematic request.
@jackie.jxt: It might somehow block the netty connection and cause the response for the second query not sending back
@ken: For a `distinctcount`, I guess each server has to send back to the broker all unique values for the column, for every segment that it’s processing.
@jackie.jxt: Server will combine all the values into a set, then send the set back
@ken: Right, but each server then sends back its set to the broker, which has to combine to get the final count
@jackie.jxt: Yes, broker has to merge all sets
@ken: I just did a thread dump from one of the Pinot Server processes, and everything looks fine - no Pinot code running, nothing blocked.
@jackie.jxt: Based on the log you posted, server side processed the second query without any issue, but broker didn't receive the response, and that's why I suspect something is broken in the transport layer. Maybe also check broker log to see if everything looks normal
@ken: Broker logs look normal, at least to me. See my broker log entry in the initial question.
@jackie.jxt: I feel the problem might be within the netty connection. Can you try restarting the broker and see if the problem is solved?
@jackie.jxt: The symptom here is that servers do get the query, and send the response back, but broker somehow does not receive the responses
@ken: @jackie.jxt Restarting the broker worked, thanks!!
@ken: But seems odd there were no errors in the broker log file. Should I file an issue about that?
@jackie.jxt: Yes, please file an issue so we can track this, thanks
@jackie.jxt: We rely on netty to transport data, maybe we hit some limitation in netty, but netty didn’t trigger the exception callback
#community
@gonzalesteb: @gonzalesteb has joined the channel
#announcements
@gonzalesteb: @gonzalesteb has joined the channel
#presto-pinot-streaming
@kha.nguyen: @kha.nguyen has joined the channel
#pinot-perf-tuning
@gonzalesteb: @gonzalesteb has joined the channel
#getting-started
@gonzalesteb: @gonzalesteb has joined the channel
#segment-write-api
@fx19880617: @fx19880617 has joined the channel
@yupeng: @yupeng has joined the channel
@npawar: @npawar has joined the channel
@fx19880617: Create this channel for segment write api
@fx19880617: @yupeng do you have sometime next week to discuss the context and detailed requirements and we can start a doc on this.
@yupeng: yes
@yupeng: how about some time Tue?
@npawar: works for me
@fx19880617: sounds good
@npawar: 11 am?
@fx19880617: I sent the invitation, feel free to move it around
