#general


@vicky301186: how can I see the query plan in Pinot? I want to verify it only hits a certain set of segments based on a specific time-range filter in my query
  @sosyalmedya.oguzhan: i don't know whether you can see the query plan or not, but you can check the number of scanned segments in your query response
  @prashant.pandey: I think this feature is in development. However, I think the response stats contain the info you need.
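  For reference, the broker's query response metadata already exposes per-query segment stats that can confirm whether time-based pruning kicked in. A trimmed sketch of that part of the response (field values below are illustrative, not from a real run):
  ```
  {
    "numServersQueried": 4,
    "numServersResponded": 4,
    "numSegmentsQueried": 24,
    "numSegmentsProcessed": 3,
    "numSegmentsMatched": 3,
    "numDocsScanned": 120000,
    "totalDocs": 5000000,
    "timeUsedMs": 42
  }
  ```
  Roughly speaking, the smaller `numSegmentsProcessed` and `numSegmentsMatched` are compared to `numSegmentsQueried`, the more segments were pruned for that query.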
@vicky301186: Hi Team, I am trying to create hour-based segments in Pinot, but it's creating more than one folder under segments for the same hour. I guess this is due to some default row/data size limit. Can I modify these default configurations, and how? Also, what is the preferable size of a data segment in Pinot? What is the philosophy here: many small files, or fewer files of a decent size? Any reference on the above?
  @vicky301186: schema: ```
  {
    "schemaName": "svd",
    "dimensionFieldSpecs": [
      { "name": "serviceId", "dataType": "STRING" },
      { "name": "currentCity", "dataType": "STRING" },
      { "name": "currentCluster", "dataType": "STRING" },
      { "name": "phone", "dataType": "STRING" },
      { "name": "epoch", "dataType": "LONG" }
    ],
    "metricFieldSpecs": [
      { "name": "surge", "dataType": "DOUBLE" },
      { "name": "subTotal", "dataType": "DOUBLE" }
    ],
    "dateTimeFieldSpecs": [
      {
        "name": "dateString",
        "dataType": "STRING",
        "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd-HH",
        "granularity": "1:DAYS"
      }
    ]
  }
  ```
  @vicky301186: table config ```
  {
    "tableName": "svd",
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "dateString",
          "transformFunction": "toDateTime(epoch, 'yyyy-MM-dd-HH')"
        }
      ]
    },
    "segmentsConfig": {
      "timeColumnName": "dateString",
      "timeType": "MILLISECONDS",
      "replication": "1",
      "schemaName": "svd"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": ["serviceId"],
      "loadMode": "MMAP",
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "currentCity": {
            "functionName": "Murmur",
            "numPartitions": 4
          }
        }
      }
    },
    "routing": {
      "segmentPrunerTypes": ["partition"]
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableType": "OFFLINE",
    "metadata": {}
  }
  ```
  @mayanks: Hi, you can refer to
  @sosyalmedya.oguzhan: For offline tables, you have to control the number of rows in your output files (which are converted to segments later). Pinot just converts each input file to a segment, so one file equals one segment. For your realtime tables, you can check the configurations
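  For realtime tables, the usual knobs are the flush thresholds in `streamConfigs`; a hedged sketch (property names can differ slightly across Pinot versions, and the values here are illustrative):
  ```
  "streamConfigs": {
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "6h",
    "realtime.segment.flush.threshold.segment.size": "200M"
  }
  ```
  Setting the row threshold to 0 lets the size-based threshold drive segment size. For offline ingestion, as noted above, the practical lever is how many rows go into each input file, since each file becomes one segment.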
@tiger: @tiger has joined the channel
@jai.patel856: I had a general question about Upsert. Are the resources required expected to be “significantly” higher than for a normal realtime table? I ask because our upsert table seems to take significantly more resources. Our upsert table is considerably wider, but I’d like to understand whether it’s that width that’s contributing the bulk of the load, or whether it could be Upsert itself.
  @g.kishore: yes, upsert needs more resources because of the key-to-row-id mapping. But the number of columns in the table should not increase the overhead.
  @yupeng: also, consider keeping primary key values simple (e.g. a single value rather than a composite). or use this `hashFunction`
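  For reference, a rough sketch of where that setting lives in the table config (exact field names and supported hash functions depend on the Pinot version):
  ```
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "MURMUR3"
  }
  ```
  Hashing a composite key (e.g. UUID+UUID) down to a fixed-size value keeps the in-memory key-to-row-id map smaller.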
  @jai.patel856: Thanks. Our keys are UUID or UUID+UUID. The first problem we found was that they were not uniformly distributed. So we hashed them with XX3 (xxhash). That definitely helped with the balance and turned them into longs. But we continue to use the tuple of UUIDs for the partitionKeyColumns.
  @jai.patel856: Oh, and to add a little more detail, we found that the lack of uniformity started with the Kafka key when we used UUIDs. So we weren’t getting an even spread across the servers and we ultimately had hot nodes.
@roberto: One question: in the official Java client (not the JDBC one), is it possible to configure basic auth?
  @mayanks: Seems it does not support that right now. Perhaps you can file an issue?

#random


@tiger: @tiger has joined the channel

#troubleshooting


@kangren.chia: i encountered this issue when trying spark ingestion: ```
Caused by: java.lang.NullPointerException
	at org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(SystemUtils.java:1626)
	at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
	at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
	... 27 more
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2611)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:198)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:93)
	at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:311)
	at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
Exception in thread "main" java.lang.ExceptionInInitializerError
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:359)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:272)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:125)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:67)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
	at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:370)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
```
  @mayanks: What version of Java are you using?
  @kangren.chia: ```
  spark 3.0.2
  pinot 0.7.1

  java -version
  openjdk version "11.0.10" 2021-01-19
  OpenJDK Runtime Environment 18.9 (build 11.0.10+9)
  OpenJDK 64-Bit Server VM 18.9 (build 11.0.10+9, mixed mode, sharing)
  ```
  @kangren.chia: i get the jars for spark submit from here: ```
  ${SPARK_HOME}/bin/spark-submit \
    --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
    --deploy-mode cluster \
    --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/opt/pinot/plugins -Dlog4j2.configurationFile=/opt/pinot/conf/pinot-ingestion-job-log4j2.xml" \
    --conf "spark.driver.extraClassPath=/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar:/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar" \
    --jars local:///opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar,local:///opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar,local:///opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar,local:///opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar \
    local:///opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar -jobSpecFile jobSpec.yaml | tee output
  ```
  @mayanks: We are seeing some issues with newer Spark versions; could you try Spark 2.3.x?
  @mayanks: Here's a similar thread:
  @kangren.chia: i can’t see that thread, i think it’s buried due to the 10k message limit
  @kangren.chia: let me try some workarounds
  @bruce.ritchie: Just upgrade apache commons to latest in your deployment.
  @mayanks: Thanks @bruce.ritchie
@tiger: @tiger has joined the channel
@roberto: hi!! I’m trying to add authentication to my Pinot instance, and it seems that after adding authentication I’m not able to perform queries from the controller UI because of a 403. Is there any way to use authentication from the UI?
  @mayanks: Does this help:
  @roberto: Exactly @mayanks!! I followed that guide. In fact the login page is shown and I can log in without problems
  @mayanks: :+1:
  @roberto: the issue is in the UI when I try to perform a query
  @mayanks: Oh, sorry, I thought you were confirming that you found a solution.
  @roberto: Checking all requests, I see that from the UI all calls include the `Authorization: Basic (my_token)` header, but it isn’t included when a query is performed
  @roberto: I have verified this by calling the `/sql` endpoint directly and adding the header manually, and it worked. I think it is a UI problem
  @mayanks: I take it you have set up the same username/password on the controller as well as the broker?
  @roberto: yep
  @roberto: My controller config: ```
  controller.admin.access.control.factory.class=org.apache.pinot.controller.api.access.BasicAuthAccessControlFactory
  controller.admin.access.control.principals=MY_USERNAME
  controller.admin.access.control.principals.oscilar.password=MYPASSWORD
  controller.segment.fetcher.auth.token=Basic MYTOKEN (calculated as base64(MY_USERNAME:MYPASSWORD))
  ```
  My broker config: ```
  pinot.broker.access.control.class=org.apache.pinot.broker.broker.BasicAuthAccessControlFactory
  pinot.broker.access.control.principals=MY_USENAME
  pinot.broker.access.control.principals.oscilar.password=MYPASSWORD
  ```
  @g.kishore: I don't think we have hooked up the UI for auth yet
  @roberto: ok! that makes sense compared with what I have seen
  @roberto: thanks!

#pinot-dev


@yash.agarwal: @yash.agarwal has joined the channel

#getting-started


@tiger: @tiger has joined the channel
@tiger: Hi, I'm trying to batch ingest a lot of data in some ORC files, what is the recommended way of doing this? I'm currently using the SegmentCreationAndMetadataPush job with the command line interface.
  @g.kishore: That's a good way to get started. In prod, you can use Spark to set up these jobs.
  @tiger: Thanks! Also, is there a way to configure segment generation with batch ingest? For example, is it possible to pass in one ORC file and have it create N segments, or create segments of a specific size?
  @g.kishore: Not as of now. Right now it's one input file -> one Pinot segment
  @g.kishore: there is a segment processing framework WIP that should allow you to do some of these things
  @tiger: Ok got it. How important are segment sizes in pinot? I saw on the FAQ that the recommended size is 100-500MB. Should I try to make it so that all the segments are roughly the same size?
  @mayanks: As long as you are in the ballpark, it is fine.