#general


@ysuo: Hi, if I tag a server as tenanta and a broker as tenanta, then modify the table's tenants config to use the tenanta broker and the tenanta server, and finally rebalance brokers and servers, will the table's index data be moved to the server tagged tenanta? In my test, the index data for this table is still located on the previous server.
  @mayanks: Yes, rebalance will do the move. Are you saying that the data is not moved at all, or that the old server “also” has the data apart from the new one?
  @ysuo: the data is not moved at all.
  @mayanks: When you try the dry run does it give you the new ideal state that is different?
  @ysuo:
  @ysuo: 8216 is my tagged server
  @mayanks: 8216 is new or old?
  @ysuo: new
  @mayanks: What’s the replication?
  @ysuo: but index data is still in 8212.
  @ysuo: 1 replication
  @ysuo: and no index data in 8216
  @mayanks: Then no way to move without downtime
  @ysuo: ok
  @ysuo: I see
  @mayanks: I agree that the rebalance command should give back that feedback. You can file a GH issue for that
  @ysuo: It’s moved to 8216 now. Thanks. And will the index data in 8212 be deleted automatically?
  @mayanks: yes
  @ysuo: RetentionManager task does the job? So I’ll just wait a moment to check if it’s deleted?
  @mayanks: Is it not deleted by rebalance? cc: @jackie.jxt
  @ysuo: I checked the files in 8212. Only empty folders are left and there is no index data anymore.
  @mayanks: Ok, then we are good
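  For reference, here is a minimal sketch of the flow discussed above, assuming the standard controller REST endpoints; the controller URL, instance names, table name, and table type are placeholders rather than values from this thread:
```
# Sketch only: controller URL, instance names, table name and type are placeholders.
CONTROLLER=http://localhost:9000

# Tag the new server and broker for the tenant (server tenant tags carry _OFFLINE/_REALTIME suffixes).
curl -X PUT "${CONTROLLER}/instances/Server_pinot-server-8216_8098/updateTags?tags=tenanta_OFFLINE,tenanta_REALTIME"
curl -X PUT "${CONTROLLER}/instances/Broker_pinot-broker_8099/updateTags?tags=tenanta_BROKER"

# After pointing the table's "tenants" config at tenanta, preview the new ideal state first...
curl -X POST "${CONTROLLER}/tables/myTable/rebalance?type=OFFLINE&dryRun=true"

# ...then run the actual rebalance. As noted above, with replication 1 the segments can
# only be moved with downtime.
curl -X POST "${CONTROLLER}/tables/myTable/rebalance?type=OFFLINE&dryRun=false&downtime=true"
```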
@ysuo: Hi, Pinot initially creates the same number of segments as the number of topic partitions when ingesting Kafka stream data into a table. Can I adjust the segment count instead of following the topic partition count?
  @kharekartik: By adjusting, do you mean fixing the number of segments equal to the number of partitions, or just appending some number suffix?
  @ysuo: If a topic has 100 partitions, can Pinot create just 20 segments when the table is initially created?
  @ysuo: I mean, if a topic has 100 partitions, 100 partition consumers will be created, right? Can the number of partition consumers be adjusted?
  @kharekartik: No, that won't be possible. For the low-level consumer, each partition is consumed independently of the others and goes into a different segment.
  @kharekartik: We may be able to do this in the future with the already-added PartitionGroups classes, whereby a few partitions can be mapped to a single consumer. Currently, though, there is no plan for doing that.
  @g.kishore: Alice.. any reason why you want to do this?
  @g.kishore: You can always merge segments into bigger segments later
  @g.kishore: Maintaining one segment per partition has a lot of benefits
  @ysuo: Due to limited resources (CPUs). If a topic has 100 partitions and I create 5 tables consuming the same topic, will 500 partition consumers be created? Any suggestions I can refer to when designing my tables?
  @ysuo: Thanks, but how can I do this? To merge segments into bigger segments?
  @mark.needham: One way to merge them is once they're in an offline table you can use the merge rollup task -
  @ysuo: Ok, I’ll try. Thanks. @mark.needham
  @mark.needham: there's also a video explaining how it works in more detail -
  @ysuo: Thanks, that’s very helpful.
  @g.kishore: See real-time to offline minion task
  @mark.needham: that one is described here -
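  For reference, a rough sketch of what enabling the merge rollup task might look like in the OFFLINE table config, based on the docs linked above; the "1day" bucket granularity, merge type, and periods are illustrative values only:
```
# Illustrative fragment to merge into the OFFLINE table config (via the UI or the
# controller's table-config API); "1day", "concat" and the periods are example values.
cat <<'EOF' > merge-rollup-task-fragment.json
{
  "task": {
    "taskTypeConfigsMap": {
      "MergeRollupTask": {
        "1day.mergeType": "concat",
        "1day.bucketTimePeriod": "1d",
        "1day.bufferTimePeriod": "1d"
      }
    }
  }
}
EOF
```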
@octchristmas: I'm looking for a way to delete (a row) and change data in Pinot. For example, if a member withdraws, all of that member's data must be deleted immediately. - I can replace segments for a full period in an offline table. - In realtime tables I would use UPSERT mode: I can upsert null values, but then I can't use the star-tree index. Can I delete without using UPSERT mode? Is there a way to delete a row from a segment of an offline or realtime table in Pinot?
  @mark.needham: Hey - you can't delete individual rows. As you said, you can only delete at the segment level for both offline and real-time tables.
  @richard892: @octchristmas I'm not a GDPR expert but if you get a GDPR request is it enough to make it impossible to _retrieve_ data for that user?
  @richard892: because if that's enough (I know there are strategies in some frameworks, like encrypting the data and throwing away the encryption key when the request comes in) we can easily add an indexing feature to support this
  @richard892: essentially we could apply a mask to the data meaning "don't read this row", while keeping the offline segment format immutable
  @octchristmas: @mark.needham Thanks for the answer. How do I replace the segments (consuming and completed) of a realtime table?
  @octchristmas: @richard892 Thanks for the answer, are encryption keys managed per user? I want to know more. Do you have any documentation on this?
  @richard892: sorry I was asking about your requirements, not describing a feature
  @mayanks: @octchristmas The minion purge task can be used for GDPR purging
  @octchristmas: @richard892 I have to comply with the GDPR, and the best way is to delete the individual rows.
  @mayanks: Yes minion purge task for that @octchristmas
  @octchristmas: @mayanks Thanks so much! :heart_eyes: I checked Pinot's PurgeTask, but I didn't mention it because I wanted to see if there was any other way. However, it seems PurgeTask is the only option, so I have a few questions about it. Q1) Can PurgeTask also delete individual rows in the committed (not consuming) segments of a realtime table? Q2) PurgeTask does not seem to delete individual rows by randomly accessing segment files. If PurgeTask downloads, regenerates, and uploads segment files, what is the difference from an ingestion job? I am trying to understand this difference. We would prefer an ingestion job over using Minion and implementing task code, because developing an ingestion job is more familiar to us. Q3) We serve large amounts of data and large numbers of segments. If PurgeTask works by downloading and regenerating, the regeneration and reload of segments will likely affect the cluster or service, regardless of whether PurgeTask or an ingestion job is used. How will the cluster or service be affected?
  @mayanks: 1. Purge task is for offline tables
  @mayanks: 2. It is smarter to avoid regenerating a segment if nothing changed. Also, it takes away the burden of maintaining another ingestion pipeline. But essentially it is the same.
  @mayanks: 3. Download/upload of data should not impact cluster performance. How much data are we talking about?
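  For reference, enabling the purge task follows the same per-table minion pattern as the other tasks; a rough sketch is below. Note the assumptions: the config block only schedules the task (the controller's task scheduler must be enabled), while the actual row-selection logic (e.g. "drop all rows for this member id") has to be supplied by a record purger implementation registered on the minions.
```
# Sketch only: enabling the task per table; "PurgeTask" with an empty config map is an
# assumption based on the generic minion task pattern, and the row-matching logic itself
# comes from a custom record purger registered on the minion side.
cat <<'EOF' > purge-task-fragment.json
{
  "task": {
    "taskTypeConfigsMap": {
      "PurgeTask": {}
    }
  }
}
EOF
```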
@gxm.monica: Hi everyone, I was trying to use spark to do batch ingestion. But I got an error like this when I executed: ```ERROR StatusLogger Unrecognized format specifier [d] ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [thread] ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [level] ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [logger] ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [msg] ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [n] ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern. ERROR StatusLogger Reconfiguration failed: No configuration found for '533ddba' at 'null' in 'null' Exception in thread "main" java.lang.ExceptionInInitializerError at org.apache.pinot.tools.admin.command.StartKafkaCommand.<init>(StartKafkaCommand.java:51) at org.apache.pinot.tools.admin.PinotAdministrator.<clinit>(PinotAdministrator.java:98) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:237) at $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:813) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.util.NoSuchElementException at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.pinot.tools.utils.KafkaStarterUtils.getKafkaConnectorPackageName(KafkaStarterUtils.java:54) at org.apache.pinot.tools.utils.KafkaStarterUtils.<clinit>(KafkaStarterUtils.java:46) ... 12 more``` It seems like spark couldn't find `org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory` from Kafka plugin. 
I built Pinot from source on the `master` branch using this command (because we use JDK 8 on our machines): ```mvn clean install -DskipTests -Pbin-dist -T 4 -Djdk.version=8``` My Spark job command is like this; I've set `-Dplugins.dir` according to : ```export PINOT_VERSION=0.10.0-SNAPSHOT export PINOT_DISTRIBUTION_DIR=/home/xxx/apache-pinot-0.10.0-SNAPSHOT-bin echo ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar cd ${PINOT_DISTRIBUTION_DIR} ${SPARK_HOME}/bin/spark-submit \ --class org.apache.pinot.tools.admin.PinotAdministrator \ --master "local[2]" \ --deploy-mode client \ --conf "spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre" \ --conf "spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre" \ --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \ --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \ ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \ LaunchDataIngestionJob \ -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/transcriptData/sparkIngestionJobSpec.yml``` Is it because Spark couldn't find my plugins' jars from `plugins.dir`? I'm not familiar with Spark; do I need to add all the plugins' jars to the Spark classpath using `--jars` or something? Could you help me?
  @kharekartik: Hi. Can I ask why you need to use Kafka in batch ingestion? Also, can you share the ingestion spec? In your spark-submit command, the pinot-batch-ingestion-spark plugin is missing. It is located at `${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar`
  @kharekartik: Also can you specify the spark version
  @kharekartik: Also the --class needs to be `org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand`
  @kharekartik: You can view the full guide here -
  @gxm.monica: Hi @kharekartik, thank you for your help. I don't need to use Kafka in batch ingestion. Because I used `org.apache.pinot.tools.admin.PinotAdministrator` as the main class before, and it seems to need to load static variable `SUBCOMMAND_MAP` which finally caused the error in this question. Now I change my configuration by the full guide you mentioned above. My new spark job command is like this: ```export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre export HADOOP_VERSION=2.7.2U17-11 export HADOOP_GUAVA_VERSION=11.0.2 export HADOOP_GSON_VERSION=2.2.4 export PINOT_VERSION=0.10.0-SNAPSHOT export PINOT_DISTRIBUTION_DIR=/home/xxx/apache-pinot-0.10.0-SNAPSHOT-bin cd ${PINOT_DISTRIBUTION_DIR} ${SPARK_HOME}/bin/spark-submit \ --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \ --master "local[2]" \ --deploy-mode client \ --conf "spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre" \ --conf "spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-8.b10.el7_5.x86_64/jre" \ --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dplugins.include=pinot-hdfs -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \ --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-hdfs/pinot-hdfs-${PINOT_VERSION}-shaded.jar" \ local:// ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \ -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/transcriptData/sparkIngestionJobSpec.yml``` My new ingestions spec is like this: ```name: 'spark' segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner' segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner' segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner' segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner' extraConfigs: stagingDir: jobType: SegmentCreationAndTarPush inputDirURI: '' includeFileNamePattern: 'glob:**/*.csv' outputDirURI: '' overwriteOutput: true pinotFSSpecs: - scheme: hdfs className: org.apache.pinot.plugin.filesystem.HadoopPinotFS recordReaderSpec: dataFormat: 'csv' className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader' configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig' tableSpec: tableName: 'transcript' pinotClusterSpecs: - controllerURI: '' pushJobSpec: pushAttempts: 2 pushRetryIntervalMillis: 1000``` My spark version is `2.4.5` when I executed, I got an error like this: ```ERROR StatusLogger Unrecognized format specifier [d] ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [thread] ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [level] ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern. 
ERROR StatusLogger Unrecognized format specifier [logger] ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [msg] ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [n] ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern. ERROR StatusLogger Reconfiguration failed: No configuration found for '70dea4e' at 'null' in 'null' Exception in thread "main" java.lang.NoSuchMethodException: org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main([Ljava.lang.String;) at java.lang.Class.getMethod(Class.java:1786) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:42) at $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)``` It seems that `org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand` doesn't have a `main` method?
  @kharekartik: Yes, that was a bug which has now been fixed. Can you pull the latest code from master and build it?
  @kharekartik: Or you can simply add the following to the class (`PluginManager` here is `org.apache.pinot.spi.plugin.PluginManager`, and `CommandLine` is picocli's): ```
public static void main(String[] args) {
  // Load the Pinot plugins before executing the ingestion job command.
  PluginManager.get().init();
  (new CommandLine(new LaunchDataIngestionJobCommand())).execute(args);
}```
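  One extra thing that may be worth double-checking in the command above (not raised in the thread, so treat it as a guess): spark-submit takes the first non-option argument as the application jar, so `local://` followed by a space is likely parsed as the application resource by itself, with the jar path then passed through as an argument to the command. A trimmed sketch of the invocation with the scheme and path joined:
```
# Sketch only: same submit command as above, trimmed; note there is no space after local://
${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dplugins.include=pinot-hdfs" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-hdfs/pinot-hdfs-${PINOT_VERSION}-shaded.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/transcriptData/sparkIngestionJobSpec.yml
```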
@achavan1: @achavan1 has joined the channel
@harish.bohara: @harish.bohara has joined the channel
@mikesheppard2: @mikesheppard2 has joined the channel
@harish.bohara: Hi.. if I have a large number of segments (for realtime tables), is there a setting that merges segments in the background? Or any cron job?
  @mayanks: Yep minion tasks
  @mayanks:
  @harish.bohara: It seems it is only for offline tables: The Minion merge/rollup task allows a user to *merge small segments into larger ones, through which Pinot can potentially benefit from improved disk storage and the query performance*. For complete motivation and reasoning, please refer to the design doc above. Currently, we only support *OFFLINE table APPEND use cases*.
  @mayanks:
  @mayanks: You can use this in conjunction with managed offline flow
  @harish.bohara: :+1:
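  For reference, the managed offline flow mentioned above is driven by the RealtimeToOfflineSegmentsTask configured on the REALTIME table; it moves completed realtime segments into the companion OFFLINE table, where the merge/rollup task can then compact them. A rough sketch, with illustrative periods only:
```
# Illustrative fragment for the REALTIME table config; the bucket/buffer periods are example values.
cat <<'EOF' > realtime-to-offline-task-fragment.json
{
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "1d"
      }
    }
  }
}
EOF
```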

#random


@achavan1: @achavan1 has joined the channel
@harish.bohara: @harish.bohara has joined the channel
@mikesheppard2: @mikesheppard2 has joined the channel

#feat-presto-connector


@mitchellh: @mitchellh has joined the channel

#feat-upsert


@mitchellh: @mitchellh has joined the channel

#fraud


@mitchellh: @mitchellh has joined the channel

#troubleshooting


@achavan1: @achavan1 has joined the channel
@nrajendra434: :wave: hi folks, Looking to understand how to get the pinot ingestion job working on EMR spark 2.4 in cluster mode. Using pinot 0.7.1 since the EMR cluster I'm working with is running on java 8. The following spark-submit works successfully and the pinot segments are getting generated when running in client mode. Here the command to start it on the master node ```sudo spark-submit --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand --master local --deploy-mode client --conf spark.local.dir=/mnt --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet -Dlog4j2.configurationFile=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/conf/pinot-ingestion-job-log4j2.xml" --conf "spark.driver.extraClassPath=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar" /mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar -jobSpecFile /mnt/pinot/spark_job_spec_v8.yaml``` the ingestion spec used is this: ```executionFrameworkSpec: name: 'spark' segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner' extraConfigs: stagingDir: dependencyJarDir: '' jobType: SegmentCreation inputDirURI: '' includeFileNamePattern: 'glob:**/*.parquet' outputDirURI: '' overwriteOutput: true pinotFSSpecs: - className: org.apache.pinot.plugin.filesystem.S3PinotFS scheme: s3 configs: region: us-east-1 recordReaderSpec: dataFormat: 'parquet' className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader' tableSpec: tableName: 'students' schemaURI: '' tableConfigURI: ''``` But when running this on cluster mode, I get the class not found issue. The plugins.dir is available on all the EMR nodes, and we can see that the plugins are getting successfully loaded., I have tried passing the the s3 location as well as the /mnt path, and both are failing with the same error. I looked at these two previous posts and and they did not help in resolving it. 
Here is the error ```22/04/14 07:06:44 INFO PluginManager: Plugins root dir is [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins] 22/04/14 07:06:44 INFO PluginManager: Trying to load plugins: [[pinot-s3, pinot-parquet]] 22/04/14 07:06:44 INFO PluginManager: Trying to load plugin [pinot-s3] from location [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3] 22/04/14 07:06:44 INFO PluginManager: Successfully loaded plugin [pinot-s3] from jar file [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar] 22/04/14 07:06:44 INFO PluginManager: Successfully Loaded plugin [pinot-s3] from dir [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3] 22/04/14 07:06:44 INFO PluginManager: Trying to load plugin [pinot-parquet] from location [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet] 22/04/14 07:06:44 INFO PluginManager: Successfully loaded plugin [pinot-parquet] from jar file [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar] 22/04/14 07:06:44 INFO PluginManager: Successfully Loaded plugin [pinot-parquet] from dir [/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet] 22/04/14 07:06:45 ERROR LaunchDataIngestionJobCommand: Got exception to generate IngestionJobSpec for data ingestion job - Can't construct a java object for tag:,2002:org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec; exception=Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec in 'string', line 1, column 1: executionFrameworkSpec: ^``` Will thread the different commands used to submit this job. Thank you for your help :bow:
  @nrajendra434: this command with the jars pointing to lcoal path ```sudo spark-submit --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand --deploy-mode cluster --jars /mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar,/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar,/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar,/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar --files "/mnt/pinot/spark_job_spec_v8.yaml" --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet -Dlog4j2.configurationFile=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/conf/pinot-ingestion-job-log4j2.xml" --conf "spark.driver.extraClassPath=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar" --conf "spark.executor.extraClassPath=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar:/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar" -jobSpecFile spark_job_spec_v8.yaml``` and this command point to jars on s3 ```sudo spark-submit --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand --deploy-mode cluster --jars --files "/mnt/pinot/spark_job_spec_v8.yaml" --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet -Dlog4j2.configurationFile=" --conf "spark.driver.extraClassPath=" --conf "spark.executor.extraClassPath=" -jobSpecFile spark_job_spec_v8.yaml``` both are failing
  @nrajendra434: i tried including the jars in both driver.extraClassPath and executor.extraClassPath. neither helped
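  Not a confirmed fix for the error above, but one general point about cluster mode that may be worth checking: in YARN cluster mode the driver runs inside a container on a worker node, so absolute paths in spark.driver.extraClassPath and -Dplugins.dir must exist on whichever node is chosen, while jars and files shipped with --jars/--files are localized into that container's working directory and can be referenced by file name. A sketch under that assumption (the /mnt plugin layout is kept, since the logs show it exists on every node):
```
# Sketch only (general Spark-on-YARN behavior, not verified against this error):
# --jars/--files are localized into the container working directory, so classpath
# entries and the job spec can be referenced by basename in cluster mode.
sudo spark-submit --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --deploy-mode cluster \
  --jars /mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar,/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar,/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar,/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar \
  --files /mnt/pinot/spark_job_spec_v8.yaml \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-incubating-0.7.1-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet" \
  --conf "spark.driver.extraClassPath=pinot-all-0.7.1-jar-with-dependencies.jar:pinot-batch-ingestion-spark-0.7.1-shaded.jar:pinot-s3-0.7.1-shaded.jar:pinot-parquet-0.7.1-shaded.jar" \
  --conf "spark.executor.extraClassPath=pinot-all-0.7.1-jar-with-dependencies.jar:pinot-batch-ingestion-spark-0.7.1-shaded.jar:pinot-s3-0.7.1-shaded.jar:pinot-parquet-0.7.1-shaded.jar" \
  /mnt/pinot/apache-pinot-incubating-0.7.1-bin/lib/pinot-all-0.7.1-jar-with-dependencies.jar \
  -jobSpecFile spark_job_spec_v8.yaml
```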
@harish.bohara: @harish.bohara has joined the channel
@mikesheppard2: @mikesheppard2 has joined the channel

#onboarding


@mitchellh: @mitchellh has joined the channel

#aggregators


@mitchellh: @mitchellh has joined the channel

#community


@glenn393: @glenn393 has joined the channel

#presto-pinot-connector


@mitchellh: @mitchellh has joined the channel

#pinot-perf-tuning


@mitchellh: @mitchellh has joined the channel

#thirdeye-pinot


@mitchellh: @mitchellh has joined the channel

#getting-started


@achavan1: @achavan1 has joined the channel
@harish.bohara: @harish.bohara has joined the channel
@mikesheppard2: @mikesheppard2 has joined the channel

#pinot-docsrus


@mitchellh: @mitchellh has joined the channel

#pinot-trino


@mitchellh: @mitchellh has joined the channel