#general


@kiwisoft2011: @kiwisoft2011 has joined the channel
@vickysort: @vickysort has joined the channel
@rajady: @rajady has joined the channel
@chenjie.sau201: @chenjie.sau201 has joined the channel
@harold: @harold has joined the channel

#random


@kiwisoft2011: @kiwisoft2011 has joined the channel
@vickysort: @vickysort has joined the channel
@rajady: @rajady has joined the channel
@chenjie.sau201: @chenjie.sau201 has joined the channel
@harold: @harold has joined the channel

#feat-presto-connector


@hbwang89: @hbwang89 has left the channel

#feat-upsert


@hbwang89: @hbwang89 has left the channel

#feat-better-schema-evolution


@hbwang89: @hbwang89 has left the channel

#troubleshooting


@kiwisoft2011: @kiwisoft2011 has joined the channel
@humengyuk18: Hi team, I’m getting a class not found exception when doing a SegmentCreationAndUriPush job, the `org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner` class cannot be found, below is my job config:
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: '/root/fetrace_biz/data/'
includeFileNamePattern: 'glob:**/*'
outputDirURI: ''
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/opt/hdfs/'
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'fetrace_biz'
  schemaURI: ''
  tableConfigURI: ''
pinotClusterSpecs:
  - controllerURI: ''
```
exception stack is:
```
2021/01/27 03:53:03.942 ERROR [PinotAdministrator] [main] Exception caught:
java.lang.RuntimeException: Failed to create IngestionJobRunner instance for class - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:137) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:123) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:164) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:184) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_275]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_275]
	at org.apache.pinot.spi.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:80) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:293) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:264) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:245) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:135) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	... 4 more
```
  @ken: Does your Pinot distribution directory have a `plugins` sub-dir, which contains a *`pinot-batch-ingestion` sub-dir?*
  @humengyuk18: Yes, it has all the plugins
  @ken: Including `pinot-batch-ingestion-standalone-*.jar`?
  @humengyuk18: Yes
  @ken: What’s the command line you’re using to launch the ingest job?
  @humengyuk18: `/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile fetrace_biz/fetrace_biz-job-spec.yml` with `JAVA_OPTS=-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs` and `CLASSPATH_PREFIX=/root/hadoop-lib/*`
  @ken: I’m running successfully without setting those java options, and executing `bin/pinot-admin.sh`. Though I did have to copy some of the Hadoop jars into my Pinot lib sub-dir. Wondering what happens if you get rid of the -Dplugins.include parameter, as I thought Pinot would include everything in the plugins dir by default.
  @ken: I think if you specify `plugins.include` then it only includes those plugins (comma-separated list)
  @fx19880617: can you try to do `-Dplugins.include=pinot-hdfs,pinot-json,pinot-batch-ingestion-standalone`
  @fx19880617: or just remove `-Dplugins.include`, then the ingestion job will load all the plugins
  @humengyuk18: If I don't specify JAVA_OPTS, I get a Wrong FS exception: ```java.lang.IllegalArgumentException: Wrong FS: hdfs:/pinot/controller/fetrace_biz/, expected: file:///```
  @fx19880617: do you have the full stack trace?
  @ken: I think you need to specify the protocol (`file:/`) for the inputDirURI
  @humengyuk18: full stack trace:
```
Exception caught: java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:144) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:123) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:164) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:184) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs:/pinot/controller/fetrace_biz/, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730) ~[hadoop-common-3.1.1.3.1.0.0-78.jar:?]
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86) ~[hadoop-common-3.1.1.3.1.0.0-78.jar:?]
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:548) ~[hadoop-common-3.1.1.3.1.0.0-78.jar:?]
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:534) ~[hadoop-common-3.1.1.3.1.0.0-78.jar:?]
	at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705) ~[hadoop-common-3.1.1.3.1.0.0-78.jar:?]
	at org.apache.pinot.plugin.filesystem.HadoopPinotFS.mkdir(HadoopPinotFS.java:78) ~[pinot-hdfs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:130) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-255202ec4fc7df2283f7c275d8e9025a26cf3274]
```
  @fx19880617: hmm, does it mean that we need to put namespace inside the output dir uri?
  @fx19880617: btw, something is off here:
```
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
```
  @fx19880617: your data format is csv, but the class name is the JSON record reader
  @humengyuk18: I see, thanks
  @ken: FWIW, I wound up having to fully specify the HDFS URI, as in `outputDirURI: ''`
  @humengyuk18: the job should be able to read hdfs config from the hdfs config dir `/opt/hdfs`
  @ken: calling it a night, good luck
  @humengyuk18: thanks, I will try it
  @fx19880617: so you also mount the hdfs config to path `/opt/hdfs` ?
  @fx19880617: @chinmay.cerebro @tingchen do you recall what has been done at uber
  @humengyuk18: yes, its mounted under `/opt/hdfs`
  @humengyuk18: @fx19880617 it turned out the HDFS XML config had some errors; it was solved by providing a correct config, thanks for your help
  @fx19880617: :+1:
  @fx19880617: is there anything we can do to prevent this from happening again? Like giving clearer stack traces / error messages?
  @humengyuk18: I think the trace is unclear; also, the docs need more detail about setting up HDFS.
  @humengyuk18: @fx19880617 Maybe we can validate the `hdfs.conf.path` before job launch, so if the conf path doesn't exist, the user should provide a namenode instead?
  @fx19880617: agreed
  @fx19880617: I think fs validation can help root cause the problem
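For illustration, a minimal sketch of the kind of pre-flight check being suggested (hypothetical helper, not an existing Pinot API): verify that the configured Hadoop conf directory exists and contains the usual `core-site.xml`/`hdfs-site.xml` before kicking off the job, so a misconfigured `hadoop.conf.path` fails fast instead of surfacing later as a confusing "Wrong FS" error.
```
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class HadoopConfPreflight {

  /**
   * Hypothetical pre-flight check: fail fast with a clear message if the
   * configured Hadoop conf dir is missing or lacks the expected XML files.
   */
  public static void validateHadoopConfPath(String hadoopConfPath) {
    Path confDir = Paths.get(hadoopConfPath);
    if (!Files.isDirectory(confDir)) {
      throw new IllegalArgumentException(
          "hadoop.conf.path does not exist or is not a directory: " + hadoopConfPath
              + ". Provide a valid conf dir, or configure the namenode explicitly.");
    }
    for (String required : new String[]{"core-site.xml", "hdfs-site.xml"}) {
      if (!Files.isRegularFile(confDir.resolve(required))) {
        throw new IllegalArgumentException(
            "hadoop.conf.path is missing " + required + ": " + hadoopConfPath);
      }
    }
  }
}
```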
@vickysort: @vickysort has joined the channel
@rajady: @rajady has joined the channel
@chenjie.sau201: @chenjie.sau201 has joined the channel
@harold: @harold has joined the channel
@harold: Hi. I'm trying to follow the steps here: Does the Pinot schema need to exactly match the corresponding Kafka topic's message format? Does Pinot support flattening the data? Currently we have messages in Kafka in JSON format. I'm looking at setting up Pinot to ingest data from this topic, and the dimensions are currently nested inside a "labels" dictionary of the Kafka message.
  @wrbriggs: I’m not aware of any generic support for flattening the data, but you have a few options - one is to just stick with the high level schema w/ your `labels` dictionary, and use a JSON Index () to speed up querying against it. That obviously isn’t ideal if you also want other types of indices.
  @wrbriggs: Another option would be to use an ingestion transformation to extract the fields you want out of the dictionary into top-level columns:
  @wrbriggs: As you can see, the documentation for Flattening implies that extracting attributes from complex objects is ‘TBD’, and 1:Many requires implementing a custom Decoder:
  @harold: :+1:Thanks. Let me read up on the docs.
  @wrbriggs: You could also use a custom StreamMessageDecoder to simplify what could be a painful set of individual ingestion transformations if you have a lot of columns: Lastly, you could use a custom Kafka Streams job or Kafka Connect pipeline to transform and push the data into another topic, in a format more friendly to ingestion, but obviously that puts more load on Kafka, and adds complexity
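As an illustration of the flattening step such a custom decoder (or an upstream Kafka Streams job) would perform, here is a minimal, hypothetical Jackson sketch that promotes the entries of the nested `labels` object to top-level keys; the class and field names are assumptions, and it is not tied to Pinot's actual `StreamMessageDecoder` interface.
```
import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public final class LabelsFlattener {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  /**
   * Hypothetical flattening step: copy every entry of the nested "labels"
   * object into a top-level key, then drop "labels".
   * e.g. {"metric":"cpu","labels":{"host":"a1","dc":"us-east"}}
   *   -> {"metric":"cpu","labels.host":"a1","labels.dc":"us-east"}
   */
  public static String flatten(String jsonMessage) throws Exception {
    ObjectNode root = (ObjectNode) MAPPER.readTree(jsonMessage);
    JsonNode labels = root.remove("labels");
    if (labels != null && labels.isObject()) {
      for (Iterator<Map.Entry<String, JsonNode>> it = labels.fields(); it.hasNext(); ) {
        Map.Entry<String, JsonNode> entry = it.next();
        root.set("labels." + entry.getKey(), entry.getValue());
      }
    }
    return MAPPER.writeValueAsString(root);
  }
}
```
In the Kafka Streams / Kafka Connect option mentioned above, the same transformation would run upstream and write to a second topic that Pinot ingests directly, so the table schema can use flat top-level columns.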

#pinot-dev


@ken: I’ve run into a few bugs in Pinot caused by `PinotFS.listFiles()` implementations not returning the protocol with the path. So you get back `/user/hadoop/blah`, not `hdfs:///user/hadoop/blah`. When those paths get used later, without knowledge of the file system, then you run into problems. Does anyone know why `listFiles()` (and maybe other methods in a PinotFS implementation) don’t include the protocol?
  @fx19880617: I feel those are bugs in general, it should give the scheme.
  @ken: Wonder what would break with a change like that :slightly_smiling_face:
  @fx19880617: @tingchen are you using this ?
  @tingchen: which PinotFS subclass are you using? HadoopFS? We are using an Uber-internal PinotFS for HDFS.
  @ken: @tingchen HadoopFS
  @tingchen: ```
if (_hadoopFS.exists(path)) {
  // _hadoopFS.listFiles(path, false) will not return directories as files, thus use listStatus(path) here.
  List<FileStatus> files = listStatus(path, recursive);
  for (FileStatus file : files) {
    filePathStrings.add(file.getPath().toUri().getRawPath());
  }
} else {
```
I think getRawPath() is the issue
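For illustration, a small standalone snippet (a sketch, not from the Pinot codebase) showing why `getRawPath()` loses the scheme and authority while the full URI string keeps them:
```
import java.net.URI;

public final class RawPathDemo {
  public static void main(String[] args) {
    URI uri = URI.create("hdfs://namenode:8020/user/hadoop/blah"); // hypothetical authority
    // getRawPath() returns only the path component, dropping "hdfs://namenode:8020".
    System.out.println(uri.getRawPath()); // /user/hadoop/blah
    // toString() preserves the scheme and the authority (namenode).
    System.out.println(uri);              // hdfs://namenode:8020/user/hadoop/blah
  }
}
```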
  @tingchen: our alternative impl uses getPath(), which is similar. So far we haven't run into many issues. May I know what problems you encountered?
  @ken: E.g. with Hadoop batch segment generation, the input paths (written out to temp files, for processing in the job) are all `/user/hadoop/blah` paths, without the scheme, so the mapper then doesn’t know how to process them (since in theory they could be `file:///user/hadoop/blah` paths)
  @ken: And in the stand-alone segment generation code, this same issue meant the code called a routine to “fix up” the paths, but that fix up code had a bug (that I recently fixed) where it would drop the namenode part of the path. So you’d get `hdfs:///user/hadoop/blah`, not the required ``
  @ken: Haven’t tried this…does the PinotFS for files return `file:///user/blah`, or just `/user/blah`?
  @tingchen: Based on the API of PinotFS:
  @tingchen: ```
/**
 * Lists all the files and directories at the location provided.
 * Lists recursively if {@code recursive} is set to true.
 * Throws IOException if this abstract pathname is not valid, or if an I/O error occurs.
 * @param fileUri location of file
 * @param recursive if we want to list files recursively
 * @return an array of strings that contains file paths
 * @throws IOException on IO failure. See specific implementation
 */
public abstract String[] listFiles(URI fileUri, boolean recursive) throws IOException;
```
  @tingchen: It returns file "paths", which are technically the path component of the URI, without the scheme/authority and so on
  @tingchen: so the HadoopFS impl is conforming to the interface here. For your issue, can you get the scheme from the URI?
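For illustration, a minimal sketch of what a caller can do in the meantime (assuming the returned paths belong to the same file system as the URI passed to `listFiles()`): rebuild fully-qualified URIs by re-attaching the scheme and authority of the input `fileUri`. The helper name is hypothetical.
```
import java.net.URI;
import java.net.URISyntaxException;

public final class PinotFsPathUtil {

  /**
   * Hypothetical helper: given the URI that was listed (e.g. hdfs://namenode:8020/user/hadoop)
   * and a scheme-less path returned by listFiles() (e.g. /user/hadoop/blah), rebuild a
   * fully-qualified URI that keeps the scheme and the namenode/authority.
   */
  public static URI qualify(URI listedUri, String returnedPath) throws URISyntaxException {
    return new URI(listedUri.getScheme(), listedUri.getAuthority(), returnedPath, null, null);
  }

  public static void main(String[] args) throws URISyntaxException {
    URI base = URI.create("hdfs://namenode:8020/user/hadoop");
    System.out.println(qualify(base, "/user/hadoop/blah")); // hdfs://namenode:8020/user/hadoop/blah
  }
}
```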

#pinot-0-5-0-release


@hbwang89: @hbwang89 has left the channel