#general


@humengyuk18: I’m getting slow regexp_like performance: for 0.3 billion rows it takes nearly 2 seconds to match a prefix on a column, but in Druid the same data returns instantly using the `like` operator. Are there any configs I can apply to speed up this kind of query?
  @steotia: have you tried a text index?
  @humengyuk18: A text index will require a raw value column, but I was previously getting an array out of bounds exception when using a raw value index.
  @steotia: Text index is supported on both raw and dictionary columns
  @steotia: What is the error you see when creating/using text index ?
  @humengyuk18: I only looked at the documentation, which says only raw values are supported. Is this feature from the last release?
  @steotia: Sorry, that's my bad. Text index on dictionary columns has been supported for quite some time (I think 1 or 2 releases old). I will update the documentation.
  @steotia: Can you share how you are setting up the text index in the table config?
  @humengyuk18: I will try text index in a test table, see if there are any errors.
  @steotia: Sure. Also, it should work for raw columns as well. Please share the call stack for the out of bounds error. The only error we have seen in the past with raw columns is an integer overflow, which was fixed with the new segment format that supports larger string column values.
  @humengyuk18: Will text index have a memory overhead?
  @steotia: It should not. We are running it on raw data where each string value can be as large as 2 million characters. However, for such cases, disabling the dictionary is preferable since dictionary creation will increase heap usage and GC pressure. The text index itself should not introduce any significant memory overhead.
  @humengyuk18: Looks like the text index is not using consuming segment data? Is the text index only built when generating a segment?
  @steotia: It uses the consuming segment as well. Let me know and we can jump on a call to see what's going on
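A minimal sketch of what enabling a text index in the table config could look like (the column name `myTextCol` is a placeholder, not from this thread); per the discussion above, `encodingType` can be `RAW` or `DICTIONARY`, and raw columns are typically also listed under `noDictionaryColumns`:
```
{
  "fieldConfigList": [
    {
      "name": "myTextCol",
      "encodingType": "RAW",
      "indexType": "TEXT"
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": ["myTextCol"]
  }
}
```
A prefix match could then be written as `TEXT_MATCH(myTextCol, 'someprefix*')` (Lucene query syntax) instead of `regexp_like`.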
@saurabhd336: @saurabhd336 has joined the channel
@bcbazevedo: @bcbazevedo has joined the channel
@ming.liu: @ming.liu has joined the channel
@kanleecarro: @kanleecarro has joined the channel
@hari.prasanna: @hari.prasanna has joined the channel
@richard.hallier: @richard.hallier has joined the channel
@bowenwan: @bowenwan has joined the channel
@sharma.vinit: @sharma.vinit has joined the channel

#random


@saurabhd336: @saurabhd336 has joined the channel
@bcbazevedo: @bcbazevedo has joined the channel
@ming.liu: @ming.liu has joined the channel
@kanleecarro: @kanleecarro has joined the channel
@hari.prasanna: @hari.prasanna has joined the channel
@richard.hallier: @richard.hallier has joined the channel
@bowenwan: @bowenwan has joined the channel
@sharma.vinit: @sharma.vinit has joined the channel

#feat-presto-connector


@prabha.cloud: @prabha.cloud has joined the channel

#troubleshooting


@saurabhd336: @saurabhd336 has joined the channel
@patidar.rahul8392: Hi all, I am trying to push hdfs data into a hybrid table. I have added the offline table in Pinot and am now trying to push the hdfs file. When I execute the final Hadoop jar command, it says pinot-plugins.tar.gz doesn't exist. Someone kindly suggest. Error: `File file:/home/rah/hybrid/staging/pinot-plugin.tar.gz doesn't exist`. I am attaching my config file. Here /user/hdfs is my hdfs location and /home/rah is the local location. P.S. if I give an hdfs location for staging and outputDir, then it gives the error `"Wrong FS:" hdfs://location-of-inputdir/filename.txt, expected: file:///`. @ken @elon.azoulay @slack1 @tingchen @npawar @fx19880617 @mayanks Kindly suggest.
  @elon.azoulay: Hi @patidar.rahul8392 are you using the gcs plugin? Or are you on s3?
  @fx19880617: I will take a look at the plugin jar issue for hdfs
  @fx19880617: This job is using hdfs not s3 I think
  @patidar.rahul8392: Hi @elon.azoulay I'm using hdfs
  @ken: A few issues with your job spec: 1. You need to use an `hdfs://` URI for your staging directory. 2. You need to use an `hdfs://` URI for your `outputDirURI`.
  @ken: And you need to have a `configs:` section inside of the pinot FS specs section, which has `hadoop.conf.path`. E.g. something like:
  @ken: ```
pinotFSSpecs:
  - # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/root/hadoop-ops/config/master/'
```
  @ken: Also it would be good to include the stack trace with the error message.
  @ken: I think if you don’t have the hadoop.conf.path set, then Pinot falls back to the default file system, which is why you get the errors about “wrong FS”
  @fx19880617: @patidar.rahul8392
  @patidar.rahul8392:
  @patidar.rahul8392: @fx19880617 @ken these are the complete log details when I am using a local path as the staging and output dir.
  @fx19880617: so I guess you start the job from your local machine; this means the hadoop job tries to add this URI into the dist cache: `/home/rah/hybrid/staging/pinot-plugins.tar.gz`
  @fx19880617: how do you submit the hadoop job?
  @patidar.rahul8392: Ok @fx19880617 `hadoop jar ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpechybrid.yaml`
  @fx19880617: this staging dir should be on hdfs as well I think
  @patidar.rahul8392: Ok @fx19880617 let me try
  @patidar.rahul8392: @ken @fx19880617 I have given the output and staging dirs as hdfs directories, same as I gave for the input directory; I just created new dirs in the same location and passed them in the config. And added one extra property, hadoop.conf.path: '/etc/Hadoop/conf/', where all my Hadoop configuration files are available, i.e. hadoop-site.xml, core-site.xml, etc. But it's still giving the same wrong FS error.
  @patidar.rahul8392: This is how my files look now. Kindly suggest @ken @fx19880617
  @ken: Your `hadoop.conf.path` is in the wrong section. You have it as part of the `file` specification, but it needs to be part of the `hdfs` specification.
  @ken: You should be able to remove the `file` scheme section from the `pinotFSSpecs` configuration
  @patidar.rahul8392: Error logs
  @patidar.rahul8392: Ok let me remove file section and retry
  @patidar.rahul8392: Thanks a lot @fx19880617 @ken @elon.azoulay It worked. :clap:
  @fx19880617: @ken huge thanks! we should document this in the FAQ
  @fx19880617: btw, what does your final config file look like? I want to compare it with the initial one
  @fx19880617: so I can update the documentation to make it clearer
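For anyone landing here later, a sketch of a job spec matching the fixes discussed above; all URIs and paths are placeholders, and this is not the poster's actual final config (which was never shared in the thread):
```
executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  extraConfigs:
    stagingDir: 'hdfs://namenode:8020/user/hdfs/staging/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://namenode:8020/user/hdfs/input/'
outputDirURI: 'hdfs://namenode:8020/user/hdfs/output/'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
```
The key points from the thread: the staging and output directories use `hdfs://` URIs, and `hadoop.conf.path` lives under the `hdfs` scheme in `pinotFSSpecs` (not under a `file` scheme), so Pinot does not fall back to the local file system.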
@bcbazevedo: @bcbazevedo has joined the channel
@ming.liu: @ming.liu has joined the channel
@kanleecarro: @kanleecarro has joined the channel
@hari.prasanna: @hari.prasanna has joined the channel
@machhindra.nale: Team, I added a new index and sortedColumn in the table config for a table that was already ingesting data from a Kafka stream. I used the "AddTable" command to update the index: `"jsonIndexColumns": ["entityMap"], "sortedColumn": ["metric"]`. I performed "Reload All Segments" in the UI. Is there any way to know if the indexing is complete?
  @g.kishore: check in the table page, reload status button
  @machhindra.nale:
  @g.kishore: ah, not sure why it's not supported for real-time tables @npawar ^^
  @npawar: this was from a contributor in open source. He’s only done it for offline.
  @npawar: @omkar.halikar14 is working on adding the realtime support
  @npawar: meanwhile, you can look at the status of indexing by going to the segment directory on the server instance and looking at metadata.properties
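For reference, both settings from the question live under `tableIndexConfig` in the table config; a sketch using the column names from the message above:
```
{
  "tableIndexConfig": {
    "jsonIndexColumns": ["entityMap"],
    "sortedColumn": ["metric"]
  }
}
```
Note that a reload can build the JSON index on existing segments, but (as far as I know) a sorted column only affects segments created after the change, since the row order of already-committed segments is fixed.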
@richard.hallier: @richard.hallier has joined the channel
@ken: I’m running into an issue when building segments with 0.7.1 that didn’t occur with 0.6.0, due to (I think) using a Unicode code point for my `multiValueDelimiter`
  @ken: The relevant bit of my job file is:
```
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    multiValueDelimiter: '\ufff0'
```
With 0.6.0 this works fine. With 0.7.1 I get:
```
shaded.com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `char` out of VALUE_STRING token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig["multiValueDelimiter"])
  at shaded.com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1442) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1216) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1126) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.deser.std.NumberDeserializers$CharacterDeserializer.deserialize(NumberDeserializers.java:448) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.deser.std.NumberDeserializers$CharacterDeserializer.deserialize(NumberDeserializers.java:405) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1719) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at shaded.com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1350) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at org.apache.pinot.spi.utils.JsonUtils.jsonNodeToObject(JsonUtils.java:117) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:88) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:199) ~[pinot-batch-ingestion-standalone-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_291]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
```
  @mayanks: I am guessing we moved to a newer version of jackson that is having trouble reading the delimiter into a char?
  @ken: Well, it’s OK if I use `multiValueDelimiter: 'a'`, but it’s not OK if I do something like `multiValueDelimiter: '\u0040'`. Where in the code is the job yaml file converted to a RecordReaderSpec?
  @mayanks: Check `IngestionJobLauncher.java`
  @mayanks: Assuming that you are using it
  @ken: Yes, thanks - working on a unit test to see if I can find the issue :)
  @mayanks: Cool, thanks
  @mayanks: Either there's a code change or a lib change that is not able to handle your delim.
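A standalone Jackson sketch of the failure mode (hypothetical class and field names, not Pinot code): Jackson's built-in `char` deserializer only accepts a single-character string, and a YAML single-quoted `'\ufff0'` reaches the config as the six literal characters `\ufff0` rather than one code point, so binding it into a `char`-typed `multiValueDelimiter` fails while `'a'` works.
```
import com.fasterxml.jackson.databind.ObjectMapper;

public class CharDelimiterRepro {
  // Hypothetical config class with a char-typed delimiter, mirroring the
  // reference chain shown in the stack trace above.
  public static class Config {
    private char multiValueDelimiter;
    public char getMultiValueDelimiter() { return multiValueDelimiter; }
    public void setMultiValueDelimiter(char c) { this.multiValueDelimiter = c; }
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();

    // A single-character string binds to char without trouble.
    Config ok = mapper.readValue("{\"multiValueDelimiter\": \"a\"}", Config.class);
    System.out.println("ok = " + ok.getMultiValueDelimiter());

    // The JSON value below is the six literal characters \ufff0 (what a YAML
    // single-quoted '\ufff0' turns into). Jackson rejects any string longer
    // than one character here with a MismatchedInputException.
    Config bad = mapper.readValue("{\"multiValueDelimiter\": \"\\\\ufff0\"}", Config.class);
    System.out.println("bad = " + bad.getMultiValueDelimiter());
  }
}
```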
@bowenwan: @bowenwan has joined the channel
@sharma.vinit: @sharma.vinit has joined the channel

#pinot-dev


@ken: When running `mvn clean install -DskipTests -Pbin-dist` on master, I got a failure: `on project pinot-jdbc-client: Some files do not have the expected license header.`
  @ken: The specific files were:
```
[INFO] Checking licenses...
[WARNING] Unknown file extension: /Users/kenkrugler/git/pinot-ken/pinot-clients/pinot-jdbc-client/.externalToolBuilders/Maven_Ant_Builder.launch
[WARNING] Missing header in: /Users/kenkrugler/git/pinot-ken/pinot-clients/pinot-jdbc-client/maven-eclipse.xml
```
Is this due to cruft in my filesystem, or some missing exclusions that ought to be there, or something else?
  @fx19880617: `mvn license:format`?
  @fx19880617: I think there are some ignored files without header?
  @ken: Yes - e.g. the `maven-eclipse.xml` looks like a generated file (not under source control). Same for the `.externalToolBuilders` directory

#getting-started


@prabha.cloud: @prabha.cloud has joined the channel

#fix_llc_segment_upload


@ssubrama: I think you have one major part unimplemented as yet. You should not be fetching the segments of a table when the periodic task starts. I am not sure if by that time, the controller leadership has been decided. Ideally, you should fetch this when the leadership is decided. Please chat with @jlli to understand this better and see if a callback can be registered with the lead controller manager.
@changliu: OK. I think a callback func will be the right solution
@ssubrama: It may be a bit tricky to set up, etc. You may need to introduce a registration and callback mechanism, perhaps scheduled in a thread (like helix does)
@changliu: I think if that is the case, I may need to open a new PR for this. For this PR, do you think it's OK to just fix the segments cached from the committing phase?
@changliu: After we add this callback registration, we can add the ZK access part to LLCRealtimeManager
@changliu: What do you think?
@ssubrama: That may be fine, but then on a controller restart, we will lose the cache, right?
@changliu: That’s right
@changliu: So we need a ZK scan
@changliu: But the ZK scan logic depends on the controller leadership change, i.e. the registration/callback
@changliu: So I want to separate these two first
@ssubrama: If you are ok with that in the short run, then you can check it in as is (after addressing some of the other comments) and put a TODO in front of the `setupTask` method noting that there is a race condition there, in that the controller leadership may not be decided by the time the method is called. In the next PR, you can fix it. Before that, you can also check with Jack how to get notified. Oh, another solution to this (without introducing callbacks) is to keep a boolean for whether the segment names still need to be downloaded. If the boolean is true, then download them (when the table is being processed) and initialize the queue. Otherwise, use the queue. I think this solution may work a little better since we don't download all the tables at the same time. We process a table, and then download the next one.
@changliu: :ok_hand:
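A rough sketch of the lazy-download idea from the message above (names here are invented for illustration, not the actual controller code): keep a per-table flag, and only fetch the segment names from ZK the first time the table is processed after a restart or leadership change.
```
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical per-table cache illustrating "download segment names on first use".
class PendingSegmentCache {
  private final Queue<String> pendingSegments = new ArrayDeque<>();
  private boolean needsDownload = true;

  // Called each time the periodic task processes this table.
  synchronized Queue<String> getPendingSegments(SegmentNameFetcher fetcher) {
    if (needsDownload) {
      // First touch after a restart or leadership change: rebuild the queue from ZK.
      pendingSegments.clear();
      pendingSegments.addAll(fetcher.fetchSegmentNamesFromZk());
      needsDownload = false;
    }
    return pendingSegments;
  }

  // E.g. invoked when this controller loses or regains leadership for the table.
  synchronized void invalidate() {
    needsDownload = true;
  }

  interface SegmentNameFetcher {
    List<String> fetchSegmentNamesFromZk();
  }
}
```
This avoids downloading segment names for all tables at once; each table's list is fetched only when that table is actually processed.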
@jlli: @jlli has joined the channel
@changliu: Hi @ssubrama, I just talked with @jlli about the leadership change callback. We can use `addPartitionLeader` and `removePartitionLeader`. But since they are partition based, the controller can receive multiple state transitions within a short period of time.
@jlli: one workaround is to add a sleep time and count the zk access request only once
@ssubrama: @jlli the need here is to get notified on mastership changes, and invalidate the cache (of bad segments).
@ssubrama: I am not aware of addPartitionLeader or removePartitionLeader. Are these callbacks already offered?
@jlli: for a single pinot controller, the mastership changes when a helix state transition is received; that's where `addPartitionLeader` / `removePartitionLeader` get called