#general


@fcolopera89: @fcolopera89 has joined the channel
@teo: @teo has joined the channel
@sleepythread: I am trying to start Pinot with HDFS as deep storage but am getting an error while starting the server ```bin/start-server.sh -zkAddress pinot1.plan:2181,pinot2.plan:2181,pinot3.plan:2181 -configFileName conf/server.conf``` and the server configs are ```pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
hadoop.conf.path=/local/hadoop/etc/hadoop/
pinot.server.storage.factory.hdfs.hadoop.conf.path=/local/hadoop/etc/hadoop/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.instance.dataDir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/data/PinotServer/index
pinot.server.instance.segmentTarDir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/data/PinotServer/segmentTar```
  @dlavoie: Hi, can you move this conversation to <#C011C9JHN7R|troubleshooting>? Also, error logs would be helpful to understand what is wrong.
  @sleepythread: Sorry, I didn't know there was such a channel. Will move it there, thanks
@sleepythread: In the UI documentation it's written ```-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs```
  @dlavoie: The documentation does mention that, but the `-Dplugins.include=pinot-hdfs` flag will deactivate all other plugins. Just configuring `-Dplugins.dir=/opt/pinot/plugins` will auto-scan all available plugins, including HDFS
  @sleepythread: ```[akashmis...@pinot1.mlan apache-pinot-incubating-0.6.0-bin]$ bin/start-server.sh -zkAddress pinot1.mlan:2181,pinot2.mlan:2181,pinot3.mlan:2181 -configFileName /home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/conf/server.conf -Dplugins.dir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/plugins
2021/04/13 14:53:14.235 ERROR [PinotAdministrator] [main] Error: "-Dplugins.dir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/plugins" is not a valid option``` When I add it on the command line, I get the error above.
  @dlavoie: The `-D` JVM flags must be passed through the `JAVA_OPTS` env variable
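For example, a minimal sketch using the paths from the commands above (not verbatim from the thread; adjust to your install):
```
export JAVA_OPTS="-Dplugins.dir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/plugins"
bin/start-server.sh -zkAddress pinot1.mlan:2181,pinot2.mlan:2181,pinot3.mlan:2181 -configFileName conf/server.conf
```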
@g.kishore: We're happy to see Pinot now listed on the ThoughtWorks Technology Radar as one of the top platforms to assess. This is a big accomplishment for the entire Pinot community. Thanks to everyone who helped us get there!
@gabuglc: @gabuglc has joined the channel
@sosyalmedya.oguzhan: Hello! We've tried to use AliCloud OSS (like S3 in Amazon) as Pinot deep storage. There is no pinot-oss deep storage plugin right now, but we were able to use OSS as Pinot deep storage using the Pinot HDFS file system plugin. We created documentation for that:
  @g.kishore: Thanks a lot for this contribution
@toasifmohammed: @toasifmohammed has joined the channel
@aaron: If I understand right, when I batch ingest a set of Parquet files, the job will create a segment for each Parquet file and then upload them all to Pinot. Is that right? If so, are there any guidelines about picking segment sizes for optimal query performance?
  @mayanks: Yes, all data is internally stored in Pinot’s columnar indexed format.
  @mayanks: You want to avoid a large number of tiny segments. If your data allows, a few hundred MB per segment is a good size
@aaron: Also, when I run the batch ingestion job I see some debug output about dictionary-encoding the columns, including numeric metric columns. Does that mean it's dictionary-encoding the data in Pinot's internal format? Say I'd like to compute averages and quantiles of these metrics grouped by different dimensions -- is dictionary encoding best for that, or should I disable it? Or is what I'm seeing not relevant to query performance?
  @mayanks: By default most columns are dictionary encoded, and that works well. It helps to disable it in certain cases, like strings with really high cardinality. For your case you can assume it works fine, unless you are seeing issues.
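For reference, disabling the dictionary for a column is a table-config change; a minimal sketch with a hypothetical high-cardinality column name:
```
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["userUuid"]
  }
}
```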
@karinwolok1: :mega: Just a reminder! :tada: *If anyone is interested in presenting at the Apache Pinot event series, please submit today!* Presentations will be scheduled in May, June, and July. Topics can be a variety of use cases, how-to's, your experiences working with Pinot, "getting started with X in Pinot", features, and connectors. There's really no limit on types of topics, and it can be a work in progress! Even if your use case isn't fully built out, many people might be interested to see what you are working on, what made you think of Pinot, what you were doing before, what led you here, what works for you and what doesn't, how you compared your options, comparisons of Pinot and other solutions, etc. *Feel free to reach out to me if you have questions!*
@tingchen: @jackie.jxt @npawar do you know whether *JSONPATHARRAY*(jsonField, 'jsonPath') can be used in a WHERE clause to find out if the array contains a certain value?
  @jackie.jxt: I think you need to use `JsonExtractScalar` with an array type to extract a MV field
  @tingchen: is there an example or syntax manual for this?
  @jackie.jxt: E.g. `where jsonExtractScalar(json, '$.a', 'STRING_ARRAY') = 'abc'`
  @jackie.jxt:
  @tingchen: I suppose the above feature cannot utilize the JSON index, right?
  @tingchen: probably good for medium or small use cases
  @tingchen: `where jsonExtractScalar(json, '$.a', 'STRING_ARRAY') = 'abc'` — does the expression mean the list contains the value `abc`?
  @jackie.jxt: Yes
  @jackie.jxt: The JSON index can be used to solve this problem
  @jackie.jxt:
  @tingchen: I am still a bit confused about which one to use in the WHERE clause: `jsonExtractScalar` or `JSON_MATCH`
  @jackie.jxt: If you have json index generated for the column, `JSON_MATCH` should be much faster
  @tingchen: got it. thanks.
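For reference, a side-by-side sketch on a hypothetical table `mytable` (the `JSON_MATCH` filter syntax follows the Pinot docs; note the doubled single quotes):
```
-- Evaluated per row, no index used:
SELECT count(*) FROM mytable
WHERE jsonExtractScalar(json, '$.a', 'STRING_ARRAY') = 'abc'

-- Can leverage the JSON index when one is configured on the column:
SELECT count(*) FROM mytable
WHERE JSON_MATCH(json, '"$.a"=''abc''')
```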
@aaron: If I've already created a table and batch ingested data, can I add a star-tree index after the fact or do I need to start from scratch?
  @g.kishore: You can add Star tree index later.. all indexes can be added dynamically
  @aaron: Thanks -- do I do that by updating the table config?
  @g.kishore: right
  @g.kishore: update the table config and invoke the reload segments API
  @jackie.jxt: You can refer to this doc:
  @jackie.jxt: Remember to set `enableDynamicStarTreeCreation` if you want to add a star-tree on the fly
  @aaron: Thanks!
  @aaron: In this case what does it mean to compute the star-tree on the fly?
  @aaron: Do I need to set `enableDynamicStarTreeCreation` in order to be able to update the table config and reload segments like Kishore said, or is this something different?
  @jackie.jxt: Yes, you need to set `enableDynamicStarTreeCreation` then server will generate the star-tree index configured in the table config
  @aaron: Thanks!
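Putting the two answers together, a rough sketch (column names and function pairs are hypothetical; the controller is assumed at localhost:9000):
```
"tableIndexConfig": {
  "enableDynamicStarTreeCreation": true,
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "browser"],
      "functionColumnPairs": ["SUM__clicks", "COUNT__*"],
      "maxLeafRecords": 10000
    }
  ]
}
```
Then trigger the reload:
```
curl -X POST "http://localhost:9000/segments/myTable_OFFLINE/reload"
```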
  @aaron: It looks like the reloadsegments API finished instantly -- should I expect it to take a while to reindex?
  @g.kishore: Yes.. there is a status API you can invoke to check the status
  @aaron: The table state API?
  @aaron: I see ```{ "state": "enabled" }```
  @aaron: Ok I think I got this working. At first I used the "reload" API which didn't seem to do anything. Then I tried the "reset" API and it did
@aaron: If I have SUM and COUNT in the star tree index's `functionColumnPairs`, will `AVG` implicitly be able to use the star tree index or do I need to put `AVG` in that list too?
  @jackie.jxt: You'll need to explicitly put `AVG` in that list
  @jackie.jxt: But very good point that we should be able to get the `AVG` with `SUM/COUNT`. Can you please submit a github issue for this?
  @aaron:
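In other words, until that optimization exists, with a hypothetical metric column `clicks` the pair list needs all three entries spelled out:
```
"functionColumnPairs": ["SUM__clicks", "COUNT__*", "AVG__clicks"]
```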
@karinwolok1: Welcome new :wine_glass: Pinot members! :wave: Tell us about yourselves! How'd you find the community? What are you working on? @toasifmohammed @gabuglc @sg @hochuen.wong @fcolopera89 @teo @ankitsultana @karthikbvnet @social.kangaroo.hop @alicelyu @ilchernenko @ravishankar.nair @raahulgupta07 @gaurav.madaan @shyam.m @vaibhav.sinha @rymurr @omkar.halikar14 @ricardo.bernardino @kulbir.nijjer @xysmiracle @sunilkumar.tc89 @wuwenw
@yupeng: hey, is there a plan to add a table creation module in the cluster management UI?
  @g.kishore: that's already supported, right?
  @yupeng: hmm, i did not find it on the UI...
  @g.kishore:
  @yupeng: oh.. under table.. thanks..
  @yupeng: is there a way to import schema from avro or json?
  @g.kishore: there is an admin tool but it's not hooked up in the UI
  @g.kishore: ```AvroSchemaToPinotSchema```
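Roughly, a sketch of invoking it from pinot-admin (flag names may differ by version; check the command's `-help` output before relying on them):
```
bin/pinot-admin.sh AvroSchemaToPinotSchema \
  -avroSchemaFile /path/to/events.avsc \
  -outputDir /tmp/pinotSchema \
  -pinotSchemaName events
```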
  @yupeng: got it. yeah, it'll be a cool feature to integrate it into the UI
  @g.kishore: the problem is we don't want to bring in dependencies on Avro, Thrift, Protobuf, Parquet, etc.. it took a long time to clean that up
  @g.kishore: it's hard to do it in a generic way without depending on them explicitly
  @yupeng: I see. Then how does the command work?
  @yupeng: How does it get the dependency?
  @g.kishore: the command is in a different module, pinot-tools
  @yupeng: I see. Then it does need a separate web server for the UI to get around this issue
  @g.kishore: Or model it as an SPI

#random


@marta: Nice to see Pinot coming through in the ThoughtWorks radar!
@fcolopera89: @fcolopera89 has joined the channel
@teo: @teo has joined the channel
@gabuglc: @gabuglc has joined the channel
@toasifmohammed: @toasifmohammed has joined the channel

#feat-presto-connector


@teo: @teo has joined the channel

#feat-upsert


@teo: @teo has joined the channel

#troubleshooting


@fcolopera89: @fcolopera89 has joined the channel
@teo: @teo has joined the channel
@sleepythread: @sleepythread has joined the channel
@sleepythread: I am trying to add HDFS as deep storage and running following command ```bin/start-server.sh -zkAddress pinot1.plan:2181,pinot2.plan:2181,pinot3.plan:2181 -configFileName conf/server.conf``` Server configs are ```pinot.server.instance.enable.split.commit=true pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS hadoop.conf.path=/local/hadoop/etc/hadoop/ pinot.server.storage.factory.hdfs.hadoop.conf.path=/local/hadoop/etc/hadoop/ pinot.server.segment.fetcher.protocols=file,http,hdfs pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher pinot.server.instance.dataDir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/data/PinotServer/index pinot.server.instance.segmentTarDir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/data/PinotServer/segmentTar``` I am getting following error ```2021/04/13 15:00:49.231 INFO [MBeanRegistrar] [Start a Pinot [SERVER]] MBean HelixCallback:Change=MESSAGES_CONTROLLER,Key=PinotCluster.Server_10.10.211.27_8098,Type=PARTICIPANT has been registered. 2021/04/13 15:00:49.232 INFO [MBeanRegistrar] [Start a Pinot [SERVER]] MBean HelixCallback:Change=HEALTH,Key=PinotCluster.Server_10.10.211.27_8098,Type=PARTICIPANT has been registered. 2021/04/13 15:00:49.598 INFO [Reflections] [Start a Pinot [SERVER]] Reflections took 313 ms to scan 1 urls, producing 5 keys and 151 values 2021/04/13 15:00:49.645 ERROR [PinotFSFactory] [Start a Pinot [SERVER]] Could not instantiate file system for class org.apache.pinot.plugin.filesystem.HadoopPinotFS with scheme hdfs java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.HadoopPinotFS at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_275] at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_275] at org.apache.pinot.spi.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:80) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:268) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:239) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:220) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:53) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:74) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.SegmentFetcherAndLoader.<init>(SegmentFetcherAndLoader.java:71) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:316) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at 
org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] 2021/04/13 15:00:49.650 ERROR [StartServiceManagerCommand] [Start a Pinot [SERVER]] Failed to start a Pinot [SERVER] at 0.513 since launch java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.HadoopPinotFS at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:58) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:74) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.SegmentFetcherAndLoader.<init>(SegmentFetcherAndLoader.java:71) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:316) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.HadoopPinotFS at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_275] at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_275] at org.apache.pinot.spi.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:80) 
~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:268) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:239) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:220) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:53) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] ... 9 more```
  @dlavoie: You can pass the plugin dir config through the `JAVA_OPTS` env variable
  @sleepythread: Thank, when i tried to add -Dplugin.dir in start-server.sh script then i got the following error. ```2021/04/13 15:05:48.160 ERROR [StartServiceManagerCommand] [Start a Pinot [SERVER]] Failed to start a Pinot [SERVER] at 1.305 since launch java.lang.RuntimeException: java.lang.RuntimeException: Could not initialize HadoopPinotFS at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:58) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:74) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.SegmentFetcherAndLoader.<init>(SegmentFetcherAndLoader.java:71) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:316) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] Caused by: java.lang.RuntimeException: Could not initialize HadoopPinotFS at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:71) ~[pinot-hdfs-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:54) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] ... 
9 more Caused by: java.io.IOException: No FileSystem for scheme: hdfs at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:67) ~[pinot-hdfs-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:54) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] ... 9 more```
  @sleepythread: Looks like Pinot is not able to load the HDFS configs. ```hadoop.conf.path=/local/hadoop/etc/hadoop/
pinot.server.storage.factory.hdfs.hadoop.conf.path=/local/hadoop/etc/hadoop/```
  @sleepythread: Is there any issue with these configurations?
  @sosyalmedya.oguzhan: `-Dplugins.dir`, not `-Dplugin.dir`
  @sosyalmedya.oguzhan: also, you should not pass `-Dplugins.include`
  @sleepythread: ```export JAVA_OPTS="-Dplugins.dir=/home/akashmishra/hpgraph/apache-pinot-incubating-0.6.0-bin/plugins/"```
  @sleepythread: I am not using this, but still the same error.
  @sleepythread: ```Caused by: java.io.IOException: No FileSystem for scheme: hdfs```
  @sosyalmedya.oguzhan: can you try without `-Dplugins.include`?
  @sleepythread: I am not using -Dplugins.include anywhere. AFAIU, this is not set anywhere.
  @sosyalmedya.oguzhan: remove this config: ```hadoop.conf.path=/local/hadoop/etc/hadoop/``` You already set `pinot.server.storage.factory.hdfs.hadoop.conf.path`. This is not the cause of your problem, though.
  @sleepythread: Which one ?
  @sleepythread: The problem still persists.
  @dlavoie: Do you want to try the `pinot-admin.sh StartServer` instead?
  @sleepythread: Same results.
  @sleepythread: Let me turn on debug logging and come back with better information for you guys to help me :slightly_smiling_face:
  @sleepythread: ```2021/04/13 17:01:28.443 ERROR [PinotFSFactory] [Start a Pinot [SERVER]] Could not instantiate file system for class org.apache.pinot.plugin.filesystem.HadoopPinotFS with scheme hdfs java.lang.RuntimeException: Could not initialize HadoopPinotFS at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:71) ~[pinot-hdfs-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:54) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:74) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.SegmentFetcherAndLoader.<init>(SegmentFetcherAndLoader.java:71) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:316) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] Caused by: java.io.IOException: No FileSystem for scheme: hdfs at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:67) 
~[pinot-hdfs-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] ... 10 more 2021/04/13 17:01:28.447 ERROR [StartServiceManagerCommand] [Start a Pinot [SERVER]] Failed to start a Pinot [SERVER] at 1.134 since launch java.lang.RuntimeException: java.lang.RuntimeException: Could not initialize HadoopPinotFS at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:58) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.init(PinotFSFactory.java:74) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.SegmentFetcherAndLoader.<init>(SegmentFetcherAndLoader.java:71) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:316) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260) [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] Caused by: java.lang.RuntimeException: Could not initialize HadoopPinotFS at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:71) ~[pinot-hdfs-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:54) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] ... 
9 more Caused by: java.io.IOException: No FileSystem for scheme: hdfs at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170) ~[pinot-parquet-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.plugin.filesystem.HadoopPinotFS.init(HadoopPinotFS.java:67) ~[pinot-hdfs-0.6.0-shaded.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:54) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21] ... 9 more```
  @sleepythread: Still the same problem.
  @sleepythread: I have also added the HADOOP_HOME and HADOOP_CONF_DIR env variables: ```export HADOOP_HOME=/local/hadoop/
export HADOOP_CONF_DIR=/local/hadoop/etc/hadoop/```
@gabuglc: @gabuglc has joined the channel
@toasifmohammed: @toasifmohammed has joined the channel
@aaron: I just uploaded a lot of data to a new table, and when I try to `select * from foo limit 10` I get: "message": "MergeResponseError: responses for table: foo from servers: [10.20.67.239_O] got dropped due to data schema inconsistency.",
  @mayanks: The error you posted seems to indicate that not all segments have the same schema in your table. When you do `select *`, you are selecting all columns. However, your aggregation query may not be touching the column that is not present (or different) across segments, so you are not running into the issue.
  @mayanks: Did you change your schema?
  @aaron: I didn't change the schema -- I deleted the table, I created a new table with a schema, and uploaded new segments
  @aaron: How can I debug this to see which segments differ and how?
@aaron: I do seem to be able to perform aggregations over it though, like to take the average of a column
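One way to compare segments is the controller's segment metadata endpoint; a sketch, assuming the controller at localhost:9000 and a hypothetical segment name:
```
curl "http://localhost:9000/segments/foo_OFFLINE/foo_OFFLINE_0/metadata"
```
Diffing the column lists across segments should show which segment carries a different schema.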
@kevinv: Having issues adding a new streaming table in Pinot. I have added 3 new tables prior with status Good, but adding this new table called interactions_REALTIME shows a status Bad, and it doesn't seem to consume any new data. Any idea why this is the case? Also, looking at the logs, there are warning logs for this table being below the replica threshold and failing to find servers hosting the segment.
  @1705ayush: Hi @kevinv, can you also post the log of the AddTable execution and the table config files?
  @mayanks: Are there any warn/error during table creation on the controller/server logs?
  @kevinv: ```{ "tableName": "interactions", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "interactionunixtime", "timeType": "MILLISECONDS", "segmentPushType": "APPEND", "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy", "schemaName": "interactions", "replication": "0" }, "tenants": {}, "tableIndexConfig": { "loadMode": "MMAP", "streamConfigs": { "streamType": "solace", "stream.solace.consumer.type": "highLevel", "stream.solace.topic.name": "cw/jms/interactions/v1", "stream.solace.decoder.class.name": "org.apache.pinot.plugin.stream.solace.server.SolaceJSONMessageDecoder", "stream.solace.consumer.factory.class.name": "org.apache.pinot.plugin.stream.solace.server.SolaceStreamConsumerFactory", "stream.solace.jms.Host": "localhost", "stream.solace.jms.Username": "****", "stream.solace.jms.Password": "****", "stream.solace.jms.VPN": "solace-dev", "stream.solace.destinationType": "queue", "stream.solace.jms.ClientID": "dev-profile" } }, "metadata": { "customConfigs": {} } }```
  @kevinv: For adding the table, I'll see {"status":"Table interactions_REALTIME succesfully added"}
  @mayanks: Any logs in controller/server during table creation?
  @kevinv: I've only seen warn logs; here is what's being shown
  @kevinv: 2021/04/13 11:40:26.282 WARN [SegmentStatusChecker] [pool-7-thread-6] Segment interactions_REALTIME_1618330076604_0__0__1618330076699 of table interactions_REALTIME has no online replicas
2021/04/13 11:40:26.282 WARN [SegmentStatusChecker] [pool-7-thread-6] Table interactions_REALTIME has 1 segments with no online replicas
2021/04/13 11:40:26.282 WARN [SegmentStatusChecker] [pool-7-thread-6] Table interactions_REALTIME has 0 replicas, below replication threshold :1
2021/04/13 11:07:57.743 WARN [BaseInstanceSelector] [ClusterChangeHandlingThread] Failed to find servers hosting segment: interactions_REALTIME_1618330076604_0__0__1618330076699 for table: interactions_REALTIME (all ONLINE/CONSUMING instances: [] and OFFLINE instances: [] are disabled, counting segment as unavailable)
  @mayanks: This is probably from query execution?
  @mayanks: I am asking more around the time of table creation.
  @mayanks: Also, can you paste the Ideal State and External view?
  @kevinv: What do you mean by Ideal State and External View?
  @kevinv: Logs look fine during table creation, no errors or warnings besides what i mentioned earlier
  @mayanks:
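For reference, both views can be fetched from the controller REST API (a sketch assuming the controller at localhost:9000):
```
curl "http://localhost:9000/tables/interactions_REALTIME/idealstate"
curl "http://localhost:9000/tables/interactions_REALTIME/externalview"
```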
  @mayanks: Also curious if there's any difference between the tables that are working and the one that isn't?
  @kevinv: EXTERNAL VIEW ```{
  "id": "interactions_REALTIME",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "BUCKET_SIZE": "0",
    "IDEAL_STATE_MODE": "CUSTOMIZED",
    "INSTANCE_GROUP_TAG": "interactions_REALTIME",
    "MAX_PARTITIONS_PER_INSTANCE": "1",
    "NUM_PARTITIONS": "1",
    "REBALANCE_MODE": "CUSTOMIZED",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "SegmentOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "interactions_REALTIME_1618332793297_0__0__1618332793369": {
      "Server_135.113.208.194_7000": "ERROR"
    }
  },
  "listFields": {}
}```
IDEAL STATE ```{
  "id": "interactions_REALTIME",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "IDEAL_STATE_MODE": "CUSTOMIZED",
    "INSTANCE_GROUP_TAG": "interactions_REALTIME",
    "MAX_PARTITIONS_PER_INSTANCE": "1",
    "NUM_PARTITIONS": "1",
    "REBALANCE_MODE": "CUSTOMIZED",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "SegmentOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "interactions_REALTIME_1618332793297_0__0__1618332793369": {
      "Server_135.113.208.194_7000": "ONLINE"
    }
  },
  "listFields": {}
}```
  @kevinv: seems like there is an error in mapFields
  @mayanks: `"Server_135.113.208.194_7000": "ERROR"`
  @mayanks: The server hosting this segment is in ERROR state. There should be a log in the server indicating why it is in error state.
  @mayanks: Are you able to access the server log? If so, grep for `interactions_REALTIME_1618332793297_0__0__1618332793369` in that log
  @kevinv: yes, it's the pinotServer.log, correct?
  @mayanks: Are you running locally? If so, yes
  @mayanks: Can you grep `interactions_REALTIME_1618332793297_0__0__1618332793369` there?
  @kevinv: 2021/04/13 11:53:13.507 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Adding segment: interactions_REALTIME_1618332793297_0__0__1618332793369 to table: interactions_REALTIME
2021/04/13 11:53:13.537 INFO [FixedByteSVMutableForwardIndex] [HelixTaskExecutor-message_handle_thread] Allocating 20000000 bytes for: interactions_REALTIME_1618332793297_0__0__1618332793369:.unsorted.fwd
2021/04/13 11:53:13.770 INFO [FixedByteSVMutableForwardIndex] [HelixTaskExecutor-message_handle_thread] Allocating 20000000 bytes for: interactions_REALTIME_1618332793297_0__0__1618332793369:.unsorted.fwd
  @mayanks: Is that it? Also, no errors elsewhere?
  @kevinv: yes
  @mayanks: What's your log4j setting?
  @kevinv: defaults within the pinot binary
  @mayanks: The fact that you see segment in ERROR state implies that the server is unable to load that segment for some reason. What are your JVM settings?
  @mayanks: I am guessing server is running out of resources required to host this segment. But that should most definitely generate an error in the log.
  @kevinv: what JVM settings does Pinot start with if I initialized the instance using ./quick-start-batch.sh?
  @mayanks: That would probably be minimal.
  @mayanks: That is just a demo purpose script (and also for batch ingestion). Are you launching Pinot using that and creating new realtime tables on that demo cluster?
  @kevinv: yes
  @mayanks: What's the end goal for your exercise? If it is to setup a cluster to ingest sizable production size data, we might not be able to use quick-start.
  @mayanks: But before that, I'd like to understand why you are not getting any errors logged.
  @mayanks: The quick-start script is setting log4j as follows: ```if [ -z "$JAVA_OPTS" ] ; then
  ALL_JAVA_OPTS="-Xms4G -Dlog4j2.configurationFile=conf/quickstart-log4j2.xml"
else
  ALL_JAVA_OPTS=$JAVA_OPTS
fi```
  @mayanks: Can you play with it to increase the logging level to info or debug?
  @mayanks: And then delete and recreate the table?
  @kevinv: ok I will try that, also this cluster is only for testing/poc, its not that much data.
  @mayanks: Yeah, enabling the logging correctly will tell us what happened in the server.
  @kevinv: looking at the log4j for quickstart, seems like most of them are already set to info
  @mayanks: Are you seeing info messages in your logs though?
  @kevinv: yes
  @mayanks: Ah ok
  @mayanks: For some reason I thought you werent
  @kevinv: no, I was; I was just only grepping for warn logs
  @mayanks: Mind deleting and re-creating the table again and monitoring the logs
  @mayanks: In that case
  @mayanks: can you grep all occurrences of the segment name (not just warn)?
  @kevinv: that's what I did earlier, but there were no error logs
  @mayanks: Ok, then lets delete and recreate and see if there are errors this time
  @kevinv: no errors
  @mayanks: And still the same issue?
  @kevinv: yes same issue with the new table
  @mayanks: Can we do a quick zoom meeting?
  @kevinv: can you join my WebEx instead?
  @mayanks: Joining
  @mayanks: Just to update the thread: ```1. We deleted and recreated the table, this time with the entire cluster nuked, and did not see the issue.
2. Our current suspicion is that given that quick-start uses only 32M as Xmx, and several tables were created, the server ran out of resources.
3. This is using a custom connector (Solace), so that might have caused the logging issue.```
  @mayanks: Thanks @kevinv for trying Pinot, let us know if you need more help

#pinot-k8s-operator


@teo: @teo has joined the channel

#announcements


@teo: @teo has joined the channel

#getting-started


@teo: @teo has joined the channel

#feat-partial-upsert


@yupeng: sent out an invite

#fix-numerical-predicate


@amrish.k.lal: FYI: this is the change that I am looking at for keeping track of precomputed predicates on the broker side (the `QueryOptimizer.optimize` function for `PinotQuery`). Basically, if we can precompute the value of an `Expression` or `Function`, then `precomputed` will contain that value. This will help in adding type support as discussed, and also further predicate pruning and optimization in the future. Let me know if there are suggestions or alternate ideas for this. ```alal@alal-mn1 amrish-pinot-1 % git diff pinot-common/src/thrift/query.thrift
diff --git a/pinot-common/src/thrift/query.thrift b/pinot-common/src/thrift/query.thrift
index 98329f5ac..5445ce645 100644
--- a/pinot-common/src/thrift/query.thrift
+++ b/pinot-common/src/thrift/query.thrift
@@ -47,6 +47,7 @@ struct Expression {
   2: optional Function functionCall;
   3: optional Literal literal;
   4: optional Identifier identifier;
+  5: optional Literal precomputed;
 }

 struct Identifier {
@@ -67,4 +68,5 @@ union Literal {

 struct Function {
   1: required string operator;
   2: optional list<Expression> operands;
+  3: optional Literal precomputed;
 }```
@jackie.jxt: No, we should not have this `precomputed` field
@jackie.jxt: If the result can be pre-computed, we should either remove the predicate or short-circuit the query to directly return
@amrish.k.lal: ok, I was planning to do a two-pass expression-tree traversal to mark and remove, but let me see if it can be done in a single pass.
@amrish.k.lal: For a short-circuit return, we would need to know whether the predicate evaluates to false, so we would need to keep that information somewhere, right?
@amrish.k.lal: So for example if the query is `SELECT * from mytable WHERE intColumn = 5.5`, the predicate here will always evaluate to false and hence the query will become `SELECT * from mytable` .
@amrish.k.lal: If we don't want to store precomputed values, then one option may be to rewrite the query to `SELECT * from mytable WHERE FALSE` and then short-circuit the query if the predicate is `FALSE`? There could be more complicated cases such as `SELECT * FROM mytable WHERE FALSE AND (intColumn > 5 OR FALSE)`. I think this approach will avoid adding `precomputed` to `Expression`, `Function`, etc., while ensuring that we have the information that we need to short-circuit the query. Sounds ok?
@amrish.k.lal: The query `SELECT * from mytable WHERE FALSE` isn't a valid query, but it's just a temporary form that we are using for optimization, so it should be ok.
@jackie.jxt: You can remove the predicate only when it evaluates to `true` or it evaluates to `false` as a child under `AND`
  @amrish.k.lal: I don't think removing a predicate that evaluates to true will work in all cases. For example, if we remove `intColumn != 4.4` from `SELECT * FROM mytable WHERE intColumn != 4.4 OR intColumn = 5.5`, then the query will become `SELECT * FROM mytable WHERE intColumn = 5.5`, which is semantically incorrect. Also, note that `SELECT * FROM mytable WHERE intColumn = 5.5` will throw an exception on the server side since we are comparing intColumn with 5.5. I think we need a more generic mechanism.
  @jackie.jxt: The logic is quite simple, and we have similar logic on server side segment pruner
  @jackie.jxt: It's a recursive algorithm. Under `AND`, `true` can be removed, and `false` results in the whole `AND` being `false`; under `OR`, `false` can be removed, and `true` results in the whole `OR` being `true`; at the root level, `true` can be removed, and `false` results in an empty result
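As an illustration only (not Pinot's actual `FilterOptimizer` API), a minimal Java sketch of that recursion using a hypothetical tri-state result:
```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Tri-state outcome of simplifying a filter node.
enum Tri { TRUE, FALSE, UNKNOWN }

abstract class Filter {
  // Simplifies this subtree in place and reports whether it is now a constant.
  abstract Tri simplify();
}

// A leaf predicate; constantValue is UNKNOWN for a real column predicate,
// TRUE/FALSE when the broker could precompute it (e.g. intColumn = 5.5 -> FALSE).
class Leaf extends Filter {
  final Tri constantValue;
  Leaf(Tri constantValue) { this.constantValue = constantValue; }
  @Override Tri simplify() { return constantValue; }
}

class And extends Filter {
  final List<Filter> children = new ArrayList<>();
  @Override Tri simplify() {
    Iterator<Filter> it = children.iterator();
    while (it.hasNext()) {
      Tri child = it.next().simplify();
      if (child == Tri.TRUE) {
        it.remove();       // TRUE under AND is a no-op: drop it
      } else if (child == Tri.FALSE) {
        return Tri.FALSE;  // FALSE under AND makes the whole AND FALSE
      }
    }
    return children.isEmpty() ? Tri.TRUE : Tri.UNKNOWN;
  }
}

class Or extends Filter {
  final List<Filter> children = new ArrayList<>();
  @Override Tri simplify() {
    Iterator<Filter> it = children.iterator();
    while (it.hasNext()) {
      Tri child = it.next().simplify();
      if (child == Tri.FALSE) {
        it.remove();       // FALSE under OR is a no-op: drop it
      } else if (child == Tri.TRUE) {
        return Tri.TRUE;   // TRUE under OR makes the whole OR TRUE
      }
    }
    return children.isEmpty() ? Tri.FALSE : Tri.UNKNOWN;
  }
}

// At the root: TRUE means the WHERE clause can be dropped entirely; FALSE means
// the broker can short-circuit and return an empty result without hitting servers.
```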
@jackie.jxt: Short-circuit means we don't even need to send the query to the servers
@jackie.jxt: I'm thinking of adding this info (always `true` or `false`) into the return value of the `FilterOptimizer` to pass it back to the caller
@jackie.jxt: We can take small steps. To unblock the issue of parsing error, we can first only convert the integral value to a format without decimal part (`1.0` -> `1`)
  @amrish.k.lal: That's too narrow, and I'm not sure it solves the underlying problem. I agree we should take small steps, but while keeping the big picture in mind so that we are moving towards a complete solution that works :slightly_smiling_face:
@amrish.k.lal: Let's go with rewriting predicates to TRUE and FALSE, so that `SELECT * from mytable WHERE intColumn = 5.5` is rewritten to `SELECT * from mytable WHERE FALSE`. That will allow us to do everything, including short-circuiting, without adding `precomputed`.
@amrish.k.lal: I think this is a good solution and gives us a clean way forward. Later on, if someone comes up with a better solution, the existing code will be generic and clean enough to allow for easy refactoring.
@jackie.jxt: My question here is why do we need to send such query to the server?
@amrish.k.lal: We don't
@amrish.k.lal: :slightly_smiling_face:
@amrish.k.lal: if the predicate is known to be false, then we short-circuit as you mentioned earlier.
@jackie.jxt: You mean adding a special filter expression `FALSE`?
@amrish.k.lal: Yes sir, for example in the case of the query `SELECT * FROM mytable WHERE FALSE`; and if the query gets rewritten to `SELECT * FROM mytable WHERE TRUE`, then we drop the WHERE clause.
@jackie.jxt: Yeah, that works
@jackie.jxt: Currently we don't support this syntax, but I think it is valid SQL
@amrish.k.lal: Yes, I think in some databases (MySQL, if I am not mistaken) it's valid.
@amrish.k.lal: cool :slightly_smiling_face: I think we are on the same page @steotia ^^
@jackie.jxt: To summarize, we want to do the following: 1. Introduce `TRUE` and `FALSE` as valid predicates. 2. Implement the filter optimizer to rewrite values. 3. Short-circuit the query if the predicate is `FALSE`.
@jackie.jxt: I like the idea of adding the predicates `TRUE` and `FALSE` to pass around the information
@amrish.k.lal: Yes, but I would qualify point 1 slightly to read `Introduce TRUE and FALSE as valid predicates in the broker optimizer.` As a separate ticket item, we could add SQL querying support for queries that contain TRUE/FALSE in the WHERE clause.
@jackie.jxt: Sounds good
@amrish.k.lal: ok cool :slightly_smiling_face: