#general


@cgregor: @cgregor has joined the channel
@diogodssantos: @diogodssantos has joined the channel
@sumit.l: @sumit.l has joined the channel
@karinwolok1: :speaker: This conference is still accepting speaker submissions!!! Should be a good one. If you have a good story about your Apache Pinot use case, please submit here :speaker:
@greyson: Coming from a relational database perspective, I've had some difficulty conceptualizing what my data might look like in Pinot. Is the standard to have multiple tables like in an RDBMS and query them relationally using something like Presto, or should I strive to have fewer tables with more columns that remove the need for relational querying? If the latter is preferable, is that still the case when the table would have to contain many columns to replace the relational structure, and many of those columns would need to contain things like arrays or JSON objects?
  @bobby.richard: Definitely the latter
  @bobby.richard: I am new to Pinot as well, but from what I understand, wide, denormalized tables are the norm
  @tyler773: Thanks @bobby.richard! So it's preferable, within reason, to have some duplicated data, since table size in terms of rows is less of an issue for query speed than it would be in an RDBMS?
  @mayanks: Thanks @bobby.richard, yes that is the more common usage. Having said that, Pinot does support lookup joins (on dimension tables). And folks have also used the Presto/Trino connector for Pinot to do more complex queries (joins, nested queries, etc.)
  @mayanks: @tyler773 yes that is correct. Pinot is built for performance, and can scale very well with size of data (num rows, or otherwise)
  @greyson: And it's still best practice even when those columns become more complicated? Like would it be a problem to have an array column with 100 entries in it? What about 1,000? 10,000? Is it still preferable to have that data stored in a column at that point instead of in its own table and relationally joined? Or (and I assume this is not the right answer) is a middle-ground solution to just duplicate data across rows to avoid large array column values?
  @ken: Hi @greyson - due to how Pinot can use a dictionary to compress columnar data, “duplicate data across rows” typically doesn’t add a lot to the size of the table, or at least that’s been our experience with having denormalized tables.
  @greyson: So then, @ken, would it be a good idea to have multiple "rows" with duplicated data and a single value column instead of one row with an array column?
  @ken: If nothing else is changing but the value in that one column, then we use an MV (multi-value) column and have a single row.
  @ken: e.g. we have a column with the unique terms, derived from another column containing a blob of text. That's stored as an MV column, and we can easily query against those terms to filter to a subset of rows.
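  A minimal sketch of the kind of MV-column filter @ken describes, assuming a hypothetical table `documents` with an MV string column `terms` (Pinot treats a predicate on an MV column as matching when any of the row's values matches):
  ```
  -- matches rows whose MV column 'terms' contains the value 'pinot'
  SELECT id, terms
  FROM documents
  WHERE terms = 'pinot'
  LIMIT 10
  ```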
  @greyson: Our pipeline at present is that we have a single immutable data type represented in a base table, and then through multiple steps in our processing pipeline we add data to various tables that relate to the base/core table. When you say "If nothing else is changing but the value in that one column" are you implying that the rest of the columns should be largely immutable as well?
  @ken: If you have say two MV columns A & B, and you’ve collapsed multiple row values into those two columns, then you’ve lost the ability to filter to rows where column A = x and column B = y, since those values could have come from two different pre-collapsed rows. But it sounds like your use case is different, in that you’re adding additional attributes to a base row, thus there’s no row collapsing going on.
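  A tiny worked example of the row-collapse caveat above, with hypothetical values: if rows (A=1, B=10) and (A=2, B=20) are collapsed into one row with MV columns A=[1,2] and B=[10,20], then
  ```
  -- matches the collapsed row even though no original row had A = 1 AND B = 20:
  -- each predicate is satisfied by a value that came from a different source row
  SELECT * FROM t WHERE A = 1 AND B = 20
  ```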
  @greyson: Awesome, thanks for your input :slightly_smiling_face:
  @g.kishore: this is such an amazing thread. Thanks Ken!
@diogo.baeder: Just a random comment/praise: the Pinot open source community support is amazing! Thanks for that, guys! I'm looking forward to my next steps in using it in production :heart:
  @mayanks: Thanks so much for the kind words @diogo.baeder, would love to see you take your use case to production using Apache Pinot.
  @diogo.baeder: I'll make sure we have some sort of blog post or video or similar, on the matter. :slightly_smiling_face:
  @mayanks: That would be amazing :pray:
@ashish: Pinot does not support the “NOT” operator, and there is no regexp_not_like. So is there any way to do the equivalent of “NOT regexp_like(…,…)” at all in Pinot?
  @jackie.jxt: Not currently. We should add `NOT` operator support to pinot. Could you please file an issue about this?
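  For reference, this is the shape of query the request is about; per the thread it is not supported as of 0.8.0 (table, column, and pattern are placeholders):
  ```
  SELECT col FROM myTable
  WHERE NOT regexp_like(col, '^foo.*')
  ```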
@gqian3: Hi team, is there a Pinot query to find out the last ingest time of an offline table?
  @mayanks: You mean the time when the segment was pushed, or the max value of the time column?
  @mayanks: If the latter, you can just do `select max(timeCol)` in SQL
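  A minimal sketch of that query (table and column names are placeholders):
  ```
  SELECT MAX(timeCol) FROM myTable_OFFLINE
  ```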
  @gqian3: I mean when the segments are pushed.

#random


@cgregor: @cgregor has joined the channel
@diogodssantos: @diogodssantos has joined the channel
@sumit.l: @sumit.l has joined the channel

#troubleshooting


@cgregor: @cgregor has joined the channel
@diogodssantos: @diogodssantos has joined the channel
@sumit.l: @sumit.l has joined the channel
@nair.a: Hi team, we are doing a Pinot PoC for offline ingestion currently, and are facing an issue while ingesting a segment from S3 into Pinot.
```
2021/11/03 08:29:22.109 INFO [SegmentFetcherFactory] [HelixTaskExecutor-message_handle_thread] Segment fetcher is not configured for protocol: s3, using default
2021/11/03 08:29:22.109 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread] Caught exception while fetching segment from: to: /tmp/data/pinotSegments/mytable_OFFLINE/tmp-mytable_OFFLINE_2021091800_2021091800_0-90b8d75e-b2e8-4e4f-b115-36e5528c37cf/mytable_OFFLINE_2021091800_2021091800_0.enc
java.lang.IllegalStateException: PinotFS for scheme: s3 has not been initialized
        at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:518) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
        at org.apache.pinot.spi.filesystem.PinotFSFactory.create(PinotFSFactory.java:78) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
```
Following are our confs.
Server conf:
```
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=us-east-1
pinot.server.segment.fetcher.protocols=s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
Controller conf:
```
controller.data.dir=
controller.local.temp.dir=/tmp/pinot/
controller.enable.split.commit=true
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
  @adireddijagadesh: @nair.a you could use this link and check whether you configured the ingestion job correctly: If it's still occurring, can you please share the `ingestionJobSpec.yaml`?
  @kchavda: Not sure if it matters, but I see the following missing from the controller conf (I am running docker containers):
  ```
  pinot.role=controller
  controller.helix.cluster.name=PinotCluster
  controller.zk.str=pinot-zookeeper:2181
  controller.host=
  controller.port=9000
  ```
  @nair.a: Hey @adireddijagadesh, sharing the jobspec:
  ```
  executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
  jobType: SegmentCreationAndMetadataPush
  inputDirURI: ''
  outputDirURI: ''
  overwriteOutput: true
  pinotFSSpecs:
    - scheme: s3
      className: org.apache.pinot.plugin.filesystem.S3PinotFS
      configs:
        region: 'us-east-1'
  recordReaderSpec:
    dataFormat: 'parquet'
    className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
  tableSpec:
    tableName: 'my_table'
  pinotClusterSpecs:
    - controllerURI: ''
  pushJobSpec:
    pushParallelism: 2
    pushAttempts: 1
  ```
  @kchavda: You're able to hit S3 from that box? Using env to pass in access key and secret?
  @nair.a: Yes, we are able to access S3 from the server.
  @kchavda: I'm comparing what you've shared with working versions of my jobspec and conf files that read CSV files from S3. I noticed the jobspec is missing schemaURI and tableConfigURI under tableSpec, and the server conf is missing:
  ```
  pinot.server.netty.port=8098
  pinot.server.adminapi.port=8097
  pinot.server.instance.dataDir=/tmp/pinot-tmp/server/index
  pinot.server.instance.segmentTarDir=/tmp/pinot-tmp/server/segmentTars
  ```
  Not sure if these are directly causing the errors, but you can update and give it a shot.
  @nair.a: Hey @kchavda, we have the above configs in the server and controller. The ingestion completes successfully, but the status of the ingested segment shows as BAD, and upon checking the server logs we found this error. Will try to add the additional configs you mentioned.
  @nair.a: Hey @kchavda, still the same error. Can I ask how you are setting the AWS key and secret in the server conf?
  @adireddijagadesh: @nair.a You could set it in the controller config as:
  ```
  pinot.controller.storage.factory.s3.accessKey=****************LFVX
  pinot.controller.storage.factory.s3.secretKey=****************gfhz
  ```
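  Presumably the server side takes the same keys under the `pinot.server.` prefix; a sketch, assuming it mirrors the controller pattern above (not confirmed in this thread; placeholder values):
  ```
  pinot.server.storage.factory.s3.accessKey=<access-key>
  pinot.server.storage.factory.s3.secretKey=<secret-key>
  ```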
  @adireddijagadesh: Refer to this link for more info and different ways of setting it:
  @kchavda: I followed the tutorial and found it to be very helpful. I also passed the AWS key and secret when starting the containers (controller, broker, server, ingestion job):
  ```
  docker create -ti \
    --name pinot-server \
    --network=pinot-demo \
    --env AWS_ACCESS_KEY_ID= \
    --env AWS_SECRET_ACCESS_KEY= \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms32G -Xmx32G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log" \
    --mount type=bind,source=/opt/pinot,target=/tmp \
    apachepinot/pinot:0.8.0 StartServer \
    -zkAddress pinot-zookeeper:2181
  ```
  @nair.a: Okay, will try this. Currently we set the creds inside the server conf, but not in the controller conf.
@luisfernandez: I have been asking around, but is there any desire to make Pinot pagination work with GROUP BY? My current use case would kinda need pagination
  @g.kishore: yes.
  @g.kishore: this feature request is quite hot.. we will do it!
  @luisfernandez: Oh, this is great! Are there any plans in place, like timelines or whatnot, or not really? Just want to have a sense
  @g.kishore: Plan is to get it done by Jan..
  @g.kishore: Contributions welcome..

#pinot-dev


@ryan: @ryan has joined the channel
@atri.sharma: What's the process to update docs for new features?
@g.kishore: GitBook
@atri.sharma: Please point me to the link and I will get it done right away
@walterddr: @walterddr has joined the channel
@walterddr: Tip of master seems broken; looking into it unless someone else is already on it
@cgregor: @cgregor has joined the channel

#getting-started


@tyler773: Been trying to just start Pinot locally in a docker container. I'm using Pinot version `0.8.0` and `openjdk:11`, and I'm on a Mac. I'm trying to start the cluster using the pinot admin commands `StartZookeeper`, `StartController`, `StartBroker`, and `StartServer` as shown in the getting started guide. However, the controller inevitably goes down before I can start the broker and the server, with this error: `Expiring session 0x100080c84b20005, timeout of 30000ms exceeded`. Is there a way to avoid this?
  @g.kishore: Please check the jvm memory params
  @tyler773: @g.kishore will do, thank you!
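  One way to apply that advice in a docker-based setup like this is to pass explicit heap settings via `JAVA_OPTS`, as in the server example from #troubleshooting. A sketch, assuming the apachepinot/pinot image honors `JAVA_OPTS` this way and that the network name and heap sizes fit your environment:
  ```
  docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-controller \
    -e JAVA_OPTS="-Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
    apachepinot/pinot:0.8.0 StartController \
    -zkAddress pinot-zookeeper:2181
  ```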
@ryan: @ryan has joined the channel
@navi.trinity: @navi.trinity has joined the channel
@bowenwan: @bowenwan has joined the channel
@cgregor: @cgregor has joined the channel