#general


@zaikhan: Hi, I have started `PinotController`, `PinotBroker` and `PinotServer` using the `multi_stage_query_engine` git branch, but join queries are still not working. Do I need to do something else?
  @kharekartik: @walterddr
  @walterddr: Yes, there are some configurations that need to be set to enable it. I will create a new PR to enable it by default.
  @zaikhan: Another thing I noticed: these components start fine, but when I do *Build Project* in IntelliJ, some classes are not found, e.g. ```java: cannot find symbol
  symbol:   class Plan
  location: package org.apache.pinot.common.proto```
  @walterddr: Yes. Some of the code is generated, so you will have to run `mvn install` first and tag those directories as generated sources in IntelliJ.
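  For reference, a minimal sketch of that workflow (assuming the protobuf stubs live under `pinot-common`, as the `org.apache.pinot.common.proto` package in the error suggests):
  ```
  # Build once so the generated sources exist in target/generated-sources
  mvn install -DskipTests
  # Then in IntelliJ: right-click pinot-common/target/generated-sources
  # and choose "Mark Directory as" > "Generated Sources Root"
  ```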
  @walterddr: Thanks for the feedback. Will also include this as part of the PR with an instruction section
  @zaikhan: @walterddr Could you DM me the config that I need to enable? You could create the PR later.
  @walterddr: Yes. We created a quickstart, but it is still on my branch; you can give it a try.
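  For context, the enable flags that eventually shipped with the multi-stage engine look roughly like this (a sketch based on the later released docs; the exact keys may differ on the development branch):
  ```
  # broker/server config - enables the multi-stage (v2) query engine
  pinot.multistage.engine.enabled=true
  pinot.server.instance.currentDataTableVersion=4
  ```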
@haiylin: @haiylin has joined the channel
@jinal.panchal: Hello, I didn't quite get the concept of dimension columns in Pinot. If the columns already have well-defined datatypes, what's the significance of specifying the Pinot field specifications, like metricsField, dimensionFields, etc.?
  @mayanks: Metrics are things you count/sum/avg/etc. Dimensions are ones you slice/dice (filter/group) by.
  @mayanks: While that is the idea, Pinot does not enforce these as strict rules. Think of them as hints that let Pinot apply internal optimizations (for example, metrics may end up being stored without a dictionary, may have a different default null value, etc.)
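  For illustration, a minimal schema sketch showing the two spec lists (the schema and column names are hypothetical):
  ```
  {
    "schemaName": "events",
    "dimensionFieldSpecs": [
      { "name": "country", "dataType": "STRING" }
    ],
    "metricFieldSpecs": [
      { "name": "clicks", "dataType": "LONG" }
    ]
  }
  ```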
@ashutosh25apr: @ashutosh25apr has joined the channel
@ashutosh25apr: :wave: Hi everyone!
  @mayanks: Hi Ashutosh, welcome to the Pinot community.
  @mitchellh: Welcome!
@diogo.baeder: So, I just created a table with >40k rows, but with daily segments - 318 segments in total; not good, I want to roll up to monthly segments later - and defined a JSON index for my main columns, which contain dynamic data (data that just can't be defined as static columns). Even trying to brutalize this thing by querying all the data with a limit that surpasses the number of rows, I still get ~600ms queries! Geez, this thing is fast! :slightly_smiling_face:
  @mayanks: Yes, it is fast :slightly_smiling_face:. In your case, the data size seems small as well.
  @diogo.baeder: It's quite small, yes. 1 year of data, ~215 MB total size. It could easily fit a month of data for each segment - for larger regions of data for us this will be a good size.
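  For context, the JSON index mentioned above is declared per column in the table config; a minimal sketch (the column name is hypothetical):
  ```
  "tableIndexConfig": {
    "jsonIndexColumns": [ "dynamicData" ]
  }
  ```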
@diogo.baeder: Can't wait to test monthly rolled-up segments though. Might make things even better.
@rajat.taya: @rajat.taya has joined the channel
@ryan.persaud: @ryan.persaud has joined the channel
@mathieu.druart: Hi! This PR: removed the Pulsar plug-in from the Pinot build because of this issue: . Now that the issue is marked as closed, does anyone know if the plug-in will be added back to the build? Thank you!
  @mayanks: Pinot-pulsar plugin does exist already cc: @kharekartik
  @mayanks: `pinot-stream-ingestion/pinot-pulsar`
  @mathieu.druart: @mayanks yes, the plugin exists, but the assembly file doesn't add the plugin jar to the plugins folder (the lines are commented out):
  @mayanks: Hmm, I thought that was resolved. @kharekartik any insights?
  @mathieu.druart: we have to build a custom docker image to add the plugin for now
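  A sketch of that workaround (the base image tag and jar name are assumptions; the plugins path follows the standard image layout):
  ```
  FROM apachepinot/pinot:0.10.0
  # Copy a locally built Pulsar plugin jar into the plugins directory
  COPY pinot-pulsar-*-shaded.jar /opt/pinot/plugins/pinot-stream-ingestion/pinot-pulsar/
  ```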
@ysuo: Hi team, I have a question and don't know how to solve it. How can I extract `numOfStas.Policy` from a Kafka message and save it to a Pinot table field? When I use *transformFunction*, it doesn't work:
```
{
  "columnName": "stas_policy",
  "transformFunction": "jsonPathString(stats, '$.text_body.fields.numOfStas.Policy')"
}
```
*And a sample Kafka message is like this:*
```
{
  "name": "telemetry_signal_gfw_api_usage",
  "stats": {
    "text_body": {
      "fields": {
        "numOfStas": 0,
        "numOfStas.Policy": 21
      }
    }
  }
}
```
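One likely culprit: the key itself contains a dot, so JsonPath reads `.Policy` as a child accessor; bracket notation usually sidesteps that. A sketch of the transform config (untested against this exact payload):
```
{
  "columnName": "stas_policy",
  "transformFunction": "jsonPathString(stats, '$.text_body.fields[\"numOfStas.Policy\"]')"
}
```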

#random


@haiylin: @haiylin has joined the channel
@ashutosh25apr: @ashutosh25apr has joined the channel
@rajat.taya: @rajat.taya has joined the channel
@ryan.persaud: @ryan.persaud has joined the channel

#troubleshooting


@xuhongkun1103: Hi @xiangfu0, could you please help me fix this Presto issue in the workflow? Link:
  @xiangfu0: For the Presto fix, you need to make sure the pom changes for the pinot-spi, pinot-common, etc. modules are also reflected in pinot-spi-jdk8 and pinot-common-jdk8
  @xiangfu0: Those modules are under pinot-connectors/prestodb-pinot-dependencies
  @xuhongkun1103: @xiangfu0 Thanks for your prompt reply. Do you mean that if I add a dependency in pinot-common, I also have to add it to the pinot-common-jdk8 pom file?
  @xiangfu0: yes
  @xiangfu0: We made symlinks for the source code
  @xiangfu0: but not for the pom files
  @xiangfu0: so you need to keep the dependencies aligned on both sides
  @xiangfu0: We saw this issue when trying to release for both JDK 8 and JDK 11
  @xuhongkun1103: Got it, thx
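  One quick way to spot drift between the paired poms (a sketch; the paths follow the module layout described above):
  ```
  # Compare dependencies between the jdk11 module and its jdk8 twin
  diff pinot-common/pom.xml \
       pinot-connectors/prestodb-pinot-dependencies/pinot-common-jdk8/pom.xml
  ```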
@haiylin: @haiylin has joined the channel
@diogo.baeder: Hi folks! The docs don't say how to configure that; how can that be done?
  @mayanks:
  @diogo.baeder: I'm already creating inverted indexes, and I understand that if a column is sorted, Pinot will create a sorted index for it - but what about the "sorted inverted" ones? Do I just have to define a column as sorted before ingestion and give it an inverted index? There's no specific configuration for "sorted inverted", at least I didn't find one there.
  @mayanks: That is just saying that the sorted forward index also doubles as an inverted index. There isn't an additional sorted inverted index. It is a bit confusing - could we make it more readable, @mark.needham?
  @diogo.baeder: Ah, ok then. So if the column is already sorted I don't need to do anything, just ingest it, right?
  @mayanks: For real-time tables I recommend specifying it as sorted in the table config
  @diogo.baeder: It's for offline ingestion
  @mayanks: Ok, then if data is already sorted, that is enough
  @diogo.baeder: Ah, cool. Thanks man!
  @mark.needham: will edit the docs
  @mark.needham: but while I was understanding sorted indexes I wrote this -
  @diogo.baeder: I'll take a look, thanks man. I'll use for offline tables though.
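  For reference, the sorted-column declaration mentioned above is a one-liner in the table config (the column name is hypothetical):
  ```
  "tableIndexConfig": {
    "sortedColumn": [ "myColumn" ]
  }
  ```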
@ashutosh25apr: @ashutosh25apr has joined the channel
  @rblau: hello! we're trying to batch ingest segments into our pinot instance, but we are finding that some segments are in a bad state. the stack trace we see from the `debug/tables/{tablename}` endpoint is like so:
  ```
  java.lang.IllegalArgumentException: newLimit > capacity: (604 > 28)
  	at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
  	at java.base/java.nio.Buffer.limit(Buffer.java:346)
  	at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
  	at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
  	at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
  	at org.apache.pinot.segment.spi.memory.PinotByteBuffer.view(PinotByteBuffer.java:303)
  	at org.apache.pinot.segment.spi.memory.PinotDataBuffer.view(PinotDataBuffer.java:379)
  	at org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkSVForwardIndexReader.<init>(BaseChunkSVForwardIndexReader.java:97)
  	at org.apache.pinot.segment.local.segment.index.readers.forward.FixedByteChunkSVForwardIndexReader.<init>(FixedByteChunkSVForwardIndexReader.java:37)
  	at org.apache.pinot.segment.local.segment.index.readers.DefaultIndexReaderProvider.newForwardIndexReader(DefaultIndexReaderProvider.java:97)
  	at org.apache.pinot.segment.spi.index.IndexingOverrides$Default.newForwardIndexReader(IndexingOverrides.java:184)
  	at org.apache.pinot.segment.local.segment.index.column.PhysicalColumnIndexContainer.<init>(PhysicalColumnIndexContainer.java:166)
  	at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:181)
  	at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:121)
  	at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:91)
  	at org.apache.pinot.core.data.manager.offline.OfflineTableDataManager.addSegment(OfflineTableDataManager.java:52)
  	at org.apache.pinot.core.data.manager.BaseTableDataManager.addOrReplaceSegment(BaseTableDataManager.java:373)
  	at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:355)
  	at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:162)
  	at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  	at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)
  	at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)
  	at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)
  	at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  	at java.base/java.lang.Thread.run(Thread.java:829)
  ```
  @luisfernandez and I were wondering what this `capacity` value (28, according to the trace) might be? thanks!
  @richard892: hi, looks like an integer overflow
  @richard892: the default raw forward index format is v2, which only supports 2GB per column
  @richard892: you can try v3 or v4 which support larger sizes
  @luisfernandez: how to do that :smile:
  @luisfernandez: and also, what does it mean? Does it mean one of the values of the columns in noDictionaryColumns is just too big?
  @steotia: Is this column configured as a noDictionaryColumn ?
  @steotia: You can configure v3 as follows:
  ```
  "fieldConfigList": [
    {
      "encodingType": "RAW",
      "name": "columnName",
      "properties": {
        "deriveNumDocsPerChunkForRawIndex": "true",
        "rawIndexWriterVersion": "3"
      }
    }
  ]
  ```
  @luisfernandez: yes - well, most of them
  @luisfernandez: these columns are just counts
  @luisfernandez: ```"noDictionaryColumns": [ "click_count", "order_count", "impression_count", "cost", "revenue" ],```
  @steotia: Also, make sure to add the column to the `noDictionaryColumns` list in the indexingConfig section of the table config: ```"noDictionaryColumns": [ "columnName" ]``` Ideally it should not be needed in both places, but yeah, config cleanup is needed.
  @steotia: I think you just need to set up `fieldConfigList` then
  @steotia: What is the type of this column ?
  @luisfernandez: type is int
  @luisfernandez: for all those columns
  @luisfernandez: cool, thank you. We are just trying to understand what in particular caused that exception, because it's a new one to us.
  @steotia: The v3 format was introduced mainly because we were hitting the 2GB limit on STRING columns. Since you are hitting this on an INT column (4 bytes per value), it possibly means you have ~500 million rows (2GB / 4 bytes) in a single segment?
  @steotia: which may not necessarily be optimal
  @steotia: btw, v3 will work for both fixed- and variable-width columns.. I am just curious that there is a need to use it on INT / fixed-width columns
  @steotia: cc @richard892
  @rblau: :eyes: i think we’re seeing that generally the number of rows in our segments is around 200k, i’d be pretty surprised if one segment had >500mill rows
  @richard892: are any of these multi value?
  @luisfernandez: none of them
  @steotia: seems like a different problem to me then
  @steotia: in fact, the problem is happening during read / segment load, which potentially implies there is no need to bump the version from v2 to v3: if v3 were really needed, segment generation should have failed first, since the overflow would have produced a negative capacity (at least that's what I have seen in the past whenever there was a need to go from v2 to v3)
  @richard892: I will look in to this on Monday
@prashant.pandey: Hi team. What should be the value of `controller.host` in the controller config for a k8s deployment? I am deploying Pinot to a new env, and leaving this field empty results in an NPE during controller startup:
```
java.lang.NullPointerException: null
	at org.apache.pinot.common.utils.helix.HelixHelper.updateHostnamePort(HelixHelper.java:550) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
	at org.apache.pinot.controller.BaseControllerStarter.updateInstanceConfigIfNeeded(BaseControllerStarter.java:607) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
	at org.apache.pinot.controller.BaseControllerStarter.registerAndConnectAsHelixParticipant(BaseControllerStarter.java:583) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
	at org.apache.pinot.controller.BaseControllerStarter.setUpPinotController(BaseControllerStarter.java:382) ~[pinot-all-0.9.3-jar-with-dependencies.jar:0.9.3-e23f213cf0d16b1e9e086174d734a4db868542cb]
```
  @walterddr: you can probably add this to the config: ```pinot.set.instance.id.to.hostname=true```
  @prashant.pandey: Well this is set, but not sure why it’s not picking it up. Literally same configs in all other envs, and they work fine. I am probably making a very stupid mistake somewhere.
  @prashant.pandey: Here's my controller config:
  ```
  apiVersion: v1
  kind: ConfigMap
  metadata:
    annotations:
      : pinot
      : pinot-controller
      : kubernetes/configMap
      : stage-pinot
      : configMap pinot-controller
      : "false"
    name: pinot-controller
    namespace: pinot
  data:
    pinot-controller.conf: |-
      controller.helix.cluster.name=myenv
      controller.port=9000
      controller.data.dir=/tmp/controller
      controller.zk.str=apache-pinot-zookeeper-bitnami-headless.pinot.svc.cluster.local:2181
      pinot.set.instance.id.to.hostname=true
      pinot.set.instance.id.to.hostname=true
  ```
  @prashant.pandey: Oh, I see it’s repeated. Let me try deleting the duplicate line.
  @walterddr: did you restart the pod? it should automatically pick it up
  @prashant.pandey: Wow so I deleted that duplicate line and it picked it up.
  @prashant.pandey: I’ll recheck this. Not sure why duplicate configs are behaving like this. Might be a bug.
  @prashant.pandey: Thanks @walterddr
  @walterddr: np. glad i can help
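  For completeness, the alternative to the hostname flag is to set `controller.host` explicitly; on k8s that is typically the pod's headless-service DNS name (a sketch; the hostname is hypothetical):
  ```
  controller.host=pinot-controller-0.pinot-controller-headless.pinot.svc.cluster.local
  controller.port=9000
  ```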
@rajat.taya: @rajat.taya has joined the channel
@ryan.persaud: @ryan.persaud has joined the channel
@ryan.persaud: :wave: Hello, I am working through the QuickStart Tutorial, and I started pinot locally with the command `./bin/pinot-admin.sh QuickStart -type batch`. I can see a log entry for the table being added, and no obvious errors:
```
Adding offline table: baseballStats
Executing command: AddTable -tableConfigFile /var/folders/jv/g99n5jcj3hz0lbbf90gykcc40000gq/T/1651874628141/baseballStats_1651874628195.config -schemaFile /var/folders/jv/g99n5jcj3hz0lbbf90gykcc40000gq/T/1651874604715/baseballStats/baseballStats_schema.json -controllerProtocol http -controllerHost localhost -controllerPort 9000 -user null -password [hidden] -exec
```
but I do not see the table via the UI (please see screenshot). Is there an additional step that I need to take in order to see the table? Thanks! Not sure if it's relevant, but here is some version information: Java: `openjdk 11.0.15 2022-04-19`, pinot: `pinot-0.10.0`
  @xiaobing: if the cmd went well, the table should show up in UI
  @xiaobing: trying this on my side
  @ryan.persaud: Since I'm not seeing it, I'm guessing there was an issue adding the table. Is there anywhere else to check for logging? I looked in `logs/pinot-all.log` as well, but I see `44293 2022/05/06 16:03:48.195 INFO [BootstrapTableTool] [main] Adding offline table: baseballStats` and no errors/exceptions.
  @xiaobing: hmm.. just tried this quickstart on my side (but on the latest master branch), and things worked as expected; the logs are just emitted to the console
  @xiaobing: I can try it on `pinot-0.10.0` shortly
  @xiaobing: for a clean attempt, I downloaded the pinot-0.10.0 binary and ran the cmd again. It went well: the sample queries returned results, and the Pinot UI showed the table too. The logs are simply emitted to the console (pretty verbose actually):
  ```
  ➜ apache-pinot-0.10.0-bin bin/pinot-admin.sh QuickStart -type batch
  ...
  Query : select playerName, runs, homeRuns from baseballStats order by yearID limit 10
  Executing command: PostQuery -brokerProtocol http -brokerHost 192.168.0.101 -brokerPort 8000 -queryType sql -query select playerName, runs, homeRuns from baseballStats order by yearID limit 10
  Processed requestId=5,table=baseballStats_OFFLINE,segments(queried/processed/matched/consuming)=1/1/1/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=33,resSerMs=0,totalTimeMs=34,minConsumingFreshnessMs=-1,broker=Broker_192.168.0.101_8000,numDocsScanned=97889,scanInFilter=0,scanPostFilter=97919,sched=FCFS,threadCpuTimeNs(total/thread/sysActivity/resSer)=0/0/0/0
  requestId=5,table=baseballStats_OFFLINE,timeMs=40,docs=97889/97889,entries=0/97919,segments(queried/processed/matched/consuming/unavailable):1/1/1/0/0,consumingFreshnessTimeMs=0,servers=1/1,groupLimitReached=false,brokerReduceTimeMs=2,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);192.168.0.101_O=0,36,642,0,1,offlineThreadCpuTimeNs(total/thread/sysActivity/resSer):0/0/0/0,realtimeThreadCpuTimeNs(total/thread/sysActivity/resSer):0/0/0/0,query=select playerName, runs, homeRuns from baseballStats order by yearID limit 10

  playerName          runs  homeRuns
  Alfred L.           0     0
  Charles Roscoe      66    0
  Adrian Constantine  29    0
  Robert              9     0
  Arthur Algernon     28    0
  Douglas L.          28    2
  Francis Patterson   0     0
  Robert Edward       30    0
  Franklin Lee        13    0
  William             1     0
  ***************************************************
  You can always go to to play around in the query console
  ...
  ➜ apache-pinot-0.10.0-bin java -version
  openjdk version "11.0.11" 2021-04-20
  OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
  OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
  ```
  @ryan.persaud: Interesting, I don't think I'm getting the query output. Is it after all of the bootstrapping has completed?
  @xiaobing: yes, after the pinot components got started, table created and sample data ingested into the table.
  @ryan.persaud: Did you get an explicit message for the data being ingested into the table?
  @xiaobing: e.g. there were logs like:
  ```
  Executing command: AddTable -tableConfigFile /var/folders/_0/gctvc27x5795n3rb5zh52qm00000gn/T/1651877834470/baseballStats_1651877834521.config -schemaFile /var/folders/_0/gctvc27x5795n3rb5zh52qm00000gn/T/1651877789218/baseballStats/baseballStats_schema.json -controllerProtocol http -controllerHost localhost -controllerPort 9000 -user null -password [hidden] -exec
  Adding schema: baseballStats with override: true
  Added schema: baseballStats
  ...
  {"status":"Table baseballStats_OFFLINE succesfully added"}
  ...
  Uploading a segment baseballStats_OFFLINE_0 to table: baseballStats, push type SEGMENT, (Derived from API parameter)
  ...
  Added segment: baseballStats_OFFLINE_0 to IdealState for table: baseballStats_OFFLINE
  ...
  ```

#getting-started


@haiylin: @haiylin has joined the channel
@ashutosh25apr: @ashutosh25apr has joined the channel
@rajat.taya: @rajat.taya has joined the channel
@ryan.persaud: @ryan.persaud has joined the channel
@krishna080: @krishna080 has joined the channel

#introductions


@haiylin: @haiylin has joined the channel
@ashutosh25apr: @ashutosh25apr has joined the channel
@rajat.taya: @rajat.taya has joined the channel
@ryan.persaud: @ryan.persaud has joined the channel

#linen_dev


@slackbot: removed an integration from this channel:
@slackbot: removed an integration from this channel: