#general


@amolskh: @amolskh has joined the channel
@luanmorenomaciel: @luanmorenomaciel has joined the channel
@zac: @zac has joined the channel
@chundong.wang: I remember I was told that transform of post-aggregation (eg `DIV(SUM(subtotal), DISTINCTCOUNT(category_id))` ) would be supported in 0.6.0 but couldn’t find it in the . Does anyone know that?
  @chundong.wang: @jackie.jxt Would your changes (, ) enable post-aggregation to be transformed?
  @jackie.jxt: @chundong.wang Yes, these 2 should support post-aggregation (transform on aggregated values)
  @jackie.jxt: FYI, post-aggregation == transform on aggregated values, there is no concept of transform of post-aggregation
  @chundong.wang: ah
  @chundong.wang: got it
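For context, a post-aggregation query looks like the expression in the question, with a transform function applied on top of aggregated values. A rough sketch (the `orders` table and `store_id` column are hypothetical):
```
-- DIV is applied to the results of SUM and DISTINCTCOUNT, i.e. after aggregation
SELECT store_id,
       DIV(SUM(subtotal), DISTINCTCOUNT(category_id)) AS avg_subtotal_per_category
FROM orders
GROUP BY store_id
```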
@xiangyu.peng: @xiangyu.peng has joined the channel
@jxue: @here
@jxue: Hi Pinot folks, I am coming from the Apache Helix project. We would like to hold an Apache Helix meetup and would like to invite a presenter to talk about how Apache Pinot uses Helix. Is anyone interested in giving a talk about that?
@zxcware: Hi team, can tenants share hosts and servers, or should each have an exclusive set?
  @fx19880617: tenants don't share the same instances
  @fx19880617: they should be exclusive, unless you deploy multiple instances on the same host
  @zxcware: Got it. Thanks
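For reference, a table is pinned to a broker tenant and a server tenant in its table config, and instances tagged for that tenant serve it; a minimal sketch (the tenant names here are hypothetical):
```
"tenants": {
  "broker": "brokerTenantA",
  "server": "serverTenantA"
}
```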

#random


@amolskh: @amolskh has joined the channel
@luanmorenomaciel: @luanmorenomaciel has joined the channel
@zac: @zac has joined the channel
@xiangyu.peng: @xiangyu.peng has joined the channel

#troubleshooting


@humengyuk18: Anyone know the reason for this warning, `HK2 service reification failed for [javax.servlet.ServletConfig] with an exception`? Is this just a warning we can safely ignore?
  @fx19880617: you can safely ignore those
  @fx19880617: this happens when you open the ui
  @humengyuk18: thanks
@humengyuk18: ```WARNING: HK2 service reification failed for [javax.servlet.ServletConfig] with an exception: MultiException stack 1 of 2 java.lang.NoSuchMethodException: Could not find a suitable constructor in javax.servlet.ServletConfig class. at org.glassfish.jersey.inject.hk2.JerseyClassAnalyzer.getConstructor(JerseyClassAnalyzer.java:168) at org.jvnet.hk2.internal.Utilities.getConstructor(Utilities.java:156) at org.jvnet.hk2.internal.ClazzCreator.initialize(ClazzCreator.java:105) at org.jvnet.hk2.internal.ClazzCreator.initialize(ClazzCreator.java:156) at org.jvnet.hk2.internal.SystemDescriptor.internalReify(SystemDescriptor.java:716) at org.jvnet.hk2.internal.SystemDescriptor.reify(SystemDescriptor.java:670) at org.jvnet.hk2.internal.ServiceLocatorImpl.reifyDescriptor(ServiceLocatorImpl.java:441) at org.jvnet.hk2.internal.ServiceLocatorImpl.narrow(ServiceLocatorImpl.java:2287) at org.jvnet.hk2.internal.ServiceLocatorImpl.igdCacheCompute(ServiceLocatorImpl.java:1163) at org.jvnet.hk2.internal.ServiceLocatorImpl.access$400(ServiceLocatorImpl.java:105) at org.jvnet.hk2.internal.ServiceLocatorImpl$8.compute(ServiceLocatorImpl.java:1157) at org.jvnet.hk2.internal.ServiceLocatorImpl$8.compute(ServiceLocatorImpl.java:1154) at org.glassfish.hk2.utilities.cache.internal.WeakCARCacheImpl.compute(WeakCARCacheImpl.java:105) at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetDescriptor(ServiceLocatorImpl.java:1237) at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetInjecteeDescriptor(ServiceLocatorImpl.java:558) at org.jvnet.hk2.internal.ServiceLocatorImpl.getInjecteeDescriptor(ServiceLocatorImpl.java:567) at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.lambda$new$0(ContextInjectionResolverImpl.java:81) at org.glassfish.jersey.internal.util.collection.Cache$OriginThreadAwareFuture.lambda$new$0(Cache.java:169) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.glassfish.jersey.internal.util.collection.Cache$OriginThreadAwareFuture.run(Cache.java:225) at org.glassfish.jersey.internal.util.collection.Cache.apply(Cache.java:77) at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.resolve(ContextInjectionResolverImpl.java:95) at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.resolve(ContextInjectionResolverImpl.java:121) at org.glassfish.jersey.server.internal.inject.DelegatedInjectionValueParamProvider.lambda$getValueProvider$0(DelegatedInjectionValueParamProvider.java:67) at org.glassfish.jersey.server.spi.internal.ParamValueFactoryWithSource.apply(ParamValueFactoryWithSource.java:50) at org.glassfish.jersey.server.spi.internal.ParameterValueHelper.getParameterValues(ParameterValueHelper.java:64) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$AbstractMethodParamInvoker.getParamValues(JavaResourceMethodDispatcherProvider.java:109) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80) at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253) at 
org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) at org.glassfish.jersey.internal.Errors.process(Errors.java:292) at org.glassfish.jersey.internal.Errors.process(Errors.java:274) at org.glassfish.jersey.internal.Errors.process(Errors.java:244) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232) at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:679) at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:353) at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:200) at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569) at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549) at java.lang.Thread.run(Thread.java:748) MultiException stack 2 of 2 java.lang.IllegalArgumentException: Errors were discovered while reifying SystemDescriptor( implementation=javax.servlet.ServletConfig contracts={javax.servlet.ServletConfig} scope=org.glassfish.jersey.process.internal.RequestScoped qualifiers={} descriptorType=CLASS descriptorVisibility=NORMAL metadata= rank=0 loader=null proxiable=null proxyForSameScope=null analysisName=null id=178 locatorId=0 identityHashCode=308687884 reified=false) at org.jvnet.hk2.internal.SystemDescriptor.reify(SystemDescriptor.java:681) at org.jvnet.hk2.internal.ServiceLocatorImpl.reifyDescriptor(ServiceLocatorImpl.java:441) at org.jvnet.hk2.internal.ServiceLocatorImpl.narrow(ServiceLocatorImpl.java:2287) at org.jvnet.hk2.internal.ServiceLocatorImpl.igdCacheCompute(ServiceLocatorImpl.java:1163) at org.jvnet.hk2.internal.ServiceLocatorImpl.access$400(ServiceLocatorImpl.java:105) at org.jvnet.hk2.internal.ServiceLocatorImpl$8.compute(ServiceLocatorImpl.java:1157) at org.jvnet.hk2.internal.ServiceLocatorImpl$8.compute(ServiceLocatorImpl.java:1154) at org.glassfish.hk2.utilities.cache.internal.WeakCARCacheImpl.compute(WeakCARCacheImpl.java:105) at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetDescriptor(ServiceLocatorImpl.java:1237) at org.jvnet.hk2.internal.ServiceLocatorImpl.internalGetInjecteeDescriptor(ServiceLocatorImpl.java:558) at org.jvnet.hk2.internal.ServiceLocatorImpl.getInjecteeDescriptor(ServiceLocatorImpl.java:567) at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.lambda$new$0(ContextInjectionResolverImpl.java:81) at org.glassfish.jersey.internal.util.collection.Cache$OriginThreadAwareFuture.lambda$new$0(Cache.java:169) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.glassfish.jersey.internal.util.collection.Cache$OriginThreadAwareFuture.run(Cache.java:225) at org.glassfish.jersey.internal.util.collection.Cache.apply(Cache.java:77) at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.resolve(ContextInjectionResolverImpl.java:95) at org.glassfish.jersey.inject.hk2.ContextInjectionResolverImpl.resolve(ContextInjectionResolverImpl.java:121) at org.glassfish.jersey.server.internal.inject.DelegatedInjectionValueParamProvider.lambda$getValueProvider$0(DelegatedInjectionValueParamProvider.java:67) at org.glassfish.jersey.server.spi.internal.ParamValueFactoryWithSource.apply(ParamValueFactoryWithSource.java:50) at org.glassfish.jersey.server.spi.internal.ParameterValueHelper.getParameterValues(ParameterValueHelper.java:64) at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$AbstractMethodParamInvoker.getParamValues(JavaResourceMethodDispatcherProvider.java:109) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391) at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80) at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) at org.glassfish.jersey.internal.Errors.process(Errors.java:292) at org.glassfish.jersey.internal.Errors.process(Errors.java:274) at org.glassfish.jersey.internal.Errors.process(Errors.java:244) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232) at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:679) at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:353) at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:200) at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569) at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549) at java.lang.Thread.run(Thread.java:748)```
@amolskh: @amolskh has joined the channel
@anshu.jalan: @anshu.jalan has joined the channel
@varun.srivastava: @varun.srivastava has joined the channel
@varun.srivastava: Hi @yupeng
@varun.srivastava: I was going through the doc . I have 2 queries:
@varun.srivastava: 1. Can a normal (non-upsert) table have a primary key? In that case, is defining `"primaryKeyColumns": ["event_id"]` in the schema enough? 2. For an upsert table with a composite primary key like `"primaryKeyColumns": ["event_id", "eventName"]`, where the Kafka partition key is just event_id (made of only one primary key field), is that fine?
  @yupeng: 1. Yes, it is ok to have pk defined. In fact pk can be used for other purposes like join. 2. Yes, it’s fine to have coarser grained partitions
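A minimal sketch of the two pieces discussed above, assuming a schema named `sch_events` (everything except `primaryKeyColumns` is illustrative):
```
{
  "schemaName": "sch_events",
  "primaryKeyColumns": ["event_id", "eventName"],
  ...
}
```
For the upsert case, the REALTIME table config additionally carries an upsert block such as `"upsertConfig": {"mode": "FULL"}`.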
@neer.shay: Hi, is there a way to configure the s3 endpoint in the plugin ()? The doc only describes configuration for region, access keys, and ACL
  @fx19880617: what do you mean by s3 endpoint?
  @fx19880617: from the code, there is a config you can set as `endpoint`
  @neer.shay: can you share the link please?
  @fx19880617:
  @fx19880617: updated this doc
  @neer.shay: thanks
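For reference, a rough sketch of how the `endpoint` setting can sit in the controller config alongside the region/key settings already in the doc (the exact property prefix and the URL are assumptions to verify against your deployment; server and minion configs take the analogous prefixes):
```
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.storage.factory.s3.endpoint=http://my-object-store:9000
```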
@pabraham.usa: Hello, what are the basic steps to troubleshoot a cluster? My cluster status sometimes shows Bad in the UI and recovers quickly. However, search and ingestion are working, there are no issues with CPU or memory, and all logs look OK other than a few errors due to bad queries. So how do I check whether everything is all right?
  @wrbriggs: Real-time segments will sometimes bounce around in Bad status in the UI, especially if you have multiple replicas.
  @wrbriggs: That's been my experience, anyway
  @wrbriggs: I found this open issue, so I assumed it wasn't worth worrying about:
  @pabraham.usa: Thanks. You are correct, I do have multiple replicas and am using real-time. Great to know that this is normal behavior, and good that a PR is already there.
  @g.kishore: That’s right , it’s a minor bug in the UI logic
  @pabraham.usa: @g.kishore I can also see that memory usage has been growing since my last restart. I'd expect Pinot to trigger a GC, or is it mmap? This is my graph:
  @pabraham.usa:
  @g.kishore: It’s mmap
  @g.kishore: What are your GC settings?
  @pabraham.usa: ```jvmOpts: "-Xms512M -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/dev/stdout -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=1 "```
  @pabraham.usa: Pod mem is 26G
  @g.kishore: How are you getting memory usage?
  @g.kishore: That graph looks like system usage not jvm
  @pabraham.usa: from kubernetes metrics
  @pabraham.usa: That's correct, the pod is only running a single Pinot server
  @g.kishore: That’s ok - OS will manage pod memory
  @g.kishore: Jvm will be under 4 gb
  @pabraham.usa: ohh ok, makes sense. But the memory increase is normal behavior, right? And it will be purged by the OS at some point?
  @g.kishore: Yes
  @pabraham.usa: Great Thanks :+1: , will monitor and see how it goes.
  @g.kishore: Yes, OS is good at managing it...
@luanmorenomaciel: @luanmorenomaciel has joined the channel
@zac: @zac has joined the channel
@xiangyu.peng: @xiangyu.peng has joined the channel

#pinot-k8s-operator


@luanmorenomaciel: @luanmorenomaciel has joined the channel

#pinot-dev


@luanmorenomaciel: @luanmorenomaciel has joined the channel
@luanmorenomaciel: Hi folks, I'm trying to run a ingestion task from Kafka but not getting any output message from the *bin/pinot-admin.sh,* this is what I'm doing *events coming from kafka* ```{ "user_id": 17611, "uuid": "469fe40e-84cf-482c-ba73-fe722596f7bc", "first_name": "Christina", "last_name": "Jones", "date_birth": "1954-03-09", "city": "Thomastown", "country": "Honduras", "company_name": "Nelson, Kline and Munoz", "job": "Drilling engineer", "phone_number": "", "last_access_time": "1994-04-08T07:32:19", "time_zone": "America/Montevideo", "dt_current_timestamp": "2021-01-20 12:05:53.219255" }``` *schema definition* ```{ "schemaName": "sch_users_json", "dimensionFieldSpecs": [ { "name": "user_id", "dataType": "INT" }, { "name": "uuid", "dataType": "STRING", "singleValueField": false }, { "name": "first_name", "dataType": "STRING" }, { "name": "last_name", "dataType": "STRING" }, { "name": "date_birth", "dataType": "STRING" }, { "name": "city", "dataType": "STRING" }, { "name": "country", "dataType": "STRING", "singleValueField": false }, { "name": "phone_number", "dataType": "STRING", "singleValueField": false }, { "name": "last_access_time", "dataType": "STRING", "singleValueField": false }, { "name": "time_zone", "dataType": "STRING", "singleValueField": false } ], "timeFieldSpec": { "incomingGranularitySpec": { "timeType": "MILLISECONDS", "timeFormat": "EPOCH", "dataType": "LONG", "name": "dt_current_timestamp" } } }``` *task creation*
@luanmorenomaciel: ```{ "tableName": "realtime_users_json_events", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "mergedTimeMillis", "timeType": "MILLISECONDS", "retentionTimeUnit": "DAYS", "retentionTimeValue": "60", "schemaName": "sch_users_json", "replication": "1", "replicasPerPartition": "1" }, "tenants": {}, "tableIndexConfig": { "loadMode": "MMAP", "invertedIndexColumns": [ "city", "country" ], "streamConfigs": { "streamType": "kafka", "stream.kafka.consumer.type": "lowlevel", "stream.kafka.topic.name": "src-app-users-json", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.broker.list": "127.0.0.1:9094", "realtime.segment.flush.threshold.time": "3600000", "realtime.segment.flush.threshold.size": "50000", "stream.kafka.consumer.prop.auto.offset.reset": "smallest" } }, "metadata": { "customConfigs": {} } }``` I'm connecting on pinot-controller and executing the following command, but getting any results ```root@pinot-controller-0:/opt/pinot# bin/pinot-admin.sh AddTable \ > -schemaFile /opt/pinot/sch_users_json.json \ > -tableConfigFile /opt/pinot/realtime_users_json_events.json \ > -exec root@pinot-controller-0:/opt/pinot#```
@npawar: A few things could be the issue: 1. Your table says “mergedTimeMillis” as the time column, but I don't see that in the schema or data. 2. The timeColumn fieldSpec in the schema looks incorrect. You've specified EPOCH millis, but dt_current_timestamp looks like it's in a simple date time format. Also, we recommend using dateTimeFieldSpec now instead of timeFieldSpec. Refer to this to configure your dateTimeFieldSpec correctly
@npawar: I wonder why you don't see any messages after you run that command, though. Can you check the pinotController.log?
@luanmorenomaciel: doing that now @npawar thank you for stepping in, we're validating Pinot over Druid
@luanmorenomaciel: let me check now
@luanmorenomaciel: ```2021/01/20 22:51:03.534 INFO [PinotTableRestletResource] [grizzly-http-server-0] Cannot find valid fieldSpec for timeColumn: mergedTimeMillis from the table config: realtime_users_json_events_REALTIME, in the schema: sch_users_json exception: Cannot find valid fieldSpec for timeColumn: mergedTimeMillis from the table config: realtime_users_json_events_REALTIME, in the schema: sch_users_json```
@luanmorenomaciel: you hit the nail on the head, let me fix it here
@luanmorenomaciel: @npawar this is the new schema definition ```{ "schemaName": "sch_users_json", "dimensionFieldSpecs": [ { "name": "user_id", "dataType": "LONG" }, { "name": "uuid", "dataType": "STRING" }, { "name": "first_name", "dataType": "STRING" }, { "name": "last_name", "dataType": "STRING" }, { "name": "date_birth", "dataType": "STRING" }, { "name": "city", "dataType": "STRING" }, { "name": "country", "dataType": "STRING" }, { "name": "phone_number", "dataType": "STRING" }, { "name": "last_access_time", "dataType": "STRING" }, { "name": "time_zone", "dataType": "STRING" } ], "dateTimeFieldSpec": { "incomingGranularitySpec": { "name": "dt_current_timestamp", "dataType": "STRING", "format": "SIMPLE_DATE_FORMAT" } } }```
@npawar: the dateTimeFieldSpec looks incorrect
@luanmorenomaciel: can you send me an example please?
@npawar: ```"dateTimeFieldSpecs": [ { "name": "millisSinceEpoch", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "15:MINUTES" }, { "name": "hoursSinceEpoch", "dataType": "INT", "format": "1:HOURS:EPOCH", "granularity": "1:HOURS" }, { "name": "dateString", "dataType": "STRING", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" } ]```
@npawar: in your case, you’ll use the 3rd one from this array.
@npawar: but you’ll have to set your right simple date format
@luanmorenomaciel: got it ``` "dateTimeFieldSpecs": [{ "name": "dt_current_timestamp", "dataType": "STRING", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" }]```
@npawar: you'll need some more things after yyyy-MM-dd, right?
@npawar: `2021-01-20 12:05:53.219255`
@npawar: one sec
@npawar:
@luanmorenomaciel: for now I think this is gonna suffice! :slightly_smiling_face:
@luanmorenomaciel: just trying to run the task and I can adjust this later but I'll keep this in mind for sure
@npawar: i think it will fail, because the input data will have `2021-01-20 12:05:53.219255` and Pinot will try to match it with just `yyyy-MM-dd`
@luanmorenomaciel: hmmm, that means I need to adjust. Great, let me check now
@npawar: yyyy-MM-dd HH:mm:ss.SSSSSS
@npawar: try this
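Putting the two together, the corrected spec would look roughly like this (the `1:MILLISECONDS` unit and the granularity are assumptions; the key point is that the pattern matches the incoming values):
```
"dateTimeFieldSpecs": [{
  "name": "dt_current_timestamp",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSSSSS",
  "granularity": "1:MILLISECONDS"
}]
```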
@luanmorenomaciel: thank you for the that!! super appreciate ```21/01/20 23:17:37.817 WARN [PartitionCountFetcher] [grizzly-http-server-1] Could not get partition count for topic src-app-users-json org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata 2021/01/20 23:17:37.818 ERROR [PinotTableIdealStateBuilder] [grizzly-http-server-1] Could not get partition count for src-app-users-json org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata 2021/01/20 23:17:37.818 ERROR [PinotTableRestletResource] [grizzly-http-server-1] org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata java.lang.RuntimeException: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata```
@npawar: Pinot is not able to access the Kafka broker at 127.0.0.1:9094. Is that URL right?
@npawar: are you using docker? you might have to change the host name
@luanmorenomaciel: yeah, looking at that. Using minikube I can reach it through other apps; let me check and get back to you
@luanmorenomaciel: do I need to supply any other config than that one here? ``` "streamType": "kafka", "stream.kafka.consumer.type": "lowlevel", "stream.kafka.topic.name": "src-app-users-json", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.broker.list": "127.0.0.1:9094", "realtime.segment.flush.threshold.time": "3600000", "realtime.segment.flush.threshold.size": "50000", "stream.kafka.consumer.prop.auto.offset.reset": "smallest"```
@npawar: no, just change the hostname in this `"stream.kafka.broker.list": "127.0.0.1:9094",` to whatever your kafka broker process is named as
@luanmorenomaciel: perfect checking that now, we're super close!
@npawar: here’s an example for you: . This table config uses “kafka:9092” because kafka process was called “kafka”
@luanmorenomaciel: ```2021/01/20 23:26:15.116 ERROR [PinotTableRestletResource] [grizzly-http-server-0] Failed to fetch the offset for topic: src-app-users-json, partition: 0 with criteria: OffsetCriteria{_offsetType=SMALLEST, _offsetString='smallest'} java.lang.IllegalStateException: Failed to fetch the offset for topic: src-app-users-json, partition: 0 with criteria: OffsetCriteria{_offsetType=SMALLEST, _offsetString='smallest'} at org.apache.pinot.controller.helix.core.realtime.PinotLLCRealtimeSegmentManager.getPartitionOffset(PinotLLCRealtimeSegmentManager.java:643) ~[pinot-all-0.7.0-SNAPSHOT-root@pinot-controller-0:/opt/pinot#```
@luanmorenomaciel: in this case @npawar I'm using ClusterIP ``` "stream.kafka.broker.list": "edh-kafka-0.ingestion.svc.cluster.local:9094"```
@npawar: huh interesting. At least now it’s connecting to kafka.
@npawar: anything else in logs?
@npawar: can you try with “stream.kafka.consumer.prop.auto.offset.reset”: “largest” instead of smallest?
@luanmorenomaciel: actually I think it's my mistake!!! this is the correct address of kafka let me test now ```edh-kafka-brokers.ingestion.svc.Cluster.local:9092```
@luanmorenomaciel: boom seems that is rolling! :slightly_smiling_face: ```2021/01/20 23:30:03.665 INFO [AssignableInstanceManager] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] AssignableInstanceManager built AssignableInstances from scratch based on contexts in TaskDataCache due to Controller switch or ClusterConfig change. 2021/01/20 23:30:03.665 INFO [AssignableInstanceManager] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] Current quota capacity: {"Server_pinot-server-0.pinot-server-headless.datastore.svc.cluster.local_8098":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}},"Controller_pinot-controller-0.pinot-controller-headless.datastore.svc.cluster.local_9000":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}},"Minion_pinot-minion-0.pinot-minion-headless.datastore.svc.cluster.local_9514":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}},"Broker_pinot-broker-0.pinot-broker-headless.datastore.svc.cluster.local_8099":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}}} 2021/01/20 23:30:03.665 INFO [WorkflowControllerDataProvider] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] Event 0592e5c6_TASK : END: WorkflowControllerDataProvider.refresh() for cluster pinot, started at 1611185403620 took 45 for TASK pipeline 2021/01/20 23:30:03.665 INFO [Pipeline] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END ReadClusterDataStage for TASK pipeline for cluster pinot. took: 45 ms for event 0592e5c6_TASK 2021/01/20 23:30:03.665 INFO [Pipeline] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END ResourceComputationStage for TASK pipeline for cluster pinot. took: 0 ms for event 0592e5c6_TASK 2021/01/20 23:30:03.665 INFO [Pipeline] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END ResourceValidationStage for TASK pipeline for cluster pinot. took: 0 ms for event 0592e5c6_TASK 2021/01/20 23:30:03.665 INFO [Pipeline] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END CurrentStateComputationStage for TASK pipeline for cluster pinot. took: 0 ms for event 0592e5c6_TASK 2021/01/20 23:30:03.665 INFO [Pipeline] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END TaskSchedulingStage for TASK pipeline for cluster pinot. took: 0 ms for event 0592e5c6_TASK 2021/01/20 23:30:03.665 INFO [TaskPersistDataStage] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] START TaskPersistDataStage.process() 2021/01/20 23:30:03.665 INFO [TaskPersistDataStage] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END TaskPersistDataStage.process() for cluster pinot took 0 ms 2021/01/20 23:30:03.665 INFO [Pipeline] [HelixController-pipeline-task-pinot-(0592e5c6_TASK)] END TaskPersistDataStage for TASK pipeline for cluster pinot. took: 0 ms for event 0592e5c6_TASK 2021/01/20 23:30:03.666 INFO [AbstractAsyncBaseStage] [TaskJobPurgeWorker-pinot] START AsyncProcess: TASK::TaskGarbageCollectionStage```
@npawar: great!
@luanmorenomaciel: woot woot @npawar, kudos! If it was not for you I'm sure I would have spent countless hours on that. Super appreciate your effort and patience
@luanmorenomaciel: jackpot! :slightly_smiling_face:
@luanmorenomaciel: last question now that you're here @npawar: is it easy to use Avro? In this case, to work with Avro topics integrated with Schema Registry
@luanmorenomaciel: is there any place in the docs where you can point me to for this integration?
@npawar: yes it is easy to use avro. I believe many of the companies are using the AvroDecoders and schema registry setup
@npawar: let me look if we have a recipe for that
@npawar: cannot find an end to end integration recipe, but this sample table shows avro and schema related properties:
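For reference, a rough sketch of the streamConfigs changes for Avro with a Confluent Schema Registry, relative to the JSON table config above (the decoder class and registry property are as commonly documented for the kafka 2.0 plugin; the topic name and registry URL are placeholders, so verify against your Pinot version):
```
"stream.kafka.topic.name": "src-app-users-avro",
"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
"stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry.ingestion.svc.cluster.local:8081"
```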
@npawar: we can also connect you to folks who are using that
@luanmorenomaciel: @npawar I would love that; if you can forward a contact of someone using Schema Registry, we would really appreciate it
@npawar: @mayanks @elon.azoulay would you be able to help ^^ ?
@elon.azoulay: @elon.azoulay has joined the channel

#announcements


@luanmorenomaciel: @luanmorenomaciel has joined the channel

#pinot-perf-tuning


@ken: Is there a way to have an inverted index for a column, but not store the column data? So a pure filter-only field?
  @mayanks: Inv index has dict id to docIds mapping. You need dictionary to store values. This is the current implementation
  @g.kishore: Not yet, it's not hard to do this.. file an issue
  @ken: @g.kishore would it make sense to add a “noStorageColumns” config setting for tables?
  @g.kishore: yes, something along those lines
  @g.kishore: also, add some points on why this feature is important
  @g.kishore: is it purely storage on disk? Because Pinot will not read the forward index if it's never accessed in a query
  @mayanks: Just for my understanding, is this a request for sparse dictionary? I am missing something- don’t we need to have some storage for values to be able to reference them from queries?
  @mayanks: Oh, so have the inv index but not the fwd index
  @g.kishore: @mayanks there are three things Forward index, dictionary, inverted index
  @mayanks: Yeah got it
  @g.kishore: what @ken is asking for is not to store the forward index
  @mayanks: Yes. It would be great to see how much storage is being used for fwd index in your case @ken. Index_map file inside segment dir has that info.
  @ken: I must be missing something, given the above discussion :slightly_smiling_face: In Lucene, you can have a field in an index which only has the terms-to-docIdSet mappings, but without any stored data. Given what you said above, it sounds like the equivalent is to have the forward dictionary (so you have dict ids) and the inverted index (to map from a dict id to a set of doc ids), but no actual data, yes?
  @mayanks: What you are referring to as actual data maps to the fwd index. That also does not store the raw data; it is just the encoded dict ids per docId
  @mayanks: To add more detail: ```Dictionary: value to id map
Fwd index: for each docId - dictId
Inv index: for each dictId -> list of docIds.```
  @mayanks: With dictionary encoding, and bit encoding (10 bits can represent 2^10 unique values), you can get compression.
  @ken: We have a multi-valued field, so in that case the fwd index is what?
  @mayanks: So the question to you is: ```Are you trying to reduce storage cost? If so, the only thing you can eliminate is fwdIndex - let's check its size for your segments.```
  @ken: We’re blowing the 2gb limit for a column. So yes, I guess you’d call that a “storage cost” :slightly_smiling_face:
  @mayanks: for MV: You can think of docId - [list of dictIds]
  @mayanks: For that, you might want to reduce num docs per segment instead.
  @ken: When our next build succeeds (where we're increasing the number of segments) I can check the fwd index sizes
  @mayanks: Do you have star tree?
  @ken: Yes, though not with that column
  @mayanks: Ok, then I am curious to know the metadata for that column (cardinality, etc)
  @ken: We’ve found that having # of segments <= number of available server threads really helps our query performance, thus the balancing act with segment size
  @ken: roughly 300K unique terms
  @ken: (it’s a text field that we’re tokenizing/normalizing)
  @mayanks: Do you have text index for that column?
  @ken: No - it would be huge, and all we really need is term-level filtering
  @mayanks: Also, it might not make sense to have dictionary on that column (if you have filters on other columns)
  @mayanks: Ok, then explore no-dict index for that column
  @mayanks: Ok, once you have the index generated, please share the metadata.properties for that column.
  @mayanks: That will help me understand if no-dict or some other index might be better for that column
  @ken: OK, thanks
  @mayanks: For dict, we pad strings to make them the same length, and that could lead to a lot of storage wastage.
  @mayanks: Metadata will tell us
  @ken: wow, yes that would be an issue
  @ken: I could throw in a filter to remove long terms, which would also help
  @mayanks: no-dict will eliminate padding and hence reduce size. But there is no inv index for no-dict so you would need to rely on setting index on other columns
  @mayanks: for high cardinality with uneven string sizes, no-dict gives a better overall size
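For reference, a minimal sketch of where these choices land in the table config (`landingPageText_terms` is the column from this thread; `someOtherFilterColumn` is hypothetical):
```
"tableIndexConfig": {
  "noDictionaryColumns": ["landingPageText_terms"],
  "invertedIndexColumns": ["someOtherFilterColumn"]
}
```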
  @ken: Right, but sounds like what would be the best match for our use case would be a dictionary + inv index, without the forward index.
  @mayanks: I think the wastage from padding in dictionary might be the root cause, and if so, removing fwd index won’t help
  @mayanks: Let’s look at the index sizes and metadata once we have that
  @ken: Max term length is 20, average term length is 6, so assume 14 bytes/term * 400K terms is 5.6MB
  @ken: And yes, agree that examining the metadata is the right next step.
  @mayanks: Sounds good
  @ken: From metadata.properties: ```column.landingPageText_terms.cardinality = 144997
column.landingPageText_terms.totalDocs = 6100482
column.landingPageText_terms.dataType = STRING
column.landingPageText_terms.bitsPerElement = 18
column.landingPageText_terms.lengthOfEachEntry = 45
column.landingPageText_terms.columnType = DIMENSION
column.landingPageText_terms.isSorted = false
column.landingPageText_terms.hasNullValue = false
column.landingPageText_terms.hasDictionary = true
column.landingPageText_terms.textIndexType = NONE
column.landingPageText_terms.hasInvertedIndex = true
column.landingPageText_terms.isSingleValues = false
column.landingPageText_terms.maxNumberOfMultiValues = 4984
column.landingPageText_terms.totalNumberOfEntries = 312834131
column.landingPageText_terms.isAutoGenerated = false
column.landingPageText_terms.maxValue = \uFF42\uFF49\uFF5A
column.landingPageText_terms.defaultNullValue = null```
  @ken: And top four columns by size: ```landingPageText_terms.forward_index.size 743576242
destinationUrl.forward_index.size 199861484
creativeText.forward_index.size 99124657
imageUrl.forward_index.size 95375146```
  @mayanks: What about inv index and dict size for landingPageText?
  @mayanks: Oh it is multi valued?
  @ken: yes
  @mayanks: My guess inv index might be even bigger
  @mayanks: Can you share inv index and dict size?
  @ken: So where is inv index size?
  @mayanks: Index_map file
  @ken: All I’ve got for landingPageText_terms is: ```landingPageText_terms.dictionary.startOffset = 406482808
landingPageText_terms.dictionary.size = 6524873
landingPageText_terms.forward_index.startOffset = 413007681
landingPageText_terms.forward_index.size = 743576242```
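As a rough cross-check on these numbers: the dictionary is about cardinality × lengthOfEachEntry = 144997 × 45 ≈ 6.5 MB, which matches dictionary.size, so padding is not where the space is going; the bit-packed forward index is about totalNumberOfEntries × bitsPerElement / 8 = 312834131 × 18 / 8 ≈ 704 MB, close to the reported 743 MB, with the remainder presumably being the per-document offset structures of the multi-value layout. So the multi-value forward index dominates this column's footprint.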
  @mayanks: No inv index on this column?
  @ken: Hmm, says `hasInvertedIndex = true`
  @ken: from metadata file
  @ken: `column.landingPageText_terms.hasInvertedIndex = true`
  @mayanks: Unfortunately it always does
  @mayanks: If index map file does not show it and you don’t have it in indexing config, then there is no inv index
  @ken: and that column is in the `tableIndexConfig`’s `invertedIndexColumns` list
  @mayanks: Hmm
  @mayanks: Oh there’s this config to generate inv index offline vs in server during loading
  @ken: I didn’t build these segments, someone else at the company did, but I believe the tableIndexConfig matches
  @mayanks: But if index map does not have that info then it is not built yet
  @ken: ah, right `"createInvertedIndexDuringSegmentGeneration": false,`
  @mayanks: Typically inv index size for MV columns might be bigger than fwd index
  @ken: Should be a dict id, and a bitset, right?
  @ken: (compressed bitset, like RoaringDocIdSet)
  @mayanks: Yes
  @mayanks: I have seen this pattern in the past, where server OOMs when building inv index of MV columns (2GB limit)
  @mayanks: We get around by reducing num docs per segment
  @mayanks: Is adding more cores to server not an option?
  @ken: Adding more servers is an option, yes. Just trying to figure out bounds on what we can do here.
  @ken: But forward index for this column is 750M (out of a total of 886M, for this segment), so getting rid of that would be nice.
  @mayanks: Yes, agree, if you definitely need inv index on the column. Otherwise, we need to check which of the two is smaller (fwd vs inv)
  @ken: we need to be able to filter using terms, so yes I think the inv index is a requirement
  @mayanks: Well, if there are other filters in the query which eliminate a lot of rows, may be not
  @ken: Is there documentation on the format of the forward index? I’m also curious how that gets compressed (using Snappy?), if at all.
  @mayanks: Uses min number of bits to represent dictIds
  @mayanks: There is no additional compression on top of that for dict columns
  @ken: yeah, just seems like you’d need an additional table to map from docId to a bit offset into the bit-packed dictIds, and a count of how many dictIds exist for that docId. The Lucene index formats deal with similar issues, and get pretty complex trying to trade off size for lookup speed.
  @mayanks: For single value we don't need an offset; for MV, yes
  @ken: yes, I’m interested in the MV case
  @ken: Which source file should I look at, if there’s no documentation?
  @mayanks: `FixedBitMVForwardIndexWriter`
  @steotia: We recently did this for the text index, but didn't remove the forward index completely. The raw text data was huge and was taking up a ton of storage, so we stored a dummy value in the fwd index, dictionary encoded
  @steotia: This was much easier than changing the semantics completely by not having the fwd index physically
  @ken: Thanks @steotia I guess I could look into that and see how hard it would be to do the same thing for an arbitrary column, given a table config setting.
  @ken: @steotia - where exactly in the code are you writing out a dummy fwd index for text columns?
  @steotia: @ken see this PR
  @steotia: This doesn't go all the way in not having the fwd index physically. I am interested in seeing how we can possibly not have the fwd index at all, and whether it is worth it or not, given that with the above change the storage overhead is already significantly reduced
  @g.kishore: isn't it a matter of having an empty forward index reader impl?

#getting-started


@zac: @zac has joined the channel
@zac: Hey folks - I'm trying to get the JDBC client working but running into an issue: ```java.lang.NoClassDefFoundError: org/apache/pinot/client/JsonAsyncHttpPinotClientTransportFactory``` I've tried running both v0.6.0 () and 0.5.0 (version ) but both produce the same error. I've also tried compiling the jar from source, as well as including it as an explicit dependency in Maven. Any help is appreciated, thanks!
  @fx19880617: can you try to put pinot-java-client.jar into your classpath as well?
@g.kishore: @kharekartik ^^
@kharekartik: @kharekartik has joined the channel