Apache Pinot Daily Email Digest (2021-03-18)

Pinot Slack Email Digest Thu, 18 Mar 2021 19:00:33 -0700

#general

@virtualandy: @virtualandy has joined the channel
@osman: @osman has joined the channel
@ravi.maddi: Hi All, one basic doubt, I run quick start stream, I understand the all the ports and components behind that, I am not able to understand about *2191. what is running with 2191 port?*
@g.kishore: Zookeeper. Please use <#C011C9JHN7R|troubleshooting> channel
@joshhighley: Will upsert work with hybrid tables? Will a realtime record become active over an offline record having the same primary key value?
@mayanks: Currently, upsert support is limited to real-time tables only.
@jackie.jxt: No. The plan is to support uploading segments to real-time table, and it is a work in progress

#random

@virtualandy: @virtualandy has joined the channel
@osman: @osman has joined the channel

#feat-text-search

@brianolsen87: @brianolsen87 has joined the channel

#feat-presto-connector

@brianolsen87: @brianolsen87 has joined the channel

#troubleshooting

@phuchdh: hi team. Is there anyway to check minion task: `RealtimeToOfflineSegmentsTask` status or error message ? I’m find warn log in brokers-server. But cannot find any log information in minion-server. Is this log relations to this task ? ```2021/03/17 09:26:49.639 WARN [TimeBoundaryManager] [HelixTaskExecutor-message_handle_thread] Failed to find segment with valid end time for table: RuleLogsUAT_OFFLINE, no time boundary generated 2021/03/17 09:27:06.989 WARN [BaseBrokerRequestHandler] [jersey-server-managed-async-executor-0] Failed to find time boundary info for hybrid table: RuleLogsUAT```
@npawar: The messages will be either in the controller or minion log. Broker messages will not be related to the minion task
@mayanks: The log you posted above happens when the endTime in the segment zk metadata is <= 0:
@mayanks:
@phuchdh: I’m found the errors logs in controller-logs ```2021/03/17 10:34:53.192 WARN [ZkClient] [TaskJobPurgeWorker-pinot-quickstart] Failed to delete path /pinot-quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349! org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /pinot-quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 10:35:07.504 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(a699ebbf_TASK)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 10:35:07.517 ERROR [TaskUtil] [TaskJobPurgeWorker-pinot-quickstart] Job TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 exists in JobDAG but JobConfig is missing! Job might have been deleted manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left in the DAG due to a failed clean-up attempt from last purge. 2021/03/17 10:35:07.607 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(fa8a46d1_TASK)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 10:35:07.796 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(329800c1)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 10:35:07.878 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(6e9abcf3_TASK)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'realtime.segment.flush.threshold.rows' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.consumer.prop.group.id' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.decoder.class.name' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'streamType' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'realtime.segment.flush.segment.size' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.consumer.type' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.broker.list' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'realtime.segment.flush.threshold.time' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.consumer.prop.auto.offset.reset' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.consumer.factory.class.name' was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.topic.name' was supplied but isn't a known config. 2021/03/17 11:04:53.201 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(4edd07a9)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 11:30:46.186 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(597f559b_TASK)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 11:30:46.361 WARN [ZkClient] [TaskJobPurgeWorker-pinot-quickstart] Failed to delete path /pinot-quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615891980352! org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /pinot-quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615891980352 2021/03/17 11:30:46.406 WARN [ZkBaseDataAccessor] [HelixController-pipeline-task-pinot-quickstart-(27f5421b_TASK)] Fail to read record for paths: {/pinot-quickstart/INSTANCES/Server_pinot-server-0.pinot-server-headless.analytics.svc.cluster.local_8098/MESSAGES/49538041-4770-474f-b8b3-a414e638244f=-101} 2021/03/17 11:30:46.486 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(27f5421b_TASK)] Job configuration is NULL for TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 2021/03/17 11:30:46.515 ERROR [TaskUtil] [TaskJobPurgeWorker-pinot-quickstart] Job TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615891980352 exists in JobDAG but JobConfig is missing! Job might have been deleted manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left in the DAG due to a failed clean-up attempt from last purge. 2021/03/17 11:30:53.066 WARN [TopStateHandoffReportStage] [HelixController-pipeline-default-pinot-quickstart-(aafda2de_DEFAULT)] Event aafda2de_DEFAULT : Cannot confirm top state missing start time. Use the current system time as the start time. 2021/03/17 11:30:53.159 WARN [TopStateHandoffReportStage] [HelixController-pipeline-default-pinot-quickstart-(50b48792_DEFAULT)] Event 50b48792_DEFAULT : Cannot confirm top state missing start time. Use the current system time as the start time. 2021/03/17 11:32:55.353 WARN [SegmentCompletionFSM_RuleLogs__0__0__20210316T1130Z] [grizzly-http-server-0] COMMITTER_NOTIFIED:Aborting FSM (too late) instance=Server_pinot-server-2.pinot-server-headless.analytics.svc.cluster.local_8098 offset=17939 now=1615980775353 start=1615980646028```
@phuchdh: my Realtime Table Config ```{ "REALTIME": { "tableName": "RuleLogs_REALTIME", "tableType": "REALTIME", "segmentsConfig": { "replication": "2", "replicasPerPartition": "2", "timeColumnName": "created_at_days_epoch", "schemaName": "RuleLogs" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant", "tagOverrideConfig": {} }, "tableIndexConfig": { "sortedColumn": [ "campaign_id", "rule_id" ], "loadMode": "MMAP", "invertedIndexColumns": [ "user_id", "device_id" ], "autoGeneratedInvertedIndex": false, "createInvertedIndexDuringSegmentGeneration": false, "streamConfigs": { "streamType": "kafka", "stream.kafka.topic.name": "xxx", "stream.kafka.broker.list": "confluent-cp-kafka-headless.kafka.svc.cluster.local:9092", "stream.kafka.consumer.prop.group.id": "c1.promotion", "stream.kafka.consumer.type": "lowLevel", "stream.kafka.consumer.prop.auto.offset.reset": "smallest", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "realtime.segment.flush.threshold.rows": "0", "realtime.segment.flush.threshold.time": "24h", "realtime.segment.flush.segment.size": "200M" }, "enableDefaultStarTree": false, "enableDynamicStarTreeCreation": false, "aggregateMetrics": false, "nullHandlingEnabled": false }, "metadata": { "customConfigs": {} }, "quota": {}, "task": { "taskTypeConfigsMap": { "RealtimeToOfflineSegmentsTask": { "collectorType": "concat", "bucketTimePeriod": "1d", "bufferTimePeriod": "1h", "maxNumRecordsPerSegment": "1000000" } } }, "routing": {}, "query": {}, "ingestionConfig": { "transformConfigs": [ { "columnName": "device_id", "transformFunction": "jsonPathString(extra_data, '$.user_attributes.audience.device_id')" }, { "columnName": "service_code", "transformFunction": "jsonPathString(extra_data, '$.service_code')" }, { "columnName": "event", "transformFunction": "jsonPathString(extra_data, '$.event')" }, { "columnName": "created_at_days_epoch", "transformFunction": "toEpochDays(created_at_ts)" } ] }, "isDimTable": false } }```
@npawar: @jackie.jxt any idea regarding this error? It's coming from the task framework I think?
@jackie.jxt: Yes, the error is from Helix
@jackie.jxt: @phuchdh Did you delete any task through controller API? ```2021/03/17 10:35:07.517 ERROR [TaskUtil] [TaskJobPurgeWorker-pinot-quickstart] Job TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349 exists in JobDAG but JobConfig is missing! Job might have been deleted manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left in the DAG due to a failed clean-up attempt from last purge.```
@jackie.jxt: Also, the `WARN` from broker means either the offline segment has invalid end time, or there is no offline segment. Can you check if any segment is pushed to the offline table?
@phuchdh: i don’t delete any task through controller API.
@phuchdh: In my scenario. I config my realtime job segments create in 1 days, then i except config segment task convert realtime table to offline table.
@phuchdh: here is some segments yesterday & today.
@phuchdh: but when i check segments created yesterday, it’s status is OFFLINE. So i think i have problem with realtime segments to offline segments task
@phuchdh:
@phuchdh: but another partition was mark by status ONLINE
@ali: @ali has left the channel
@virtualandy: @virtualandy has joined the channel
@osman: @osman has joined the channel
@ravi.maddi: ** I am trying to push to kafka, I am not able to get any thing as response. And data not appearing in the consumer console also. *Please help me* how to trace it , where can I found logs. is there any *options to add to my command* to see more detailed log. I am running this command: ```bin/kafka-console-producer.sh --broker-list localhost:19092 --topic mytopic opt_flatten_json.json``` I am getting output this only: ```>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>```
@fx19880617: for console producer, you need to type the data then press ENTER to send one message
@fx19880617: Please read this one:
@fx19880617: also this:
@pabraham.usa: Hello, I have a JSON data `log` and want to extract values based on keys (`urlpath`). So tried to use JSONIndex however fails during parsing. So ingested it as normal string and tried JSONEXTRACTSCALAR/*json_extract_scalar* however this also fails during parsing. Finally I ended up using Groovy function like `GROOVY('{"returnType": "STRING", "isSingleValue": true}', 'java.util.regex.Pattern p = java.util.regex.Pattern.compile("(\"urlpath\":\")((?:\\\"|[^\\\"]*))"); java.util.regex.Matcher m = p.matcher(arg0); if(m.find()){ return m.group(2); } else { return "";}',log)` and this works in SQL. Now I want to add this Groovy function inside table config to do ingestionTransform to define a new columnName. Is this possible? For ingestion transform can we do multi line , semi colon separated script?

#pinot-dev

@brianolsen87: @brianolsen87 has joined the channel
@ken: If I need to determine the number of groups from an aggregation query, where the groups are filtered by aggregation result, are there any recommended approaches? E.g. group by minute, sum page views, and I only care about minutes with > 1000 page views - what’s a good way to determine the number of interesting minutes? Assume there can be many (e.g. > 1M groups) for my specific use case, so I can’t do an order by with some large limit.
@mayanks: Are you referring to the `Having` clause? If so, that is supported.
@ken: I can use `having <filter>` to restrict groups, but is there a way to count the number of groups without returning the groups (and using some arbitrarily huge limit)?
@ken: (side note - what page is `having` documented on?
@mayanks: `Having` follows SQL syntax so might not be explicitly documented. But that brings up a good point, perhaps we should catalogue what is supported and what isn't (from SQL).
@mayanks: I am unsure if there's a better way to do what you want other than `Having`. Perhaps the `Having` implementation can be optimized to do filtering on server side (pre combine) if the query allows for it (for example if a monotonically increasing aggr function has a filter).
@ken: If the data is segmented by the group key then I would imagine server-side filtering is a possible optimization, otherwise I think you don’t know whether the aggregation results of the gather phase would pass the filter until all results have been combined from all servers. And that means an unbounded amount of memory for the priority map (or whatever is used to collect the results).
@mayanks: I was referring to cases like count(*) or sum (with +ve numbers) and filters like xxx > yyy where you can safely do filtering in scatter phase. Agreed though, you can't always do that.

#getting-started

@virtualandy: @virtualandy has joined the channel

#segment-write-api

@npawar: should we have another meeting tomorrow?
@yupeng: ok for me
@chinmay.cerebro: :thumbsup:
@yupeng: will 3:30-4 work for you @npawar
@npawar: yes works
@fx19880617:
--------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]