<h3><u>#general</u></h3><br><strong>@yupeng: </strong>hi team, I’d like to hear
your thoughts on this issue
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMSfW2QiSG4bkQpnpkSL7FiK3MHb8libOHmhAW89nP5XKI1pSQGgFuD-2BXBsUHb4ZzZg-3D-3DoJEQ_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTzxuE24xTazj5IF8jydOi3B-2BdbJRgm52iUl5-2BOEKMeImBnI9seGQ0e7ikAb1-2BXiLVmP16ZmI8-2BRAuxE-2FsFy3uyqK4XUN6Afz9tpiagKKnnznHqcpzNYB7-2FZuyNc9zWI3x-2BEA-2BS90E8d8uonDadZpqW7OrCDMB6JbWzNqF4RaHygo-2B-2Ffi9p1bzFcGSG0pJenngk-3D><br><h3><u>#troubleshooting</u></h3><br><strong>@quietgolfer:
</strong>Hi. I'm getting the following error related to my realtime table
ingesting events from kafka. `insertions` is just a long metric column. It's
not a dimension. I'm not sure why it's hitting this.
```Metrics aggregation cannot be turned ON in presence of dictionary encoded
metrics, eg: insertions```
<br><strong>@steotia: </strong>For realtime metrics aggregation, all metric
columns should be noDictionary<br><strong>@mayanks: </strong>This
^^<br><strong>@mayanks: </strong>Also dimension columns cannot be no
dict<br><strong>@steotia: </strong>we should probably consider adding support
for realtime metrics aggregation with dictionary encoded metric columns and
remove this restriction unless it is infeasible<br><strong>@g.kishore:
</strong>we can, but it is inefficient<br><strong>@quietgolfer: </strong>I don't
think these metrics need to be dictionary encoded. I assumed metric columns by
default would not be dictionary encoded.<br><strong>@steotia: </strong>all
columns by default are dictionary encoded AFAIK<br><strong>@quietgolfer:
</strong>Looks like the docs say that. I'll update my config. Thanks!
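<br>For reference, the config change being described here would look roughly like the following under `tableIndexConfig` (a sketch only; `insertions` is the metric column from this thread, and `aggregateMetrics` is assumed to be the flag that turns on realtime metrics aggregation):
```
"tableIndexConfig": {
  "loadMode": "MMAP",
  "aggregateMetrics": true,
  "noDictionaryColumns": [
    "insertions"
  ],
  ...
}
```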
<br><strong>@quietgolfer: </strong>Separate topic, my `pinot-server-0` log is
showing that my params in `streamConfigs`
were `supplied but isn't a known config`. Any ideas?
```2020/06/26 17:26:05.729 WARN [ConsumerConfig]
[HelixTaskExecutor-message_handle_thread] The configuration
'stream.kafka.hlc.zk.connect.string' was supplied but isn't a known config.
2020/06/26 17:26:05.729 WARN [ConsumerConfig]
[HelixTaskExecutor-message_handle_thread] The configuration
'realtime.segment.flush.threshold.size' was supplied but isn't a known config.
2020/06/26 17:26:05.737 WARN [ConsumerConfig]
[HelixTaskExecutor-message_handle_thread] The configuration
'stream.kafka.decoder.class.name' was supplied but isn't a known config.
2020/06/26 17:26:05.737 WARN [ConsumerConfig]
[HelixTaskExecutor-message_handle_thread] The configuration 'streamType' was
supplied but isn't a known config.
...```
Here is what my table config looks like (with some repetitive parts removed)
```
{
  "tableName": "metrics",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "5",
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "metrics",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "platform_id",
          ... my other dimensions
        ],
        "skipStarNodeCreationForDimensions": [
        ],
        "functionColumnPairs": [
          "SUM__insertions",
          ... my other metrics
        ]
      }
    ],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "simple",
      "stream.kafka.topic.name": "metrics-realtime",
      ... all of the configs are ignored
    }
  },
  "tenants": {},
  "metadata": {
    "customConfigs": {}
  }
}
```<br><strong>@npawar: </strong>those are coming from the kafka consumer.
because we put the entire streamConfigs map from pinot into the consumer
configs. those should be harmless.<br><strong>@g.kishore: </strong>a good
beginner task to fix<br><strong>@quietgolfer: </strong>Cool. I was hitting an
issue where my Pinot server temporarily stopped ingesting events from Kafka and
was trying to figure out if there was an issue. It's working again and still
logging those lines.<br><strong>@pradeepgv42: </strong>Hi, I am having trouble
querying the recent data from Pinot, `select * from <tablename> order by
timestamp limit 10` returns empty.
I also see some of the old segments to be in error state from the broker logs
```
Resource: <tablename>_REALTIME, partition:
<tablename>__7__0__20200625T1913Z is in ERROR state
Resource: <tablename>_REALTIME, partition:
<tablename>__8__0__20200625T1913Z is in ERROR state
Resource: <tablename>_REALTIME, partition:
<tablename>__9__0__20200625T1913Z is in ERROR state ```
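<br>As an aside, the broker log above only says that these segments are in ERROR state; one way to see the full per-segment state for the table is to compare its ideal state with its external view via the controller REST API. The endpoint paths below are an assumption based on the standard Pinot controller API (adjust host and port for your deployment):
```
curl -X GET --header 'Accept: application/json' 'http://<controller-host>:<controller-port>/tables/<tableName>/idealstate'
curl -X GET --header 'Accept: application/json' 'http://<controller-host>:<controller-port>/tables/<tableName>/externalview'
```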
<br><strong>@pradeepgv42: </strong>Wondering how I should go about debugging
this?<br><strong>@mayanks: </strong>The server log should show you why the
segment went into error state.<br><strong>@pradeepgv42: </strong>Above is from
server logs, for example: `<tablename>__9__0__20200625T1913Z` is one of
the segments that went into error state (though it’s an older
segment)<br><strong>@pradeepgv42: </strong>I don’t see any logs on recent
segments<br><strong>@mayanks: </strong>Can you grep for one of the segments in
error state in all the server logs (if you have separate logs per day)? It
should show some error/exception that caused this to
happen.<br><strong>@mayanks: </strong>Alternatively, you can try to re-start
the server and see if the segment still gets into ERROR
state.<br><strong>@pradeepgv42: </strong>tried restarting, seems like a helix
issue, let me dig a bit more<br><strong>@mayanks: </strong>Are these old
segments still within your retention period?<br><strong>@pradeepgv42:
</strong>yeah these are within ttl<br><strong>@pradeepgv42: </strong>for example
one of the exceptions I see in the server logs
```
Sleep for 10000ms as service status has not turned GOOD:
IdealStateAndCurrentStateMatchServiceStatusCallback:partition=<tablename>__57__0__20200625T1913Z, expected=ONLINE, found=OFFLINE, creationTime=1593202611703, modifiedTime=1593202650032, version=95, waitingFor=CurrentStateMatch, resource=<tablename>_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;
Caught exception in state transition from OFFLINE -> ONLINE for resource: <tablename>_REALTIME, partition: <tablename>__57__0__20200625T1913Z
java.lang.RuntimeException: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
    at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadAndReplaceSegment(RealtimeTableDataManager.java:286) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
    at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:252) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:132) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164) [pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) ~[?:?]```
<br><strong>@g.kishore: </strong>did you set up deep store for
Pinot?<br><strong>@g.kishore: </strong>also in the
controller<br><strong>@pradeepgv42: </strong>yup I did<br><strong>@pradeepgv42:
</strong>For example
`<https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYXS2B8cms5oWW28KNpfKTsh6piBjoaaV6QhCqBmE5sUHhBye58ZrKlaiDRtBPSCCE51v5XZeDfyD1S3hnoqXdIliQR6Y2Wkb5R19TUhRhwsEyU8OLfwrN29kq0O3r-2Bfqsg-3D-3DmbVl_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTzxuE24xTazj5IF8jydOi3B5wgxt0j2oUIvMqFLGxCCaE3faCEtG5UgaGxNZXyMK89C4ka8ViFiCiDJ2KZjVrUBtjQ1OeZ2u4NwK4mAgyLmvogCYie46Ttl3GizNj0ntXzSzLyLuB4a6Dg8nQmuNM-2FTrtE6-2FcZjFg1gVbYJh36XiEEAxGuj-2F5Qh8JrO6NdC7EM-3D>
/segments/{tableName}/servers` api returns segments distributed across two
servers (currently i have tagged one as realtime and other as
offline)<br><strong>@g.kishore: </strong>does it print the segment uri in the
log?<br><strong>@g.kishore: </strong>also whats the metadata for that segment
```curl -X GET --header 'Accept: application/json'
'<https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYV9TJa4jsQAayVdH8o3CuHA-3DufqM_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTzxuE24xTazj5IF8jydOi3BtV43ptDuEpRu0SmGJnECS1162XrXff5o5PXcbx-2FBJUrpMo4gzgq-2BCSVEJKcx3RRoDY3LPLlL1gYtF3J-2BRbHnjrQbqaWyAN0PXieW34-2B1hmnbeY9WQ87f-2BAGrneyXg-2BpixnORagv0oqXr53tIJPc8mnX2MqBxBemqNe8JoO72H58-3D>:port/segments/<tableName>/<segment_name>/metadata'```<br><strong>@pradeepgv42:
</strong>I don’t actually see any segments downloaded into the offline tagged
machine, I guess swagger api was just showing me the ideal state.
```
Got temporary error status code: 500 while downloading segment from:
https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYfnYZT1b59-2BCtX0F9uq9RLCpM3QBUUVUhd7oXmBCkx6bUPzoO-2BEuohDBm80ug6n8nczMpCuuzrJzIJpk-2FSWGG4laQgi-2FvazT-2BQXkB25D6W2zFP-R_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTzxuE24xTazj5IF8jydOi3BBOv4rL11kf0h5vxlzrq4GWdlRnsD8c5I3eS8f2jEv8SE0juBbavvVseIrYS-2FBZ5e4lMCWP5eA0pxhnZzWGuf3U5-2FLlGrjmu6qc0n5jKVsVuOw4uC-2Fpsh8zv82dUsnaQ2RTt1sr6JQHGCSqag7vXC1700IJbfCp5dFw8UO9TBMOA-3D
3Z to:
/home/ubuntu/pinot/data/<table>_REALTIME/<table>__60__0__20200625T1913Z.tar.gz
org.apache.pinot.common.exception.HttpErrorStatusException: Got error status
code: 500 (Internal Server Error) with reason: "Failed to read response into
file:
/home/ubuntu/data/fileDownloadTemp/<table>/<table>__60__0__20200625T1913Z-1386265827082994"
while sending request:
https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYfnYZT1b59-2BCtX0F9uq9RLCpM3QBUUVUhd7oXmBCkx6bUPzoO-2BEuohDBm80ug6n8nczMpCuuzrJzIJpk-2FSWGG4mNX7bYCzJS-2FSY86ntWpwCHKSKL_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTzxuE24xTazj5IF8jydOi3Bx-2Fs-2B4EsdJ8q9cic48NJoDVOK69SP-2FXQaXQd4Cs7TP9BZJpDgW4CQZX0ebtsxlIqOvLFdtFzkq7GT07oEVuiWypdcrBhYU56bRghoMmdSmJu7HulfFt7Gk8BjEzUnJrZjnoj5xxkGqvq-2Fph07mA8XYeZqd8dED-2BG2gPIyHbG2xW8-3D
to controller: <ip>, version: Unknown```
<br><strong>@pradeepgv42: </strong>```{
"segment.realtime.endOffset": "78125",
"segment.time.unit": "MILLISECONDS",
"segment.start.time": "1593031658266",
"segment.flush.threshold.size": "78125",
"segment.realtime.startOffset": "0",
"segment.end.time": "1593080340734",
"segment.total.docs": "78125",
"segment.table.name": "<table>_REALTIME",
"segment.realtime.numReplicas": "1",
"segment.creation.time": "1593112417336",
"segment.realtime.download.url":
"https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYfnYZT1b59-2BCtX0F9uq9RLCpM3QBUUVUhd7oXmBCkx6bUPzoO-2BEuohDBm80ug6n8nbCWJjQacoXVEVgtCZ4nN8HGZXplD0mo-2Fq3hL7XdhISxY5f-_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTzxuE24xTazj5IF8jydOi3B0PR9K5-2FYxCy6Arm3bi4K5yhyUegd8Y-2Bc5vaxOVFihBPZScptmO-2BmWCIri8985ptaxkmUeS0BAX2Ms7BRv7EEadkzED6FOGD0aZSJ6W-2BiCOHkdXK59txRV3GKRoRDJzoPpEsawJwk9YlTCUfPx6WZ41RjyPso5-2FzTqnMwufqFHAM-3D
"segment.name": "<table>__57__0__20200625T1913Z",
"segment.index.version": "v3",
"custom.map": null,
"segment.flush.threshold.time": null,
"segment.type": "REALTIME",
"segment.crc": "4231660115",
"segment.realtime.status": "DONE"
}```<br><strong>@pradeepgv42: </strong>is the download url supposed to be
something else?<br><strong>@g.kishore: </strong>it should be a valid url
pointing to your s3<br><strong>@pradeepgv42: </strong>I do see the segments in
S3 though<br><strong>@pradeepgv42: </strong>let me double check my
config<br><strong>@g.kishore: </strong>but that url seems to be pointing to
controller<br><strong>@pradeepgv42: </strong>yup<br><strong>@g.kishore:
</strong>and not s3<br><strong>@g.kishore: </strong>that means segments were
getting uploaded to controller<br><strong>@g.kishore: </strong>which is the
default<br><strong>@pradeepgv42: </strong>weird, I see the exact segment being
present in S3<br><strong>@pradeepgv42: </strong>and some of the latest segments
are getting uploaded into S3 too<br><strong>@pradeepgv42: </strong>I found this
in the controller logs
```
Could not get directory entry for
s3://<bucket>/<dir>/<table>/<table>__57__0__20200625T1913Z
Copy
/home/ubuntu/data/fileUploadTemp/<table>__57__0__20200625T1913Z.0943bed4-49e7-4bec-999b-35e9802b3d73
from local to
s3://<bucket>/<dir>/<table>/<table>__57__0__20200625T1913Z
Processing segmentCommitEnd(Server_<ip>_8098, 78125)
Committing segment <table>__57__0__20200625T1913Z at offset 78125 winner
Server_<ip>_8098
Committing segment metadata for segment: <table>__57__0__20200625T1913Z```
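<br>For context on the download URL question below: when a consuming segment commits, the completed segment is uploaded to wherever the controller's data dir points, and that location is what gets recorded as the segment's download URL; if the data dir is the controller's local disk (the default), the URL points at the controller itself rather than S3. Configuring the controller to use S3 as the segment store looks roughly like the sketch below (keys and values are assumptions for Pinot ~0.4.0 with the S3 filesystem plugin; verify against the docs for your version, and mirror the `pinot.server.storage.factory.*` / `pinot.server.segment.fetcher.*` settings on the servers):
```
controller.data.dir=s3://<bucket>/<dir>
controller.local.temp.dir=/tmp/pinot-controller-tmp
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=<aws-region>
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```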
<br><strong>@pradeepgv42: </strong>Okay, I think I got a rough picture of
what’s going on: on the REALTIME server, the helix state for the segment was
changed from ONLINE to OFFLINE and then to DROPPED, and in parallel, on the
OFFLINE server, the segment state was changed from OFFLINE to ONLINE. The
REALTIME server part seems to have gone through, but the OFFLINE part is stuck
because of the S3 issue.
1. Still not sure why the segment download url is set to the `controller ip`
2. IIUC, does having 1 replica imply that some segments might not be available
for querying while they are being moved across servers? <br>
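On question 2: with a single replica there is no second copy to serve queries from, so a segment can be temporarily unqueryable while it is being moved between servers (here, while the OFFLINE-tagged server's segment download is failing). The usual mitigation is to raise the replica counts in `segmentsConfig`, roughly as sketched below (values are illustrative):
```
"segmentsConfig": {
  ...
  "replication": "2",
  "replicasPerPartition": "2"
}
```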