DixitThinkbiz opened a new issue, #15285:
URL: https://github.com/apache/pinot/issues/15285
## Issue: SegmentGenerationAndPushTask Fails with FileNotFoundException
### Summary
We are attempting to use S3 as deep storage for our Apache Pinot deployment.
When the minion runs `SegmentGenerationAndPushTask`, the task fails with a
`FileNotFoundException` reporting that a segment tar file could not be found.
The failure appears to occur when the minion attempts to push the generated segment.
### Error Log
```
2025/03/17 10:03:01.581 ERROR [TaskFactoryRegistry] [TaskStateModelFactory-task_thread-2] Caught exception while executing task: Task_SegmentGenerationAndPushTask_c870e94e-5051-4ca5-a9ca-31ab89c1118c_1742205780838_0
java.lang.RuntimeException: Failed to execute SegmentGenerationAndPushTask
    at org.apache.pinot.plugin.minion.tasks.segmentgenerationandpush.SegmentGenerationAndPushTaskExecutor.executeTask(SegmentGenerationAndPushTaskExecutor.java:128)
    ...
Caused by: java.io.FileNotFoundException: /employee_attendance_OFFLINE_1742205781534_1742205781534_0_741c8aac-0020-4e69-aaa1-75e880255b68.tar.gz (No such file or directory)
```
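One detail in the log may be worth isolating: where the missing tar file was expected to live. The sketch below pulls the path out of the `Caused by` line and splits it into directory and file name. The helper name `missingFileFromLog` and the regular expression are illustrative assumptions about this log shape, not part of Pinot.

```javascript
// Hypothetical helper: extract the path reported by FileNotFoundException
// and split it into directory and file name.
function missingFileFromLog(logText) {
  // Collapse line wraps so a path that was wrapped onto its own line still matches.
  const flat = logText.replace(/\s+/g, " ");
  const match = flat.match(
    /FileNotFoundException: (\S+) \(No such file or directory\)/
  );
  if (!match) return null;

  const filePath = match[1];
  const slash = filePath.lastIndexOf("/");
  let directory;
  if (slash === -1) directory = "."; // relative file name, no directory part
  else if (slash === 0) directory = "/"; // file sits directly under the root
  else directory = filePath.slice(0, slash);

  return { filePath, directory, fileName: filePath.slice(slash + 1) };
}

// The "Caused by" line from the error log above.
const sampleLog = `Caused by: java.io.FileNotFoundException:
/employee_attendance_OFFLINE_1742205781534_1742205781534_0_741c8aac-0020-4e69-aaa1-75e880255b68.tar.gz
(No such file or directory)`;

const missing = missingFileFromLog(sampleLog);
console.log(missing.directory, missing.fileName);
```

For this log the directory component is `/`, meaning the tar file was looked up directly under the filesystem root. Whether that indicates an empty or unresolved local output directory on the minion is a guess to verify, not a confirmed cause.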
### Environment
- **Apache Pinot Version:** 1.2.0
- **Deep Storage:** S3 (configured with S3PinotFS)
- **Kafka Topic:** `kafka_pinot_poc.public.employee_attendance.transformed`
- **Additional Info:** MinIO is used as the S3 endpoint in the server
  configuration.
### Reproduction Steps
1. **Schema & Table Configurations:**
   Set up the schema and both realtime and offline table configurations as
   specified in our configuration files.
2. **Service Setup:**
   Deploy the Controller, Server, and Minion using the provided
   configuration files (refer to sections below).
3. **Task Execution:**
   The `SegmentGenerationAndPushTask` is triggered (as scheduled) but fails
   with a `FileNotFoundException` for the expected tar file.
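For anyone trying to reproduce this, the steps above can be sketched as controller REST calls. This is a minimal sketch, not our exact tooling: `CONTROLLER_URL` is assumed from `controller.vip.host` and `controller.vip.port` in `controller.conf`, the helper names are hypothetical, and the objects passed in are trimmed placeholders for the full configurations listed below. The manual `POST /tasks/schedule` call is optional, since the task is also triggered by its cron schedule.

```javascript
// Assumed controller address (from controller.vip.host / controller.vip.port).
const CONTROLLER_URL = "http://127.0.0.1:9000";

// Build the requests that register the schema and the table configs.
function buildSetupRequests(controllerUrl, schema, tableConfigs) {
  const headers = { "Content-Type": "application/json" };
  const requests = [
    { method: "POST", url: `${controllerUrl}/schemas`, headers, body: JSON.stringify(schema) },
  ];
  for (const tableConfig of tableConfigs) {
    requests.push({
      method: "POST",
      url: `${controllerUrl}/tables`,
      headers,
      body: JSON.stringify(tableConfig),
    });
  }
  return requests;
}

// Build the request that asks the controller to schedule one task type for one table.
function buildScheduleRequest(controllerUrl, taskType, tableNameWithType) {
  const query = new URLSearchParams({ taskType, tableName: tableNameWithType });
  return { method: "POST", url: `${controllerUrl}/tasks/schedule?${query}` };
}

// Send the requests in order (needs a runtime with a global fetch, e.g. Node 18+).
// Defined here but not invoked, so the sketch runs without a live cluster.
async function sendAll(requests) {
  for (const { url, method, headers, body } of requests) {
    const response = await fetch(url, { method, headers, body });
    console.log(method, url, response.status);
  }
}

// Trimmed placeholders; substitute the full objects from the sections below.
const exampleSchema = { schemaName: "employee_attendance" };
const exampleTables = [
  { tableName: "employee_attendance_REALTIME", tableType: "REALTIME" },
  { tableName: "employee_attendance_OFFLINE", tableType: "OFFLINE" },
];

const setupRequests = buildSetupRequests(CONTROLLER_URL, exampleSchema, exampleTables);
const scheduleRequest = buildScheduleRequest(
  CONTROLLER_URL,
  "SegmentGenerationAndPushTask",
  "employee_attendance_OFFLINE"
);
console.log(setupRequests.map((r) => `${r.method} ${r.url}`), scheduleRequest.url);
```

Calling `sendAll([...setupRequests, scheduleRequest])` against a running cluster would perform the setup and trigger the task once.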
### Configuration Details
#### Schema Configuration
```js
const pinotSchemaAttendance = {
  schemaName: "employee_attendance",
  dimensionFieldSpecs: [
    { name: "attendance_id", dataType: "INT" },
    { name: "employee_id", dataType: "INT" },
  ],
  dateTimeFieldSpecs: [
    {
      name: "punch_time",
      dataType: "TIMESTAMP",
      format: "1:MILLISECONDS:EPOCH",
      granularity: "1:MILLISECONDS",
    },
  ],
  primaryKeyColumns: ["attendance_id"],
};
```
#### Realtime Table Configuration
```js
const pinotTableConfigAttendanceRealtime = {
  tableName: "employee_attendance_REALTIME",
  tableType: "REALTIME",
  segmentsConfig: {
    schemaName: "employee_attendance",
    replication: "1",
    retentionTimeUnit: "DAYS",
    retentionTimeValue: "15",
    replicasPerPartition: "1",
    minimizeDataMovement: false,
    timeColumnName: "punch_time",
  },
  tenants: {
    broker: "DefaultTenant",
    server: "DefaultTenant",
    tagOverrideConfig: {},
  },
  tableIndexConfig: {
    invertedIndexColumns: [],
    noDictionaryColumns: [],
    streamConfigs: {
      streamType: "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name":
        "kafka_pinot_poc.public.employee_attendance.transformed",
      "stream.kafka.decoder.class.name":
        "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name":
        "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "192.168.1.120:9092",
      "realtime.segment.flush.threshold.rows": "10",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
    },
    loadMode: "MMAP",
    onHeapDictionaryColumns: [],
    varLengthDictionaryColumns: [],
    enableDefaultStarTree: false,
    enableDynamicStarTreeCreation: false,
    aggregateMetrics: false,
    nullHandlingEnabled: false,
    rangeIndexColumns: [],
    rangeIndexVersion: 2,
    optimizeDictionary: false,
    optimizeDictionaryForMetrics: false,
    noDictionarySizeRatioThreshold: 0.85,
    autoGeneratedInvertedIndex: false,
    createInvertedIndexDuringSegmentGeneration: false,
    sortedColumn: [],
    bloomFilterColumns: [],
  },
  metadata: {},
  quota: {},
  task: {
    taskTypeConfigsMap: {
      RealtimeToOfflineSegmentsTask: {
        bucketTimePeriod: "1h",
        bufferTimePeriod: "2h",
        mergeType: "concat",
        maxNumRecordsPerSegment: "100000",
        schedule: "0 * * * * ?",
      },
    },
  },
  routing: {},
  query: {
    timeoutMs: 60000,
  },
  ingestionConfig: {
    continueOnError: false,
    rowTimeValueCheck: false,
    segmentTimeValueCheck: true,
  },
  isDimTable: false,
};
```
#### Offline Table Configuration
```js
const pinotTableConfigAttendanceOffline = {
  tableName: "employee_attendance_OFFLINE",
  tableType: "OFFLINE",
  segmentsConfig: {
    schemaName: "employee_attendance",
    replication: "1",
    replicasPerPartition: "1",
    timeColumnName: "punch_time",
    minimizeDataMovement: false,
    segmentPushType: "APPEND",
    segmentPushFrequency: "HOURLY",
  },
  tenants: {
    broker: "DefaultTenant",
    server: "DefaultTenant",
  },
  tableIndexConfig: {
    invertedIndexColumns: [],
    noDictionaryColumns: [],
    rangeIndexColumns: [],
    rangeIndexVersion: 2,
    createInvertedIndexDuringSegmentGeneration: false,
    autoGeneratedInvertedIndex: false,
    sortedColumn: [],
    bloomFilterColumns: [],
    loadMode: "MMAP",
    onHeapDictionaryColumns: [],
    varLengthDictionaryColumns: [],
    enableDefaultStarTree: false,
    enableDynamicStarTreeCreation: false,
    aggregateMetrics: false,
    nullHandlingEnabled: false,
    optimizeDictionary: false,
    optimizeDictionaryForMetrics: false,
    noDictionarySizeRatioThreshold: 0.85,
  },
  metadata: {},
  quota: {},
  routing: {},
  query: {},
  ingestionConfig: {
    batchIngestionConfig: {
      segmentIngestionType: "APPEND",
      segmentIngestionFrequency: "DAILY",
      batchConfigMaps: [
        {
          "input.fs.className":
            "org.apache.pinot.plugin.filesystem.S3PinotFS",
          "input.fs.prop.region": "ap-northeast-1",
          "input.fs.prop.endPoint":
            "https://s3.ap-northeast-1.amazonaws.com",
          "input.fs.prop.accessKey": "****",
          "input.fs.prop.secretKey": "*****",
          outputDirURI: "s3://ses-email-receiving-bucket-testing/",
          inputDirURI: "s3://ses-email-receiving-bucket-testing/",
          includeFileNamePattern: "glob:**/*.csv",
          inputFormat: "csv",
        },
      ],
    },
  },
  task: {
    taskTypeConfigsMap: {
      SegmentGenerationAndPushTask: {
        inputDirURI: "s3://bucket-name/",
        outputDirURI: "s3://bucket-name/",
        inputFormat: "csv",
        schedule: "0 */1 * * * ?",
      },
      MergeRollupTask: {
        "1hour.mergeType": "rollup",
        "1hour.bucketTimePeriod": "1h",
        "1hour.bufferTimePeriod": "3h",
        "1day.mergeType": "rollup",
        "1day.bucketTimePeriod": "1d",
        "1day.bufferTimePeriod": "1d",
        "CDR_COUNT.aggregationType": "sum",
        "DURATION.aggregationType": "sum",
        "VOLUME.aggregationType": "sum",
      },
    },
  },
  metadata: {
    customConfigs: {},
  },
};
```
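The offline table config above declares input and output locations in two places: `ingestionConfig.batchIngestionConfig.batchConfigMaps` and `task.taskTypeConfigsMap.SegmentGenerationAndPushTask`. A quick way to see where they disagree is sketched below. The helper name `compareIngestionUris` is hypothetical, the sample object is trimmed to the relevant keys, and some differences may simply come from the bucket name being redacted in one place; which of the two locations the task actually uses is not established here.

```javascript
// Hypothetical check: list keys whose values differ between each batchConfigMaps
// entry and the SegmentGenerationAndPushTask task config.
function compareIngestionUris(tableConfig) {
  const batchMaps =
    tableConfig.ingestionConfig?.batchIngestionConfig?.batchConfigMaps ?? [];
  const taskConfig =
    tableConfig.task?.taskTypeConfigsMap?.SegmentGenerationAndPushTask ?? {};

  const mismatches = [];
  batchMaps.forEach((batch, index) => {
    for (const key of ["inputDirURI", "outputDirURI", "inputFormat"]) {
      const batchValue = batch[key];
      const taskValue = taskConfig[key];
      // Only report keys that are set in both places with different values.
      if (batchValue !== undefined && taskValue !== undefined && batchValue !== taskValue) {
        mismatches.push({ index, key, batchValue, taskValue });
      }
    }
  });
  return mismatches;
}

// Trimmed copy of the offline table config, keeping only the compared keys.
const offlineConfigExcerpt = {
  ingestionConfig: {
    batchIngestionConfig: {
      batchConfigMaps: [
        {
          outputDirURI: "s3://ses-email-receiving-bucket-testing/",
          inputDirURI: "s3://ses-email-receiving-bucket-testing/",
          inputFormat: "csv",
        },
      ],
    },
  },
  task: {
    taskTypeConfigsMap: {
      SegmentGenerationAndPushTask: {
        inputDirURI: "s3://bucket-name/",
        outputDirURI: "s3://bucket-name/",
        inputFormat: "csv",
      },
    },
  },
};

const uriMismatches = compareIngestionUris(offlineConfigExcerpt);
for (const m of uriMismatches) {
  console.log(`${m.key}: batchConfigMaps[${m.index}]=${m.batchValue} task=${m.taskValue}`);
}
```

With the values as posted, `inputDirURI` and `outputDirURI` differ between the two locations, while `inputFormat` matches.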
#### Controller Configuration (`controller.conf`)
```conf
# Pinot Role
pinot.service.role=CONTROLLER
# Pinot Cluster name
pinot.cluster.name=pinot-quickstart
# Pinot Zookeeper Server
pinot.zk.server=localhost:2181
# Use hostname as Pinot Instance ID
pinot.set.instance.id.to.hostname=true
# Pinot Controller Port
controller.port=9000
controller.zk.str=pinot-zookeeper:2181
controller.vip.host=127.0.0.1
controller.vip.port=9000
controller.task.scheduler.enabled=true
controller.local.temp.dir=/var/pinot/controller/data
# Deep storage configuration
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.disableAcl=false
pinot.controller.storage.factory.s3.region=ap-northeast-1
controller.data.dir=s3://bucket-name/
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.storage.factory.s3.accessKey=****
pinot.controller.storage.factory.s3.secretKey=****
```
#### Server Configuration (`server.conf`)
```conf
# Pinot Role
pinot.service.role=SERVER
# Pinot Cluster name
pinot.cluster.name=pinot-quickstart
# Pinot Zookeeper Server
pinot.zk.server=localhost:2181
pinot.set.instance.id.to.hostname=true
# Pinot Server Ports
pinot.server.netty.port=8098
pinot.server.adminapi.port=8097
# Data directories and deep storage
pinot.server.instance.dataDir=/tmp/pinot/data/server/index
pinot.server.instance.segmentTarDir=/tmp/pinot/data/server/segmentTar
pinot.server.segment.store.uri=s3://bucket-name/
pinot.server.storage.factory.s3.disableAcl=false
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=ap-northeast-1
pinot.server.segment.fetcher.protocols=file,http,s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.storage.factory.s3.accessKey=****
pinot.server.storage.factory.s3.secretKey=****
pinot.server.storage.factory.s3.endpoint=http://minio:9000
```
#### Minion Configuration (`minion.conf`)
```conf
pinot.set.instance.id.to.hostname=true
pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.minion.storage.factory.s3.region=us-east-1
pinot.minion.segment.fetcher.protocols=file,http,s3
pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
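Because the failure happens on the minion, it may help to compare the S3 filesystem settings each role receives. The sketch below parses the `.conf` text and diffs the `pinot.<role>.storage.factory.s3.*` keys of the minion against the server. The helper names are hypothetical and the sample strings are excerpts of the files above. It only reports differences; whether any of them explains the `FileNotFoundException` is not established.

```javascript
// Parse simple key=value properties, skipping blank lines and # comments.
function parseProperties(text) {
  const props = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue;
    props[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return props;
}

// Collect the pinot.<role>.storage.factory.s3.* settings, keyed by their suffix.
function s3FactorySettings(props, role) {
  const prefix = `pinot.${role}.storage.factory.s3.`;
  const settings = {};
  for (const [key, value] of Object.entries(props)) {
    if (key.startsWith(prefix)) settings[key.slice(prefix.length)] = value;
  }
  return settings;
}

// Report settings present in the reference but absent or different in the candidate.
function diffSettings(reference, candidate) {
  const missing = [];
  const different = [];
  for (const [key, value] of Object.entries(reference)) {
    if (!(key in candidate)) missing.push(key);
    else if (candidate[key] !== value) {
      different.push({ key, reference: value, candidate: candidate[key] });
    }
  }
  return { missing, different };
}

// Excerpts of server.conf and minion.conf from this issue.
const serverConf = `
pinot.server.storage.factory.s3.disableAcl=false
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=ap-northeast-1
pinot.server.storage.factory.s3.accessKey=****
pinot.server.storage.factory.s3.secretKey=****
pinot.server.storage.factory.s3.endpoint=http://minio:9000
`;
const minionConf = `
pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.minion.storage.factory.s3.region=us-east-1
`;

const s3Diff = diffSettings(
  s3FactorySettings(parseProperties(serverConf), "server"),
  s3FactorySettings(parseProperties(minionConf), "minion")
);
console.log("missing on minion:", s3Diff.missing);
console.log("different on minion:", s3Diff.different);
```

For the files as posted, the minion has no `disableAcl`, `accessKey`, `secretKey`, or `endpoint` setting, and its `region` differs from the server's.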
Please update the issue with any further observations or logs, and feel free
to add details on your environment or any steps already taken to resolve the
problem.