pengding-stripe commented on code in PR #14443:
URL: https://github.com/apache/pinot/pull/14443#discussion_r1842602704
##########
pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-common/src/main/java/org/apache/pinot/plugin/ingestion/batch/common/SegmentGenerationTaskRunner.java:
##########
@@ -167,19 +167,25 @@ private SegmentNameGenerator getSegmentNameGenerator(SegmentGeneratorConfig segm
         return new InputFileSegmentNameGenerator(segmentNameGeneratorConfigs.get(FILE_PATH_PATTERN),
             segmentNameGeneratorConfigs.get(SEGMENT_NAME_TEMPLATE), inputFileUri, appendUUIDToSegmentName);
       case BatchConfigProperties.SegmentNameGeneratorType.UPLOADED_REALTIME:
-        Preconditions.checkState(segmentGeneratorConfig.getCreationTime() != null,
-            "Creation time must be set for uploaded realtime segment name generator");
-        Preconditions.checkState(segmentGeneratorConfig.getUploadedSegmentPartitionId() != -1,
+        Preconditions.checkState(segmentNameGeneratorConfigs.get(BatchConfigProperties.SEGMENT_UPLOAD_TIME_MS) != null,
+            "Upload time must be set for uploaded realtime segment name generator");
+        Preconditions.checkState(segmentNameGeneratorConfigs.get(BatchConfigProperties.SEGMENT_PARTITION_ID) != null,
Review Comment:
Users need to partition the data themselves. We have a Spark job that
replicates how stream ingestion partitions data and writes each partition's
data to a path containing the partition id, e.g.
`s3://.../partition=0/partitioned-data.parquet`.
The segment creation task can then get the partition id from the path, so
there will be one Spark task per partition.
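
To illustrate the "get partition id from the path" step, here is a minimal
sketch. It is not code from this PR: the class name `PartitionIdFromPath`,
the method `extractPartitionId`, and the regex are hypothetical, and it
assumes the path contains a `partition=<digits>` segment as in the example
above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pinot: pulls the partition id out of an
// input file path that contains a "partition=<digits>" segment.
final class PartitionIdFromPath {
  // Matches a whole path segment of the form "partition=<digits>".
  private static final Pattern PARTITION_SEGMENT =
      Pattern.compile("(?:^|/)partition=(\\d+)(?:/|$)");

  private PartitionIdFromPath() {
  }

  // Returns the partition id found in the path, or throws if there is none.
  static int extractPartitionId(String inputFileUri) {
    Matcher matcher = PARTITION_SEGMENT.matcher(inputFileUri);
    if (!matcher.find()) {
      throw new IllegalArgumentException(
          "No partition=<id> segment in path: " + inputFileUri);
    }
    return Integer.parseInt(matcher.group(1));
  }
}
```

The returned id could then be supplied as the value that the
`SEGMENT_PARTITION_ID` check above expects; how that value is wired into the
job spec is not shown here.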
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]