Hello,
I've become interested in Apache Druid and decided to study it. As an exercise,
I'm copying data from one data source to another, using the ingestSegment
firehose for this purpose. In the parser description I add a flattenSpec (the
parser type is string, the format is json).
Here is the configuration:
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "cp37-data8",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "__time",
            "format" : "auto"
          },
          "flattenSpec" : {
            "useFieldDiscovery" : true,
            "fields" : [
              {
                "type" : "jq",
                "name" : "resourceItemStatusDetails_updateDateTime",
                "expr" : ".fullDocument_data | fromjson.resourceItemStatusDetails.updateDateTime.\"$date\""
              }
            ]
          },
          "dimensionsSpec" : {
            "dimensions" : [
              "operationType",
              "databaseName",
              "collectionName",
              "fullDocument_id",
              "fullDocument_docId",
              "resourceItemStatusDetails_updateDateTime",
              {
                "type" : "long",
                "name" : "clusterTime"
              }
            ],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE"
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "ingestSegment",
        "dataSource" : "cp-all-buffer",
        "interval" : "2018-01-01/2020-01-03"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "maxRowsPerSegment" : 100000,
      "maxRowsInMemory" : 1000
    }
  }
}
The task itself completes successfully, but the settings I defined in the
parser are ignored during execution. I took a look at the Druid source code,
and it seems I have found a bug.
If you look at the IngestSegmentFirehoseFactory class, you'll see that only
the TransformSpec (extracted from the parser) is passed to the
IngestSegmentFirehose constructor, not the parser itself:
final TransformSpec transformSpec = TransformSpec.fromInputRowParser(inputRowParser);
return new IngestSegmentFirehose(adapters, transformSpec, dims, metricsList, dimFilter);
Next, in IngestSegmentFirehose, a transformer is created and applied to each row:
final InputRow inputRow = rowYielder.get();
rowYielder = rowYielder.next(null);
return transformer.transform(inputRow);
At this stage, the call to the parser's parse method has already been lost,
which explains why the parser settings were ignored in my example.
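To make the issue concrete, here is a minimal sketch of the difference between the two pipelines. The types and field names are hypothetical stand-ins for illustration, not Druid's real classes:

```java
import java.util.HashMap;
import java.util.Map;

public class FirehoseSketch {
    // Stand-in for a parser whose parse() would apply the flattenSpec,
    // producing the flattened field from the raw row.
    static Map<String, Object> parse(Map<String, Object> raw) {
        Map<String, Object> row = new HashMap<>(raw);
        row.put("resourceItemStatusDetails_updateDateTime", "2019-06-01T00:00:00Z");
        return row;
    }

    // Stand-in for Transformer.transform(): a row-level transform only,
    // with no flattening involved.
    static Map<String, Object> transform(Map<String, Object> row) {
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> segmentRow = Map.of("fullDocument_data", "{...}");

        // What IngestSegmentFirehose actually does: transform only.
        Map<String, Object> actual = transform(segmentRow);
        // What the ingestion spec implies: parse (flatten) first, then transform.
        Map<String, Object> expected = transform(parse(segmentRow));

        System.out.println(actual.containsKey("resourceItemStatusDetails_updateDateTime"));   // false
        System.out.println(expected.containsKey("resourceItemStatusDetails_updateDateTime")); // true
    }
}
```

In the transform-only path the flattened dimension never appears, which matches the behaviour I observed.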
This raises the question: why not pass the parser itself to the
IngestSegmentFirehose constructor? If you look at the implementation of
TransformSpec.fromInputRowParser, it either extracts the TransformSpec from a
decorating parser or throws an error, and in those decorating parsers the
parse methods already apply the transformer in addition to parsing:
parser.parseBatch(row).stream().map(transformer::transform).collect(Collectors.toList());
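The decorator behaviour described above can be sketched as follows. Again, these are simplified, hypothetical stand-in types rather than Druid's actual classes; the point is only that parseBatch() on the decorated parser performs both steps:

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class ParserSketch {
    interface RowParser { List<String> parseBatch(String raw); }

    // Inner parser: stand-in for the string parser carrying the flattenSpec.
    static RowParser inner = raw -> List.of("parsed(" + raw + ")");

    // Stand-in for the transformer taken from the transformSpec.
    static UnaryOperator<String> transformer = row -> "transformed(" + row + ")";

    // Decorator equivalent to the quoted parseBatch() implementation:
    // parse first, then apply the transformer to every resulting row.
    static RowParser decorated = raw ->
        inner.parseBatch(raw).stream().map(transformer).collect(Collectors.toList());

    public static void main(String[] args) {
        System.out.println(decorated.parseBatch("row")); // [transformed(parsed(row))]
    }
}
```

So passing the decorated parser to the firehose would seem to preserve both the parse (flatten) step and the transform step.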
Could anyone please clarify whether this is intentional behaviour or a bug? Thanks!