#general


@amommendes: @amommendes has joined the channel

#random


@amommendes: @amommendes has joined the channel

#troubleshooting


@mohammedgalalen056: Hi, I'm getting this error when trying to do batch ingestion from the local file system: `Failed to generate Pinot segment for file - file:data/orders.csv` `java.lang.NumberFormatException: For input string: "2019-05-02 17:49:53"`. Here are the `dateTimeFieldSpecs` in the schema file: ```"dateTimeFieldSpecs": [ { "dataType": "STRING", "name": "start_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }, { "dataType": "STRING", "name": "end_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }, { "dataType": "STRING", "name": "created_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }, { "dataType": "STRING", "name": "updated_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" } ]```
  @ken: What’s the full schema? Looks like you’ve got a numeric (metrics or dimensions) field, but the data in your input file is a date.
  @mohammedgalalen056: ```{ "schemaName": "orders", "metricFieldSpecs": [ { "dataType": "DOUBLE", "name": "total" }, { "dataType": "FLOAT", "name": "percentage" } ], "dimensionFieldSpecs": [ { "dataType": "INT", "name": "id" }, { "dataType": "STRING", "name": "user_id" }, { "dataType": "STRING", "name": "worker_id" }, { "dataType": "INT", "name": "job_id" }, { "dataType": "DOUBLE", "name": "lat" }, { "dataType": "DOUBLE", "name": "lng" }, { "dataType": "INT", "name": "work_place" }, { "dataType": "STRING", "name": "note" }, { "dataType": "STRING", "name": "address" }, { "dataType": "STRING", "name": "canceled_by" }, { "dataType": "INT", "name": "status" }, { "dataType": "STRING", "name": "canceled_message" } ], "dateTimeFieldSpecs": [ { "dataType": "STRING", "name": "start_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }, { "dataType": "STRING", "name": "end_date", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }, { "dataType": "STRING", "name": "created_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }, { "dataType": "STRING", "name": "updated_at", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" } ] }```
  @ken: I’d take a few rows of your input data and dump them into Excel, to confirm the order/number of columns matches what you’ve defined in your schema.
  @mohammedgalalen056: I've fixed the error; the raw data was corrupted.
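  For anyone who hits the same stack trace: the `NumberFormatException` means the segment generator tried to parse the string `2019-05-02 17:49:53` as a number, which happens when a date value ends up under a column declared numeric (for example the `DOUBLE` `total` or `INT` `id` above). A made-up illustration of the kind of corrupted/shifted row that triggers it (column order and values are invented, not taken from the real file):
  ```
  id,total,created_at
  1,12.5,2019-05-02 17:49:53        <- parses fine: "total" holds a number
  2,2019-05-02 17:49:53,12.5        <- shifted row: the timestamp lands under "total" and fails to parse
  ```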
@fabricio.dutra87: Hi all, I'm trying to ingest data from Kafka using a topic that doesn't have a datetime column, and I'm receiving this error: ```{"code":400,"error":"Schema should not be null for REALTIME table"}``` I'm using this spec: ```curl -X POST "" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"tableName\": \"realtime_strimzi_dev_acks\", \"tableType\": \"REALTIME\", \"segmentsConfig\": { \"segmentPushType\": \"REFRESH\", \"schemaName\": \"sch_strimzi_acks\", \"replication\": \"1\", \"replicasPerPartition\": \"1\" }, \"tenants\": {}, \"tableIndexConfig\": { \"loadMode\": \"MMAP\", \"invertedIndexColumns\": [ \"column1\" ], \"streamConfigs\": { \"streamType\": \"kafka\", \"stream.kafka.consumer.type\": \"lowlevel\", \"stream.kafka.topic.name\": \"producer-test-strimzi-dev-acks-0\", \"stream.kafka.decoder.class.name\": \"org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder\", \"stream.kafka.consumer.factory.class.name\": \"org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory\", \"stream.kafka.broker.list\": \"edh-kafka-brokers.ingestion.svc.Cluster.local:9092\", \"realtime.segment.flush.threshold.time\": \"3600000\", \"realtime.segment.flush.threshold.size\": \"50000\", \"stream.kafka.consumer.prop.auto.offset.reset\": \"smallest\" } }, \"metadata\": { \"customConfigs\": {} }}"``` Is there a way to create a realtime table that auto-fills/creates a datetime column?
  @g.kishore: did you upload the schema first?
  @fabricio.dutra87: yes, but I had the same error message
  @npawar: Can you paste the schema here?
  @fabricio.dutra87: I'm not including a timeFieldSpec since I don't have one in my Kafka topic, so it would be nice if there were a way to auto-fill a datetime column in Pinot. Here's the spec: ```{ "schemaName": "sch_strimzi_ack", "dimensionFieldSpecs": [ { "name": "column1", "dataType": "STRING" } ] }```
  @chinmay.cerebro: Auto-creating a timestamp column is not supported as of now. Do you have any column in Kafka that we can derive a timestamp from?
  @g.kishore: You can probably use the now() UDF
  @fabricio.dutra87: hmm ok. We'll try the workaround of including a datetime column in that topic then. Thanks guys!!
  @npawar: also, it's failing in the first place because the schema name doesn't match what you've put in the table config
  @npawar: ```sch_strimzi_ack``` vs ```sch_strimzi_acks``` (plural)
  @npawar: hence the schema not found exception
  @npawar: we can make that exception clearer. Do you mind creating an issue on GitHub?
  @fabricio.dutra87: thanks Neha, the error was clearer when I fixed the name: ```{"code":400,"error":"'timeColumnName' cannot be null in REALTIME table config"}```
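  For reference, a rough sketch of how the two suggestions could be combined: make the schema name match, add a datetime column to the schema, point `timeColumnName` at it, and fill it at ingestion time with `now()` as g.kishore suggested. This assumes a Pinot version where `transformConfigs` under `ingestionConfig` and the `now()` scalar function are available; the `ingestion_ts` column name is just an example. Schema:
  ```
  {
    "schemaName": "sch_strimzi_acks",
    "dimensionFieldSpecs": [
      { "name": "column1", "dataType": "STRING" }
    ],
    "dateTimeFieldSpecs": [
      { "name": "ingestion_ts", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
    ]
  }
  ```
  and the corresponding pieces of the table config (other sections omitted):
  ```
  {
    "segmentsConfig": {
      "timeColumnName": "ingestion_ts",
      "schemaName": "sch_strimzi_acks",
      "replication": "1",
      "replicasPerPartition": "1"
    },
    "ingestionConfig": {
      "transformConfigs": [
        { "columnName": "ingestion_ts", "transformFunction": "now()" }
      ]
    }
  }
  ```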
@falexvr: Hey guys, for some reason every query I send to Pinot returns at most 10 records; it only brings back more than 10 if I specify a limit. Is there something I have to do to get the full set of records?
  @g.kishore: yes default limit is 10
  @g.kishore: you can specify limit 1000 to get more records
  @g.kishore: or 10000
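  In other words, the 10-row cap is just the implicit default LIMIT, not a truncation of the data. A quick illustration (the table name is a placeholder):
  ```
  -- returns at most 10 rows (implicit default limit)
  SELECT * FROM orders

  -- returns up to 1000 rows
  SELECT * FROM orders LIMIT 1000
  ```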
@amommendes: @amommendes has joined the channel

#aggregators


@ita.pai: @ita.pai has joined the channel
@ita.pai: @ita.pai has left the channel

#pinot-dev


@ken: Currently `DistinctCountHLL` only works for single value fields. It seems like a simple change in `DistinctCountHLLAggregationFunction.aggregate()` to check if the `BlockValSet` is multi-valued, and if so then call `BlockValSet.getXXXMV()` and do a sub-iteration on the secondary array it returns. Does that make sense?
  @g.kishore: Surprised that it’s not supported as of now
  @ken: If you run this query on a MVF, you get: ``` "message": "QueryExecutionError:\njava.lang.UnsupportedOperationException\n\tat org.apache.pinot.core.segment.index.readers.ForwardIndexReader.readDictIds(ForwardIndexReader.java:84)\n\tat org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readStringValues(DataFetcher.java:439)\n\tat org.apache.pinot.core.common.DataFetcher.fetchStringValues(DataFetcher.java:146)\n\tat org.apache.pinot.core.common.DataBlockCache.getStringValuesForSVColumn(DataBlockCache.java:194)\n\tat org.apache.pinot.core.operator.docvalsets.ProjectionBlockValSet.getStringValuesSV(ProjectionBlockValSet.java:94)\n\tat org.apache.pinot.core.query.aggregation.function.DistinctCountHLLAggregationFunction.aggregate(DistinctCountHLLAggregationFunction.java:103)\n\tat org.apache.pinot.core.query.aggregation.DefaultAggregationExecutor.aggregate(DefaultAggregationExecutor.java:47)\n\tat org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:66)\n\tat org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:35)\n\tat org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator$1.runJob(BaseCombineOperator.java:94)\n\tat org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)"```
  @ken: I’ll file an issue and generate a PR
  @mayanks: @ken can you try `distinctCountHLLMV`?
  @mayanks: Aggregation functions on MV columns have an `MV` suffix in the name.
  @ken: @mayanks Thanks for clarifying, I was confused by seeing `aggregate`, `aggregateGroupBySV`, and `aggregateGroupByMV`. That made me think there was a missing `aggregateMV` function. I see now that the `BySV` and `ByMV` methods are for doing aggregations when the grouping column is SV vs. MV.
  @mayanks: :+1:
  @ken: @mayanks But why does there need to be a different function? In the implementations the function signatures are the same, and (I assume) the `BlockValSet` could be used to determine whether to handle it as an SV or an MV column.
  @mayanks: Yeah, in future, we might merge the two.
  @ken: OK, I’ll change my issue description :slightly_smiling_face:
  @mayanks: sounds good
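  To summarize the resolution for anyone who lands on the same `UnsupportedOperationException`: use the MV-suffixed aggregation on multi-value columns. A small sketch (table and column names are made up):
  ```
  -- single-value column
  SELECT DISTINCTCOUNTHLL(userId) FROM events

  -- multi-value column
  SELECT DISTINCTCOUNTHLLMV(tags) FROM events
  ```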

#community


@amommendes: @amommendes has joined the channel

#announcements


@amommendes: @amommendes has joined the channel

#getting-started


@phuchdh: @phuchdh has joined the channel