#general


@nikhil.varma: Hi all, I have set up a Pinot cluster with MinIO S3 as deep storage, but it is not uploading any segments to MinIO. Please help if anyone has worked on deep storage with S3.
  @mayanks: ok
  @nikhil.varma: Hi Mayank
  @mayanks:
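  A minimal sketch of the controller-side settings the Pinot docs describe for S3-compatible deep storage such as MinIO (the bucket, endpoint, region, and temp dir below are placeholders, not values from this thread; servers need the equivalent `pinot.server.*` settings as well): ```
# controller.conf - placeholder values, adjust for your MinIO setup
controller.data.dir=s3://my-bucket/pinot-segments
controller.local.temp.dir=/tmp/pinot-controller-tmp
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.storage.factory.s3.endpoint=http://minio:9000
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```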
@diogo.baeder: Hey folks, something I learned about Pinot batch ingestion today: it can be a bit picky with the input configuration. For example: ```
inputDirURI: '/foo/bar'
includeFileNamePattern: 'glob:baz/**/*.json'
``` doesn't work if you want to ingest from JSON files inside `/foo/bar/baz`. Instead, this should be used: ```
inputDirURI: '/foo/bar/baz'
includeFileNamePattern: 'glob:**/*.json'
``` Notice how `inputDirURI` goes to the deepest possible fixed subdirectory, and then the pattern starts from there.
  @kharekartik: Hi, `**/baz/*.json` should work for you. You can use either a `glob` or `regex` matcher here. You can test out the patterns here - For example -
  @ken: Hi @kharekartik - thanks for sharing that very useful site! I’d written a little tool to try glob patterns, but this is much easier and more powerful.
  @diogo.baeder: Thank you, guys! :slightly_smiling_face:
@prasin.ig: @prasin.ig has joined the channel

#random


@prasin.ig: @prasin.ig has joined the channel

#troubleshooting


@prasin.ig: @prasin.ig has joined the channel
@diogo.baeder: Hey folks, I noticed something strange while testing batch ingestion: apart from the normal segments I expect to be created in the tables I'm using, I end up with extra segments whose name patterns seem to use the batch run timestamp. And if I run the ingestion again on the same input files, instead of the segments being kept as they were (because there's no new file to ingest), I end up with more of those strangely named segments. Is this expected? The row counts don't change, and neither does the disk size taken by each table; it's really just the number of segments that increases somehow.
  @diogo.baeder: My table uses `DATE` columns, and each input file has data for one day, so I end up with expected segments containing these dates as part of them; but the unexpected segments use millisecond timestamps as part of their names. Not sure why.
  @mayanks: Do you have any minion job setup?
  @diogo.baeder: Not currently, no. This happens every time I run the ingestion job - if I just leave the tables be, the segments don't change.
  @mayanks: What ingestion mechanism are you using and what’s the config look like?
  @diogo.baeder: I'm using the admin command to ingest the files, and here's an example job I have: ```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/sensitive-data/outputs/weights'
includeFileNamePattern: 'glob:**/*.json'
outputDirURI: '/tmp/data/segments/weights'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'weights'
  schemaURI: ''
pinotClusterSpecs:
  - controllerURI: ''
```
  @diogo.baeder: Some of the files I have as part of the input have no relevant data (they have JSON content, but just an empty list), so I wonder if this is what's causing that issue
  @mayanks: Try count(*) group by $segmentName to see how many records in each segment, specifically the ones you don’t expect
  @diogo.baeder: Should I use a literal `$segmentName`? I never did a query like that...
  @mayanks: Yes
  @diogo.baeder: Ah, cool! Nice to know that exists :slightly_smiling_face:
  @mayanks: You can also filter on segment name using that to check specific segment
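  A sketch of the two queries being suggested, assuming the `weights` table from the job spec above (`$segmentName` is Pinot's built-in virtual column; the segment name in the filter is just a placeholder): ```
-- rows per segment; segments with zero rows simply won't show up here
SELECT $segmentName, COUNT(*) AS numRows
FROM weights
GROUP BY $segmentName
ORDER BY COUNT(*)
LIMIT 100

-- or check one specific segment (placeholder name)
SELECT COUNT(*)
FROM weights
WHERE $segmentName = 'weights_1649999999999_1649999999999_0'
```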
  @diogo.baeder: Very interesting, and the result is what I expect: only the segments with expected names have rows in them, and after I re-run the ingestion the select result is the same - only rows in the expected segments. So I guess my hypothesis is correct: it seems like those segments are being created without data, referring to files that have "no data" (just an empty list)
  @diogo.baeder: Yeah, that's it: I checked the metadata for the segments, and the unexpected ones are referring to input files with empty lists.
  @diogo.baeder: Is there a way to make the ingestion bypass such files instead of creating empty segments?
  @mayanks: Why do you have empty files? I think empty segment pushing was used as a workaround to advance the time boundary, if I am not wrong
  @diogo.baeder: Ah, got it... alright, that's fine too, I can just not create those empty files in the first place - just wanted to know if there could be a way to avoid ending up with those as segments, but of course it makes more sense to not have such files in the first place.

#getting-started


@harish.bohara: I have a table where eventTime is sent in Kafka as a Unix epoch. What setting do I need so the data is displayed and queried in IST?
  @navina: @harish.bohara If you want to display the date in the query results as a date-time string, you can find an example here
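  A sketch of what that can look like at query time, assuming `eventTime` is stored as epoch milliseconds and the table is named `events` (both assumptions, not details from this thread): ```
-- render the epoch-millis eventTime as an IST (Asia/Kolkata) date-time string
SELECT ToDateTime(eventTime, 'yyyy-MM-dd HH:mm:ss', 'Asia/Kolkata') AS eventTimeIST,
       eventTime
FROM events
ORDER BY eventTime DESC
LIMIT 10
```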
@prasin.ig: @prasin.ig has joined the channel