#general
@harold: I have some high-level questions. I'm looking at the architectural page in the docs:
@npawar:
1. Yes.
2. It will use the Controller's local disk as the segment store.
3. Loading a segment entails downloading it from the segment store onto the server's local disk and then memory-mapping it.
4. Here the local disk will be the controller's disk.
The completed segment is persisted to the segment store and also loaded by the same server that completed it (no download in that case, just a load into memory). If replication is configured, the other replicas download the segment from the segment store (unless they were also able to complete the exact same segment). By default, the segment just stays on these original servers. We have a table setting that moves completed segments to another set of servers, letting you separate the consuming and completed parts. We also have a table setting for tiered storage, which can move completed segments to another set of servers based on segment age.
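For reference, a rough sketch of what those two table settings can look like: `tagOverrideConfig` re-tags completed segments onto a different set of servers, and `tierConfigs` relocates segments by age. The values below are made up and field names may vary slightly by Pinot version, so treat this as illustrative rather than exact:
```
{
  "tableName": "myTable_REALTIME",
  "tagOverrideConfig": {
    "realtimeConsuming": "consumingTenant_REALTIME",
    "realtimeCompleted": "completedTenant_OFFLINE"
  },
  "tierConfigs": [{
    "name": "coldTier",
    "segmentSelectorType": "time",
    "segmentAge": "30d",
    "storageType": "pinot_server",
    "serverTag": "coldTenant_OFFLINE"
  }]
}
```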
@harold: Thanks for the reply. A few more questions: Are all segments in the segment store loaded into the servers at all times? If yes, does this mean that the total local disk (across all servers) must be enough to hold all the segments? Is there a doc that describes the table settings you mentioned above?
@npawar: yes and yes. Doc for moving completed segments:
@ashish: Thanks @npawar for the quick answers. A few follow-ups: 1) How is deep storage accessed? For example, when S3 is configured as deep storage, does Pinot use the S3 API or mount it as a file system? What about HDFS?
@ashish: 2) Are all segments always loaded in memory (based on partitioning, etc.) by the servers responsible for serving them? Or are they loaded lazily, only when needed as queries are served? If the latter, how do you determine which segments will be needed for a particular query (since the indices are kept in the segment itself)? Do you use the segment metadata in ZooKeeper to decide which segments are loaded into memory?
@fx19880617: @ashish, 1) it’s using native s3 api, no mounting from pinot
@fx19880617: 2) not all in memory, all segments are memory mapped.
@npawar: For (1), we have PinotFS implementations for each of these deep store options, if you're curious:
@npawar: Lazy loading is on the roadmap
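To make (1) a bit more concrete: each deep store backend (S3, HDFS, GCS, local disk, ...) implements the same `PinotFS` API, and the server pulls a segment down through it before loading. Below is a minimal sketch of that download step, using `LocalPinotFS` (also referenced in the ingestion specs later in this thread) as a stand-in; the URIs and paths are hypothetical, and exact API details may differ slightly across Pinot versions:
```
import java.io.File;
import java.net.URI;
import org.apache.pinot.spi.env.PinotConfiguration;
import org.apache.pinot.spi.filesystem.LocalPinotFS;
import org.apache.pinot.spi.filesystem.PinotFS;

public class SegmentDownloadSketch {
  public static void main(String[] args) throws Exception {
    // LocalPinotFS stands in for the S3/HDFS implementations; they share the same PinotFS API.
    PinotFS pinotFS = new LocalPinotFS();
    pinotFS.init(new PinotConfiguration());

    // Hypothetical segment location in the segment store, and a local landing path on the server.
    URI segmentUri = URI.create("file:///tmp/segment-store/myTable/myTable_0.tar.gz");
    File localCopy = new File("/tmp/server-data/myTable/myTable_0.tar.gz");

    // Copy from the (deep) segment store to the server's local disk before loading.
    pinotFS.copyToLocalFile(segmentUri, localCopy);
  }
}
```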
@ashish: I see - but PinotFS cannot be used for mmap, because mmap in Java requires a RandomAccessFile. Right?
@ashish: So that's why the segments need to be copied to the local filesystem for loading. Right?
@fx19880617: right, for the query-serving path, all segments are downloaded from PinotFS to the Pinot server's local disk
@fx19880617: deep store is used for backup purposes, not on the query path, right now. Lazy loading is on the roadmap
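That local copy is what actually gets memory-mapped. Here is a stripped-down sketch of the mapping step with plain JDK NIO (the segment file path is hypothetical); once mapped, the OS pages column data in and out on demand, which is why the whole segment doesn't need to sit in heap or RAM:
```
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
  public static void main(String[] args) throws Exception {
    // A segment's column/index buffers live in local files after the download + untar step.
    try (RandomAccessFile raf =
             new RandomAccessFile("/tmp/server-data/myTable/myTable_0/columns.psf", "r");
         FileChannel channel = raf.getChannel()) {
      // Map the file read-only; the kernel lazily pages data in as queries touch it.
      MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      System.out.println("Mapped " + buffer.capacity() + " bytes without reading them onto the heap");
    }
  }
}
```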
@ashish: Thanks @fx19880617 and @npawar. Could you please point to the design document for Lazy loading if available?
@fx19880617:
@minolino71: @minolino71 has joined the channel
#random
@rishi.jumani: @rishi.jumani has left the channel
@anshu.jalan: @anshu.jalan has left the channel
@minolino71: @minolino71 has joined the channel
#feat-text-search
@tariqahmed.farhan: @tariqahmed.farhan has joined the channel
#troubleshooting
@anshu.jalan: @anshu.jalan has left the channel
@kha.nguyen: Hi everyone, I'm trying to batch-import some data into a Pinot offline table and running into some issues. My Pinot version is 0.7.0, running in a Docker container. I have successfully added an `offline_table_config.json` and a `schema.json` file to Pinot; however, creating a segment doesn't appear to be working. A `SEGMENT-NAME.tar.gz` file isn't being created. My current docker-job-spec.yml looks like this:
```
# docker-job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-manual-test/rawdata/100k'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-manual-test/segments/100k'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'rows_100k'
  schemaURI: '
```
@npawar: this looks like a mismatch in dataTypes between the Pinot schema and the actual data
@npawar: can you share the Pinot schema and some sample rows?
@ken: Isn’t 5842432235322161941 too big for an int type? I think your schema would need to use a long.
@fx19880617: yes, it's a long, not an int; glad it's still smaller than Long.MAX_VALUE
@fx19880617: otherwise maybe only double
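Just to spell out the ranges involved (plain JDK, nothing Pinot-specific): the id overflows a 32-bit int but fits comfortably in a 64-bit long.
```
public class RangeCheck {
  public static void main(String[] args) {
    long id = 5842432235322161941L;
    System.out.println(Integer.MAX_VALUE);   // 2147483647 -> far too small for this id
    System.out.println(Long.MAX_VALUE);      // 9223372036854775807 -> still has headroom
    System.out.println(id < Long.MAX_VALUE); // true
  }
}
```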
@kha.nguyen: Yes, that was the issue. However, this brings up another issue with my DateTime column. I have a date value in milliseconds since epoch and Pinot doesn't seem to be able to read it. An example date in my CSV is ```1096429682806```. The schema for the date is:
```
# schema.json
"dateTimeFieldSpecs": [{
  "name": "date",
  "dataType": "LONG",
  "format": "1:MILLISECONDS:EPOCH",
  "granularity": "1:MILLISECONDS"
}]
```
The table config is:
```
"segmentsConfig": {
  "timeColumnName": "date",
  "timeType": "MILLISECONDS",
  "segmentPushType": "APPEND",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "schemaName": "row1",
  "replication": "1"
},
```
Error in the image attached:
@fx19880617: from the value, it’s secondsSinceEpoch
@fx19880617: 1612471718
@fx19880617:
@fx19880617: `1096429682806` <- this value is in the year 2004?
@kha.nguyen: yes, that value for 2004 is in my CSV
@fx19880617: I don't see any problem with this so far, but it seems the time column is being treated as seconds instead of milliseconds. @npawar anything else to check?
@npawar: `1096429682806` is the value for 2004, right? The error says Pinot found `1612471718`, which is in 1970 when read as milliseconds
@npawar: is that value expected? 1612471718
@kha.nguyen: from what I know, I don't have any values that match with 1612471718
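A quick way to see what a unit mismatch does to these two numbers (plain JDK `java.time`, just for illustration): read as milliseconds vs. seconds, they land in very different years, which lines up with the 1970 value in the error.
```
import java.time.Instant;

public class EpochCheck {
  public static void main(String[] args) {
    System.out.println(Instant.ofEpochMilli(1096429682806L)); // -> a timestamp in 2004 (the CSV value, as millis)
    System.out.println(Instant.ofEpochMilli(1612471718L));    // -> January 1970, i.e. out of the expected range
    System.out.println(Instant.ofEpochSecond(1612471718L));   // -> early 2021 if the same number is read as seconds
  }
}
```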
@npawar: is it possible to share your input file with us? we can try to reproduce
@fx19880617: maybe 10 rows and your table config/schema
@kha.nguyen: Here is my direct CSV
@npawar: and your entire schema/table config too
@kha.nguyen:
@npawar: not able to reproduce with your table config and schema. I can generate the segment just fine.
@npawar: only possible issue i see in your configs is this
```
tableName: 'rows_100k'
schemaURI: '
```
@npawar: the schema is row1, but this says rows_100k
@npawar: could it be referring to some old schema
@kha.nguyen: should the `rows_100k` that's not the tableName be referencing the schema?
@npawar: ah i see
@kha.nguyen: i'm not entirely sure, the documentation for the batch import example uses the same value for `tableName` and `schemaName`
@fx19880617: It's using ```schemaURI: '
@fx19880617: can you check what's the response for ```schemaURI: '
@kha.nguyen: Receiving the same error message as above
@kha.nguyen: @npawar I can confirm that changing `rows_100k` to `row1` breaks it further
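One way to double-check which schema each name actually resolves to is to hit the controller's REST API directly (endpoints from the standard controller API; the host/port below are assumed from the job spec and quickstart setup, so adjust to your environment):
```
# What does the controller return for each schema name?
curl http://pinot-controller-test:9000/schemas/row1
curl http://pinot-controller-test:9000/schemas/rows_100k

# And which schema is the offline table actually bound to?
curl http://pinot-controller-test:9000/tables/rows_100k/schema
```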
@fx19880617: @npawar can you share the schema and table config
@fx19880617: and Kha can use it to create the table
@npawar: i just used what he shared, only thing diff is the batch-job-spec
@npawar:
@fx19880617: which pinot image are you using? is it `apachepinot/pinot:latest`
@kha.nguyen: Yes, I'm using the latest version of Pinot, 0.7.0
@fx19880617: @npawar I think we need to add timeType into table config?
@fx19880617: ```
{
  "tableName": "foo",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "date",
    "timeType": "MILLISECONDS",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "HEAP",
    "invertedIndexColumns": ["id", "hash_one"]
  },
  "metadata": {
    "customConfigs": {}
  }
}
```
@fx19880617: also do you have logs for batch ingestion job? @kha.nguyen?
@kha.nguyen: where would the logs be found in the docker instance
@npawar: so strange, i’m also able to take your exact configs, including the yml, and upload
@npawar: i’m on the latest master, that could be the only difference
@fx19880617: `docker logs <docker-container-id>`
@kha.nguyen: To clarify, Neha, did you take my files, run them, and were you able to successfully upload?
@kha.nguyen: As of now, I'm going to try to remove the time column and replace it with another time format. I will come back to this on Monday. Thank you guys so much for your help @npawar @fx19880617
@fx19880617: sure, please let us know
@npawar: yes, took exactly your files. I’m not running on docker so i changed the `s/pinot-controller-test/localhost`, and was able to upload
@npawar: lets look at logs on Monday
@fx19880617: I've tried with the docker setup and there is no issue on my side. Here are my steps:
1. Start Pinot quickstart with docker
```
docker run \
    --network=pinot-demo \
    --name pinot-quickstart \
    -p 9000:9000 \
    -d apachepinot/pinot:latest QuickStart \
    -type batch
```
2. Create the table
```
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/foo-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/foo-table-offline.json \
    -controllerHost pinot-quickstart \
    -controllerPort 9000 -exec
```
3. Start the ingestion job
```
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec-100k.yml
```
@fx19880617: I put the table conf and schema in my local directory and mounted them into docker:
@npawar: i suspect there’s some stray data in the input folder for you Kha
@fx19880617: and this is the updated docker-job-spec file:
```
➜ cat /tmp/pinot-quick-start/docker-job-spec-100k.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-manual-test/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'foo'
  schemaURI: '
```
@fx19880617: Kha: I feel you can delete the table and corresponding schema and retry
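If it helps, the delete-and-retry can also be done through the controller REST API; a hedged example (table/schema names taken from the earlier configs, host/port assumed from the quickstart port mapping):
```
# Drop the offline table first, then its schema, before re-adding both and re-running the job
curl -X DELETE "http://localhost:9000/tables/rows_100k?type=offline"
curl -X DELETE "http://localhost:9000/schemas/row1"
```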
@minolino71: @minolino71 has joined the channel
#getting-started
@harold: @harold has joined the channel
