#general


@laila.sabar098: @laila.sabar098 has joined the channel
@francois: Hi. Is there any way from the REST API to retrieve information for monitoring, like the number of messages read by the consumer / the number of messages indexed? The goal of my question is to monitor ingestion and ensure we are not missing messages. I've found messages like that in pinot-all.log, but I want them from the API if possible. Any recommended way?
  @kharekartik: @navi.trinity
  @npawar: there’s no API as of now, but you could just monitor the metrics emitted ```REALTIME_ROWS_CONSUMED("rows", true), INVALID_REALTIME_ROWS_DROPPED("rows", false), REALTIME_CONSUMPTION_EXCEPTIONS("exceptions", true), REALTIME_OFFSET_COMMITS("commits", true), REALTIME_OFFSET_COMMIT_EXCEPTIONS("exceptions", false), REALTIME_PARTITION_MISMATCH("mismatch", false), ROWS_WITH_ERRORS("rows", false),``` If you want to see the same metrics that you’re seeing in the logs (which are just for the scope of the consuming segment, not overall) that should be easy to add to an existing /consumingSegmentsInfo API. Do you mind filing a GH issue?
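A minimal sketch, assuming a reachable controller, of polling the existing /consumingSegmentsInfo controller endpoint over HTTP (the controller address and table name below are placeholders; the per-consumer row counts discussed above are not part of its payload today):
```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConsumingSegmentsInfoCheck {
  public static void main(String[] args) throws Exception {
    // Placeholders: adjust the controller address and table name for your cluster.
    String controller = "http://localhost:9000";
    String table = "myTable";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(controller + "/tables/" + table + "/consumingSegmentsInfo"))
        .GET()
        .build();

    // Prints the JSON describing the currently consuming segments (partitions, offsets, ...).
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```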
@ghanta.vardhan: Hey guys, I am trying to establish a JDBC connection to execute queries on a Pinot cluster. The Pinot cluster is deployed in a production environment and I am connecting from local (port-forwarded Pinot controller) to test the JDBC feature. I think that while executing the query, the controller is resolving the broker by its name rather than its IP, and hence I am getting an UnknownHostException. ```Caused by: org.apache.pinot.client.PinotClientException: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.net.UnknownHostException: pinot-broker-0.pinot-broker-headless.xxxxx-v2.svc.cluster.local: nodename nor servname provided, or not known at org.apache.pinot.client.JsonAsyncHttpPinotClientTransport.executeQuery(JsonAsyncHttpPinotClientTransport.java:104) at org.apache.pinot.client.Connection.execute(Connection.java:127) at org.apache.pinot.client.Connection.execute(Connection.java:96) at org.apache.pinot.client.PinotStatement.executeQuery(PinotStatement.java:63) ... 1 more``` Is there a way I can avoid this error, because the same might happen when I move to production (the application is in a different k8s cluster)? TIA
  @kharekartik: Hi, currently the broker hostname needs to be resolvable from the machine on which the client is running
  @mayanks: Yes, and please query the broker directly in production. The controller endpoint is only for the query console, and it calls the broker API internally anyway
  @kharekartik: The question here is regarding the JDBC driver. It fetches the broker list for the provided tenant from the controller. The queries are sent to brokers only. However, it can cause issues if the broker hostname:port is not resolvable from the client machine. @xiangfu0 Is there a solution for such cases?
  @mayanks: The brokers should be behind an LB, and the driver can just be given that instead of fetching from the controller, right? I think it was implemented this way due to the absence of an LB.
  @xiangfu0: So far there is no such option; one workaround is to init the JDBC connection with just the broker LB name, so no hostname resolution is required.
  @mayanks: The pinot-java-client does have the broker list config. If it is missing from the JDBC client we should add it.
  @xiangfu0: right
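As a possible workaround until the JDBC driver accepts a broker list, a minimal sketch using the pinot-java-client pointed straight at a broker address that is resolvable from the client (the LB hostname and table name below are placeholders):
```
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class BrokerListQueryExample {
  public static void main(String[] args) {
    // Connect straight to a broker (or a broker LB) instead of discovering broker
    // pod hostnames via the controller, so no in-cluster DNS resolution is needed.
    Connection connection = ConnectionFactory.fromHostList("pinot-broker-lb.example.com:8099");
    ResultSetGroup results = connection.execute("SELECT COUNT(*) FROM myTable");
    System.out.println(results.getResultSet(0).getLong(0));
    connection.close();
  }
}
```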
@jinal.panchal: @jinal.panchal has joined the channel
@aswini.nellimarla: Hi, can Apache Pinot directly talk to datastores like Cassandra/Cosmos NoSQL DB stores?
  @francois: What do you mean by “talk” ? Joining ? ingesting ?
  @mayanks: I think you mean pull data from these data stores directly? If so, not at the moment.
  @aswini.nellimarla: @francois yes, can Pinot pull and ingest data from/to any NoSQL DBs like Cassandra?
  @aswini.nellimarla: @mayanks understood. Thanks for the reply. If we still want to connect to these data stores, we can do this via Trino integration, am I right?
  @mayanks: Likely not. The Trino Pinot connector uses data that is already in Pinot, queried via Pinot+Trino.
  @aswini.nellimarla: Excellent. Thanks for the confirmation Mayank :)
@jinal.panchal: Hello, I've started exploring Pinot. So is there any way to define primary key & foreign key relationships so that we can maintain the mapping? Because how will it support joins without maintaining relationships?
  @mayanks: Pinot only supports lookup join today
  @jinal.panchal: So, there is no way to maintain relationships, right? We have a use case where there is a student table & a subject table, which have a foreign key relationship based on subjectID. So, is there any way it supports Hibernate-ORM-like functionality to update/modify the child (referencing) table based on modifications to the parent (referenced) table?
  @mayanks: Not at the moment. You need to denormalize the tables upfront, or use Presto/Trino for joins.
  @jinal.panchal: Okay, so is Pinot not built for applications where we need relations or relational use cases?
  @mayanks: It is not a relational database, it is an OLAP datastore
@erik.bergsten: We started using the "latest" tagged Docker image so we can use timestamp indexes, but in this version Kafka SASL_PLAIN authentication doesn't work (class not found). Is it broken, or will we just have to wait for an official release to get timestamp indexes and full Kafka support in one image?
  @mayanks: Is there a GH issue for the Kafka problem you are seeing?
  @erik.bergsten: No, and it isn't an issue in 0.10.0. It just looks like the Kafka plain SASL login module isn't packaged in the latest Docker image
  @mayanks: @xiangfu0 ^^
  @xiangfu0: Can you try to use the shaded path: ```shaded.org.apache.kafka.xxxx```
  @erik.bergsten: @xiangfu0 it works! Will this be the standard path in 0.11 (and later)?
  @xiangfu0: Thanks for pointing this out. In short, we tried to package multiple Kafka consumer libs (Kafka 0.9, 2.0, 3.0, etc.) together, so we need to shade and relocate them separately. Let me rethink this problem and see if we can make this experience seamless
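For anyone else hitting the ClassNotFoundException on the "latest" image, a hedged sketch of the stream config entries involved, with the login module referenced via the shaded, relocated name suggested above (topic, broker, protocol, and credential values are placeholders):
```
import java.util.HashMap;
import java.util.Map;

public class KafkaSaslStreamConfigSketch {
  // Returns the streamConfigs entries for a SASL/PLAIN Kafka source; the exact
  // values (topic, broker list, credentials, protocol) depend on your setup.
  public static Map<String, String> saslStreamConfigs() {
    Map<String, String> streamConfigs = new HashMap<>();
    streamConfigs.put("streamType", "kafka");
    streamConfigs.put("stream.kafka.topic.name", "myTopic");
    streamConfigs.put("stream.kafka.broker.list", "kafka:9093");
    streamConfigs.put("stream.kafka.consumer.factory.class.name",
        "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory");
    // Kafka client security settings are passed through to the consumer.
    streamConfigs.put("security.protocol", "SASL_PLAINTEXT");
    streamConfigs.put("sasl.mechanism", "PLAIN");
    // Note the shaded/relocated package prefix for the login module on the "latest" image.
    streamConfigs.put("sasl.jaas.config",
        "shaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"myUser\" password=\"myPassword\";");
    return streamConfigs;
  }
}
```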
@ysuo: Hi team, I noticed Timestamp Index is supported and tried to use it. *But there is this error.*
```{"code":400,"error":"Cannot deserialize value of type `org.apache.pinot.spi.config.table.FieldConfig$IndexType` from String \"TIMESTAMP\": not one of the values accepted for Enum class: [INVERTED, FST, JSON, H3, TEXT, SORTED, RANGE]\n at [Source: (String)\"{\"tableName\":\"test_time_index\",\"tableType\":\"REALTIME\",\"segmentsConfig\":{\"schemaName\":\"test_time_index\",\"timeColumnName\":\"created_on\",\"timeType\":\"MILLISECONDS\",\"allowNullTimeValue\":true,\"replicasPerPartition\":\"1\",\"retentionTimeUnit\":\"DAYS\",\"retentionTimeValue\":\"30\",\"segmentPushType\":\"APPEND\",\"completionConfig\":{\"completionMode\":\"DOWNLOAD\"}},\"tenants\":{},\"fieldConfigList\":[{\"name\":\"timestamp\",\"encodingType\":\"DICTIONARY\",\"indexTypes\":[\"TIMESTAMP\"],\"time\"[truncated 3199 chars]; line: 1, column: 483] (through reference chain: org.apache.pinot.spi.config.table.TableConfig[\"fieldConfigList\"]->java.util.ArrayList[0]->org.apache.pinot.spi.config.table.FieldConfig[\"indexTypes\"]->java.util.ArrayList[0])"}```
*Part of my table schema is:*
```"dateTimeFieldSpecs": [
  {
    "name": "timestamp",
    "dataType": "TIMESTAMP",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
]```
*And part of my table config is:*
```"fieldConfigList": [
  {
    "name": "timestamp",
    "encodingType": "DICTIONARY",
    "indexTypes": ["TIMESTAMP"],
    "timestampConfig": {
      "granularities": ["DAY", "WEEK", "MONTH"]
    }
  }
]```
Any idea how to fix it?
  @mayanks: @jackie.jxt
  @mayanks: What version of Pinot?
  @ysuo: I'm using Pinot 0.10.0 and I'm referring to this doc.
  @jackie.jxt: This feature is not released yet. We should add a note in the documentation noting that it will be available in the next release
  @ysuo: I see. Thanks.
@mailtorahuljain: @mailtorahuljain has joined the channel
@pedro.j.santos: @pedro.j.santos has joined the channel
@ricardoruas88: @ricardoruas88 has joined the channel
@padma: Hi all, I am working on improving the query latency for my realtime time series table. There is no corresponding offline table and all the data is realtime data. It has about 61 billion records with 3.5 million unique ids and a size of 2.7 TB. I have a range index on the timestamp and an inverted index on the unique id. The incoming streaming data from Kafka is partitioned. The segmentation strategy is set to the default balanced segment assignment. Stats say 2 servers queried, 34 segments queried, 34 segments processed, and 34 segments matched. I am getting a query response time of ~2 seconds (sometimes 4 seconds), and repeated querying gives me 50 ms. Would the following changes improve the query performance?
1. Changing the segmentation strategy to partitioned replica-group segment assignment (a config sketch follows at the end of this thread)
2. Bloom filter (does it improve the performance for individual queries or aggregate queries only?)
3. I am assuming the star-tree index helps with aggregation and not independent records
4. We have the partitioning set to murmur in the table config
5. How can I allocate/increase the hot/warm memory?
6. Tenants are set to DefaultTenant for both server and broker. Would changing this improve things? If so, what should be changed?
7. Would enabling default star-tree and dynamic star-tree creation help?
8. Would disabling null handling affect the performance? It's currently set to true, but I don't expect null values for the indexed id and timestamp fields
9. Should I set autoGeneratedInvertedIndex and createInvertedIndexDuringSegmentGeneration to true? They are false currently
  @mayanks: A few questions: • What's the read QPS? • Broker/server VM CPU/mem? • What are the JVM configurations?
  @padma: It's not much currently. Even with 1 query, we are seeing this poor performance
  @padma: It's not being actively used. Just testing performance against the table in the query console
  @padma: Server memory is 32 GB with 8 CPUs - we have 42 servers
  @padma: Same configuration for the brokers, and we have 3 brokers
  @padma: Server JVM usage is around 9 GB on average across the servers
  @padma: Server CPU is about 20%
  @mayanks: Are the local disks attached to the servers SSDs?
  @padma: This is all setup on AWS
  @mayanks: Is EBS SSD?
  @mayanks: Also, can you share the broker response metadata and the log line from when the query takes 4s?
  @padma: Is there a way to check if the EBS is SSD?
  @padma: broker latency is 1 second
  @padma: let me share the log
  @padma: you need the broker log?
  @mayanks: Just the log line for the query request
  @mayanks: And also the response metadata returned by broker
  @padma: it could be any of the broker instances right?
  @padma: should I look at each of the broker logs?
  @padma: ```[BaseBrokerRequestHandler] [jersey-server-managed-async-executor-204831] requestId=17175209,table=xxx_REALTIME,timeMs=4407,docs=84901/508174840,entries=0/1358416,segments(queried/processed/matched/consuming/unavailable):36/36/36/1/0,consumingFreshnessTimeMs=1651536519201,servers=2/2,groupLimitReached=false,brokerReduceTimeMs=20,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);pinot-server-34_R=0,4383,3116753,1,-1;pinot-server-35_R=0,557,2997678,1,-1,offlineThreadCpuTimeNs=0,realtimeThreadCpuTimeNs=0```
  @padma: Also, numEntriesScannedInFilter is 0 - what does it mean?
  @padma: and numEntriesScannedPostFilter is 1358416 while numDocsScanned is 84901
  @padma: that seems pretty high
@padma: Anything else you can suggest other than increasing the resources?
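Regarding item 1 in the list above (partitioned replica-group segment assignment), a rough sketch, not a drop-in config, of the table-config pieces usually involved; the column name, partition function, and counts are placeholders, and a matching instance assignment config is needed as well:
```
public class PartitionedReplicaGroupConfigSketch {
  // Fragment of a realtime table config, held in a Java text block purely for
  // illustration: partition-based segment pruning plus replica-group instance selection.
  static final String ROUTING_AND_PARTITIONING_FRAGMENT = """
      "tableIndexConfig": {
        "segmentPartitionConfig": {
          "columnPartitionMap": {
            "uniqueId": { "functionName": "Murmur", "numPartitions": 32 }
          }
        }
      },
      "routing": {
        "segmentPrunerTypes": ["partition"],
        "instanceSelectorType": "replicaGroup"
      }
      """;
}
```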
@brandon308: @brandon308 has joined the channel

#random


@laila.sabar098: @laila.sabar098 has joined the channel
@jinal.panchal: @jinal.panchal has joined the channel
@mailtorahuljain: @mailtorahuljain has joined the channel
@pedro.j.santos: @pedro.j.santos has joined the channel
@ricardoruas88: @ricardoruas88 has joined the channel
@brandon308: @brandon308 has joined the channel

#troubleshooting


@laila.sabar098: @laila.sabar098 has joined the channel
@jinal.panchal: @jinal.panchal has joined the channel
@jinal.panchal: Hello, I've started exploring Pinot.. So is there any way to define primary key & foreign key relationships so that we can maintain mapping?
@diogo.baeder: Hi folks, let me ask for your opinion on modeling tables in Pinot. Suppose (just a fake case for simple illustration) that you have a data source with users having different objects at home, where the types and names of these objects are dynamic, and you want to store them in such a way that you can query them by object amounts, like finding users that have 2 cars and 2 TVs. Considering that you don't know what objects will be coming in beforehand, how would you model this? A JSON field for the objects, to keep them in a single row that represents the individual user? Spreading the objects across different rows and then aggregating and filtering on the application side? How would you guys model this?
  @g.kishore: Model it as a JSON-type column, as long as the objects a single user holds don't run into the hundreds of thousands
  @g.kishore: This will be the fastest and most efficient
  @diogo.baeder: Hmm... In some cases I might have about 100 items or so, but I hope this doesn't turn out to be a problem
  @mayanks: Should be fine
  @g.kishore: One thing missing in Pinot right now is the ability to configure indexes for each field within a JSON column
  @g.kishore: We only do an inverted index right now by default
  @g.kishore: Will be great if you can file an issue for this
  @diogo.baeder: I can, yes. Will do ASAP. Thanks again! :slightly_smiling_face:
  @mayanks: Also, is there structure to it, or do you just want to do a text match on a bunch of strings?
  @diogo.baeder: Something like, imagine a user has: • TVs: 2 • Cars: 1 And another user has: • TVs: 1 • Dogs: 4 So each user has a certain "thing" and then a certain amount of that thing. Just one level, no complex structure really. But the problem is that users would have different things, hence me not being able to define them as columns.
  @g.kishore: yeah json is right
  @diogo.baeder: Initially I structured this as each "thing" being a separate row, but it turned out to have an obvious problem: it would be impossible to filter users that have "2 TVs and 1 car", for example.
  @ysuo: Maybe you can transform one JSON document into multiple records, with TVs/Dogs/... stored in a field named type and 2/1/... stored in a field named amount.
  @g.kishore: Yes, that's another commonly used idea. The only drawback with that is if you want to get the count of users who have TVs and Dogs: that will require distinctCount, vs. count with JSON.
  @diogo.baeder: Thanks for the hint, Alice! But I was doing that already, and it didn't solve my problem because I ended up not being able to correlate different rows in the same query (e.g. "users that have 2 TVs, and either 2 dogs or 2 cars")
  @g.kishore: how big is the dataset
  @diogo.baeder: I don't know yet how big the dataset will be, in total it will probably be in the order of a few terabytes.
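To make the JSON-column approach concrete, a hedged sketch of the query side, assuming a table named userObjects with a JSON column objects listed under jsonIndexColumns in the table config; the exact JSON_MATCH predicate syntax is worth double-checking against the docs:
```
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class JsonColumnQuerySketch {
  public static void main(String[] args) {
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");
    // Each user row stores its objects as one JSON document, e.g. {"TVs": 2, "Cars": 1},
    // so a single filter can express "2 TVs AND 1 car" for the same user.
    ResultSetGroup results = connection.execute(
        "SELECT userId FROM userObjects "
            + "WHERE JSON_MATCH(objects, '\"$.TVs\" = 2 AND \"$.Cars\" = 1')");
    System.out.println(results.getResultSet(0));
    connection.close();
  }
}
```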
@mailtorahuljain: @mailtorahuljain has joined the channel
@pedro.j.santos: @pedro.j.santos has joined the channel
@ricardoruas88: @ricardoruas88 has joined the channel
@brandon308: @brandon308 has joined the channel

#custom-aggregators


@himanshu.rathore: @himanshu.rathore has joined the channel

#query-latency


@himanshu.rathore: @himanshu.rathore has joined the channel

#pinot-dev


@wcxzjtz:
@atri.sharma: Is there a way to set the number of segments required when creating a test data set for an integration test?
  @amrish.k.lal: Not sure if this is exactly what you are looking for, but in one of my unit test cases I created a table over two segments in the following way:
```
@BeforeClass
public void setUp()
    throws Exception {
  FileUtils.deleteDirectory(INDEX_DIR);

  List<GenericRow> records1 = new ArrayList<>(NUM_RECORDS);
  records1.add(createRecord(120, 200.50F, "albert1", "albert", 1643666769000L));
  records1.add(createRecord(250, 32.50F, "martian1", "mouse", 1643666728000L));
  records1.add(createRecord(310, -44.50F, "martian2", "mouse", 1643666432000L));
  records1.add(createRecord(340, 11.50F, "donald1", "duck", 1643666726000L));
  records1.add(createRecord(110, 16, "goofy1", "goofy", 1643667762000L));
  records1.add(createRecord(150, 12, "goofy2", "goofy", 1643667762000L));
  records1.add(createRecord(100, -28, "daffy1", "daffy", 1643667092000L));
  records1.add(createRecord(120, -16, "pluto1", "dwag", 1643666712000L));
  records1.add(createRecord(120, -16, "zebra1", "zookeeper", 1643666712000L));
  records1.add(createRecord(220, -16, "zebra2", "zookeeper", 1643666712000L));
  createSegment(records1, SEGMENT_NAME_LEFT);
  ImmutableSegment immutableSegment1 =
      ImmutableSegmentLoader.load(new File(INDEX_DIR, SEGMENT_NAME_LEFT), ReadMode.mmap);

  List<GenericRow> records2 = new ArrayList<>(NUM_RECORDS);
  records2.add(createRecord(150, 10.50F, "alice1", "wonderland", 1650069985000L));
  records2.add(createRecord(200, 1.50F, "albert2", "albert", 1650050085000L));
  records2.add(createRecord(32, 10.0F, "mickey1", "mouse", 1650040085000L));
  records2.add(createRecord(-40, 250F, "minney2", "mouse", 1650043085000L));
  records2.add(createRecord(10, 4.50F, "donald2", "duck", 1650011085000L));
  records2.add(createRecord(5, 7.50F, "goofy3", "duck", 1650010085000L));
  records2.add(createRecord(5, 4.50F, "daffy2", "duck", 1650045085000L));
  records2.add(createRecord(10, 46.0F, "daffy3", "duck", 1650032085000L));
  records2.add(createRecord(20, 20.5F, "goofy4", "goofy", 1650011085000L));
  records2.add(createRecord(-20, 2.5F, "pluto2", "dwag", 1650052285000L));
  createSegment(records2, SEGMENT_NAME_RIGHT);
  ImmutableSegment immutableSegment2 =
      ImmutableSegmentLoader.load(new File(INDEX_DIR, SEGMENT_NAME_RIGHT), ReadMode.mmap);

  _indexSegment = null;
  _indexSegments = Arrays.asList(immutableSegment1, immutableSegment2);
}
```
@atri.sharma: Rather than manually merging Avro files together?
@dadelcas: Hey there, can I get someone to review this PR? I'll need it to finish implementing timestamp and JSON support in the Trino connector. I've left a comment with regards to timestamps and time zones; I'll raise a separate issue for that if there isn't one yet
  @mayanks: Thanks for your contribution, will review. cc: @jackie.jxt

#pinot-perf-tuning


@himanshu.rathore: @himanshu.rathore has joined the channel

#getting-started


@laila.sabar098: @laila.sabar098 has joined the channel
@jinal.panchal: @jinal.panchal has joined the channel
@mailtorahuljain: @mailtorahuljain has joined the channel
@pedro.j.santos: @pedro.j.santos has joined the channel
@ricardoruas88: @ricardoruas88 has joined the channel
@brandon308: @brandon308 has joined the channel
@brandon308: Hello, I'm just getting started and wondering if there is any documentation on how to use the Pulsar plugin for stream ingestion?
  @mayanks: Seems like we need to add docs @kharekartik
  @mayanks: In the meanwhile
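Until the docs are added, a heavily hedged sketch of the streamConfigs entries the Pulsar plugin expects; the property names are from memory and should be verified against the plugin source, and the topic and broker values are placeholders:
```
import java.util.HashMap;
import java.util.Map;

public class PulsarStreamConfigSketch {
  // Candidate streamConfigs for Pulsar ingestion; add a decoder class name for
  // your message format (JSON, Avro, ...) as you would for a Kafka table.
  public static Map<String, String> pulsarStreamConfigs() {
    Map<String, String> streamConfigs = new HashMap<>();
    streamConfigs.put("streamType", "pulsar");
    streamConfigs.put("stream.pulsar.topic.name", "my-topic");
    streamConfigs.put("stream.pulsar.bootstrap.servers", "pulsar://localhost:6650");
    streamConfigs.put("stream.pulsar.consumer.type", "lowlevel");
    streamConfigs.put("stream.pulsar.consumer.factory.class.name",
        "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory");
    return streamConfigs;
  }
}
```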

#introductions


@laila.sabar098: @laila.sabar098 has joined the channel
@jinal.panchal: @jinal.panchal has joined the channel
@mailtorahuljain: @mailtorahuljain has joined the channel
@pedro.j.santos: @pedro.j.santos has joined the channel
@ricardoruas88: @ricardoruas88 has joined the channel
@brandon308: @brandon308 has joined the channel