#general


@lakshmanan.velusamy: Hello Community, We are trying to add user-defined scalar functions. Can these functions be used in the star-tree index for pre-aggregation?
  @g.kishore: Scalar functions are one-to-one; aggregation functions are many-to-one. You will have to implement ```public interface AggregationFunction<IntermediateResult, FinalResult extends Comparable>```
  @lakshmanan.velusamy: Thanks @g.kishore. Does the star-tree index support custom aggregate functions?
  @g.kishore: Yes, you will have to implement the interface I pasted above and have that available on the classpath
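  As a rough illustration of where such a function would be referenced, star-tree pre-aggregations are declared in the table config under `tableIndexConfig.starTreeIndexConfigs`. The sketch below is illustrative only, with hypothetical `country`/`clicks` columns and a hypothetical custom function name `MYCUSTOMAGG`:
  ```
  "tableIndexConfig": {
    "starTreeIndexConfigs": [{
      "dimensionsSplitOrder": ["country"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__clicks", "MYCUSTOMAGG__clicks"],
      "maxLeafRecords": 10000
    }]
  }
  ```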
@kkmagic99: @kkmagic99 has joined the channel
@jmeyer: Hello :slightly_smiling_face: Is there a whirlwind tour of Pinot's code base available somewhere? Some pointers on where to start?
  @mayanks: What would you like to do? Based on that I can point you in the right direction in the code. You can familiarize yourself with the Pinot architecture (if not already done)
  @jmeyer: To get to know & understand Pinot more. I think I've read most of the existing documentation and I'm wondering how to get a better understanding overall :)
  @jmeyer: But yeah, a sort of codebase "entrypoint" would be a start :sourire: (for queries, for example)
  @npawar: What helped me was running the quick start and then following the query in debug mode, and also ingestion. Another good way would be to pick up a beginner issue from GitHub. Lemme know if you'd like to do that, I can point you to some!
  @mayanks: Yep this ^^
@savio.teles: Hello. With upsert during real-time ingestion, what happens when two records have the same primary key and the same event time? The documentation says: "When two records of the same primary key are ingested, _the record with the greater event time (as defined by the time column) is used_." But when there is a tie, what happens?
  @g.kishore: @yupeng @jackie.jxt will know the exact answer. As part of partial upsert work we are doing, we will make the merging logic pluggable/configurable
  @yupeng: Currently the behavior is undefined, so it's implementation-dependent: the message with the largest offset wins. However, there are caveats; for example, when the records are sorted by some column, the order is not deterministic
  @mayanks: @yupeng mind adding to docs/FAQ?
  @savio.teles: I have an upsert scenario that I am not getting my head around. We have created an order table (configuration in the file below) with the primary key being "kid" and the time column being "_order_date". We have duplicated rows (same "kid" and "_order_date") in our table. But in the end, upsert is keeping duplicates. Depending on the query filter, it returns duplicate items. See the screenshots below.
  @jackie.jxt: I don't see upsert enabled in the table config. Please follow the instructions here to enable upsert:
  @jackie.jxt: Please make sure the Kafka stream is partitioned on the primary key
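  For reference, a minimal sketch of the upsert-related pieces of the real-time table config (per the linked docs; values illustrative, and `strictReplicaGroup` routing is what the upsert docs call for so that all segments of a partition are served by the same server):
  ```
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": {
    "mode": "FULL"
  }
  ```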
  @savio.teles: Sorry @jackie.jxt. Here is the updated file! The one I had sent was from the previous version. The error happens with the table created with this schema below.
  @savio.teles:
  @jackie.jxt: Is the Kafka stream partitioned with the primary key?
  @jackie.jxt: Can you try this query: select kid, $hostName from orders2 where kid = 'ever max-100000009'
  @yupeng: @mayanks sure thing
  @savio.teles: @jackie.jxt. The results to the query are in the image below. I don't know if the Kafka stream is partitioned with the primary key. I will check here and get back to you. Tks
  @jackie.jxt: @savio.teles The kafka stream is not properly partitioned, and the same `kid` shows up in 2 different partitions
  @savio.teles: Tks, @jackie.jxt.
@cgoddard: @cgoddard has joined the channel

#random


@kkmagic99: @kkmagic99 has joined the channel
@cgoddard: @cgoddard has joined the channel

#feat-presto-connector


@keweishang: @keweishang has joined the channel

#troubleshooting


@elon.azoulay: We ran into an issue where a thread was blocked, and it caused the entire cluster to stop processing queries because the worker thread pool (`pqw` threads in the default scheduler) on one server was blocked. Would it help to make the worker thread pool a cached thread pool while still keeping the query runner thread pool (`pqr` threads in the default scheduler) fixed? Or do you recommend using one of the other query schedulers? Here is the thread dump:
  @mayanks: Hmm, could you elaborate on how just one thread being blocked caused the entire cluster to stop processing?
  @mayanks: Also, any insights on why it was blocked?
  @elon.azoulay: Yep, was really strange but happened multiple times, coinciding with large # of segments scanned for a particular table. Only one `pqr` threadpool on one worker was blocked, but no queries in that tenant were completing.
  @elon.azoulay: We would see messages in the logs like this:
  @elon.azoulay:
  @elon.azoulay: Thread was blocked (or deadlocked?) reading.
  @elon.azoulay: maybe it was deadlocked with a thread ingesting realtime segments?
  @mayanks: I think the root cause might be something else, I don't see how one thread can block the entire cluster.
  @elon.azoulay: Either way, each time it happened, just bouncing that one server fixed the issue.
  @elon.azoulay: I checked threads on all the servers in that tenant and only one particular server would have `pqw` threads in BLOCKED state with a similar thread dump each time
  @elon.azoulay: I did see that direct memory dropped and mmapped memory suddenly would spike each time this happened.
  @elon.azoulay: And no queries completed (for tables on that tenant) until the server was restarted.
  @mayanks: That probably means segment completion
  @elon.azoulay: Any idea what would cause that?
  @mayanks: The direct mem vs mmap is segment completion.
  @mayanks: Are you saying one thread in one server is causing entire cluster to stop processing?
  @mayanks: Or are you saying all/several pqw in one server?
  @elon.azoulay: Only for queries hitting tables on that tenant
  @elon.azoulay: and only 1 server had blocked pqw threads, always with the same thread dump
  @mayanks: Ok, multiple blocked pqw threads on 1 server, then?
  @elon.azoulay: From the thread dump I see this code:
  @elon.azoulay: ```public void add(int dictId, int docId) {
  if (_bitmaps.size() == dictId) {
    // Bitmap for the dictionary id does not exist, add a new bitmap into the list
    ThreadSafeMutableRoaringBitmap bitmap = new ThreadSafeMutableRoaringBitmap(docId);
    try {
      _writeLock.lock();
      _bitmaps.add(bitmap);
    } finally {
      _writeLock.unlock();
    }
  } else {
    // Bitmap for the dictionary id already exists, check and add document id into the bitmap
    _bitmaps.get(dictId).add(docId);
  }
}```
  @elon.azoulay: Maybe because a segment was added while the unsynchronized code was getting the size or calling get? (i.e. the first and last lines)
  @elon.azoulay: And although it was multiple pqw threads blocked on 1 server, there were still runnable pqw threads. Each time this happened it was the same scenario: 1 server, 1-3 threads blocked with that same thread dump.
  @mayanks:
  @mayanks: Are you using text index?
  @mayanks: if not then nm
  @elon.azoulay: ok:)
  @elon.azoulay: we're not using text indexes on this table at all
  @mayanks: what release?
  @elon.azoulay: 0.6.0
  @elon.azoulay: we're testing 0.7.1, do you recommend we upgrade?
  @mayanks: I am curious to know what is happening here. I haven't caught any race conditions on my radar recently.
  @elon.azoulay: yep, I just noticed that the size() and get() are called unsynchronized, but it doesn't seem like that would lead to any deadlocks...
  @elon.azoulay: From the logs it looks like a thread was stuck reading - also not sure why it would block other queries from completing.
  @mayanks: I can see how multiple pqw threads blocked can cause this. But single one, I am not so sure.
  @elon.azoulay: there were multiple pqw threads blocked (1-3 out of 8 ) but there were also runnable pqw threads each time
  @mayanks: There's a metric for wait time in scheduler queue for the query
  @elon.azoulay: Nice, is it in the fcfs code?
  @mayanks: ```ServerQueryExecutorV1Impl```
  @elon.azoulay: thanks!
  @mayanks: Check for SCHEDULER_WAIT
  @mayanks: see if that spikes
  @mayanks: If so, then it is building a backlog
  @mayanks: One way to mitigate that is to set table level timeout
  @elon.azoulay: ok, and the other thing that happened each time was a huge (like 75%) drop in direct memory used, and a spike in mmapped memory.
  @elon.azoulay: Oh nice, how do we set the timeout?
  @mayanks: ```QueryConfig```
  @mayanks: Inside of Table config
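  A minimal sketch of that table-level timeout, assuming it goes in the `query` section of the table config (the 60s value is only an example):
  ```
  "query": {
    "timeoutMs": 60000
  }
  ```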
  @mayanks: Although, since in Java you cannot force-interrupt a thread, I'm not sure that will preempt this blocked thread.
  @mayanks: Perhaps it does
  @mayanks: But again, this is mitigation.
  @mayanks: I'd recommend filing an issue with as much details as possible. We should get to the root cause.
  @elon.azoulay: Sure, I saved a bunch of logs and metrics...
  @mayanks: Seems like some race condition might be causing some threads to block, building a backlog (please check the scheduler wait metric)
  @elon.azoulay: Would changing pqw to a cached threadpool be dangerous?
  @elon.azoulay: Yep, I'm looking for it now, thanks!
  @mayanks: We'd need to think about the repercussions of that. I'd rather get to the bottom of the issue instead of just trying something.
  @elon.azoulay: yep, makes sense
  @elon.azoulay: thanks, I'll look into the metrics
  @mayanks: Please also include info on whether segment commit was happening (and whether it failed), any special characteristics of the table/Kafka topic where this happens, etc.
  @mayanks: This will help get some clues.
  @elon.azoulay: Sure, I'll file an issue, thanks for the advice!
  @mayanks: thanks for reporting.
  @elon.azoulay: So I do see spikes in the scheduler wait metric:
  @elon.azoulay:
  @mayanks: Do they correspond to the time when the issue happened? If so, try setting a table-level timeout as a fallback, but we should still debug the issue
  @elon.azoulay: yep
  @elon.azoulay: also another note: using the trino connector I could query the servers directly with no blocks or delays
  @elon.azoulay: same table...
  @elon.azoulay: it was only broker queries which were blocked - not sure why restarting 1 server would fix the issue each time.
  @mayanks: Hmm, the broker-to-server connection is async now (and has been for a long time), so not sure why that would happen
  @mayanks: Oh, I think it could be because the Trino-to-server connection uses a different thread pool (not shared with pqw)?
  @mayanks: Also tagging @jackie.jxt
  @jackie.jxt: If it is always the same server that is blocking, maybe check hardware failure?
  @jackie.jxt: E.g. block on IO because one segment is unreadable
  @jackie.jxt: (Just random guess)
  @ken: Hmm, I had something similar happen when I did a `DISTINCT` query on a column with very high cardinality. A broker somehow got “wedged”, and would no longer process queries - we had to bounce it. IIRC the guess at that time was some issue with the connection pool getting hung, when the response from a server was too big.
  @mayanks: Yeah, so the end result is queries piling up and timing out. It may be mitigated by choosing a more appropriate table-level query timeout. In your case it was an expensive query, but here I am more concerned about the possible deadlock
  @elon.azoulay: Yeah, it happened again today - sorry for the delay. I saved thread dumps (all similar). Today it was multiple servers, all in the pqw pool.
  @elon.azoulay: And the scheduler wait metric showed pileups again. It seemed like it wasn't one large query; many small ones were blocked...
@elon.azoulay:
@kkmagic99: @kkmagic99 has joined the channel
@patidar.rahul8392: It's showing this msg but the data is not reflected in the table.
@patidar.rahul8392: This is the configuration file
@patidar.rahul8392: Hi Everyone, I am loading data from an HDFS location into a Pinot hybrid table. I have pushed data for 5 days and executed this command 5 times, once for each day's file:
```hadoop jar \
  ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  -jobSpecFile /home/rtrs/hybrid/config/final/executionFrameworkSpec.yaml```
In the end, when I do select * from tablename_OFFLINE, I am only able to see the latest data, i.e. the 5th day's data. This is the timestamp column value in my data: "current_ts":"2021-05-30T23:34:31.624000". These are the details from the schema file for the timestamp column:
```"dateTimeFieldSpecs": [
  {
    "name": "current_ts",
    "dataType": "STRING",
    "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSSSS",
    "granularity": "1:MILLISECONDS"
  }
]```
And these are the details from the offline_config.json file:
```"tableType": "OFFLINE",
"segmentsConfig": {
  "timeColumnName": "current_ts",
  "replication": "1",
  "replicasPerPartition": "1",```
Looks like some timestamp issue. Kindly suggest what I need to change here.
  @jackie.jxt: Can you please check how many segments are there in this table?
  @patidar.rahul8392: @jackie.jxt gimme 2 min, I'm trying to reload the files. Will share the details within 5 min
  @patidar.rahul8392: It's creating only one segment @jackie.jxt
  @patidar.rahul8392: Segment status is showing as Good and server status is Online for this
  @jackie.jxt: Do you mean only one segment in the table? What is the segment name? I'm suspecting the job is generating the segment with the same name every day and keeps replacing it
  @patidar.rahul8392: The segment name is exactly the same as my table name
  @patidar.rahul8392: Which I have mentioned as segment.name.prefix in my conf file
  @patidar.rahul8392:
  @patidar.rahul8392: This is the segment name @jackie.jxt
  @jackie.jxt: In your table config, do you specify the `pushType` as `REFRESH`?
  @patidar.rahul8392: Yes, you were right @jackie.jxt, it's replacing the segments. I have tried a different segment name for the next file, so now 2 segments are there and data is there for 2 days
  @patidar.rahul8392:
  @patidar.rahul8392: Where do I need to specify this @jackie.jxt? I am not using this property
  @patidar.rahul8392: In the offline table config also I am not using this property
  @jackie.jxt: What is your `ingestionConfig`?
  @jackie.jxt: You can have it like: ```"ingestionConfig": { ..., "batchIngestionConfig": { "segmentIngestionType": "APPEND", "segmentIngestionFrequency": "DAILY" } }```
  @patidar.rahul8392: Ok, so I need to add this change only in offline-table-config.json, right? Not in realtime-config.json?
  @patidar.rahul8392:
  @jackie.jxt: Yes, no need to add it to the realtime-table-config
  @ken: What is in your `executionFrameworkSpec.yaml` file? You typically need a section like: ```segmentNameGeneratorSpec:
  # type: Current supported types are 'simple' and 'normalizedDate'.
  type: normalizedDate
  configs:
    segment.name.prefix: 'ads_batch'```
So that segment names vary by date. If you would wind up with multiple results that have the same name (by date) you can add `exclude.sequence.id: false` to the `configs` section.
@patidar.rahul8392: @npawar @fx19880617 @ken @mayanks @elon.azoulay
@cgoddard: @cgoddard has joined the channel

#pinot-k8s-operator


@vananth22: @vananth22 has joined the channel

#pinot-dev


@vananth22: @vananth22 has joined the channel
@vananth22: :wave: I'm curious to know if there has been any recent conversation around supporting window functions in Pinot :bow:
  @g.kishore: Yes.. we plan to add something more powerful than window functions.. match_recognize

#getting-started


@keweishang: @keweishang has joined the channel
@hongtaozhang: @hongtaozhang has joined the channel

#complex-type-support


@vananth22: @vananth22 has joined the channel