agrawaldevesh commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-672275450
Hi Cheng, I am wondering if you might have a perf test handy to test this new implementation vs your old approach ? After going through the code and following along, I think my random idea of Bitset/auxiliary data structure might be a a lot more work/plumbing and may be inefficient compared to your older approach :-(. Let me explain: The hashed relation has a list of (key, value) stored in pages. We want to know which of those key-value pairs didn't match. Unfortunately, due to the indirection of dataPages, this list is stored in a chunked way inside pages. You don't have ready access to the "index" of this list but instead of other things like (keyIndex, key, value, valueOffset, valueBase) and the like. I don't want to lead you astray but I would advise you to not throw your old implementation and do some perf testing to validate that the axillary approach works better ? Another way to solve this would be go with the spirit of your existing solution of putting the "matched bit" in the key-value payloads, but just put that as an unsafe field deep inside BinaryMap, so that you can avail of all the unsafe goodness. There will be a "Location.markMatched()" call to set this matched bit. Later on when you iterate these page entries in the BinaryMap, you can skip over those with the set matched bit. But this "unsafe" approach is more of an optimization and can cause other regressions. So I would be curious for perf numbers for the two approaches of out of band matched-bit vs within the hashed relation matched-bit (older approach). All the best and looking forward to how this new approach pans out ! ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org