agrawaldevesh commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-672275450


   Hi Cheng, I am wondering if you might have a perf test handy to test this 
new implementation vs your old approach ?
   
   After going through the code and following along, I think my random idea of 
Bitset/auxiliary data structure might be a a lot more work/plumbing and may be 
inefficient compared to your older approach :-(. Let me explain:
   
   The hashed relation has a list of (key, value) stored in pages. We want to 
know which of those key-value pairs didn't match. Unfortunately, due to the 
indirection of dataPages, this list is stored in a chunked way inside pages. 
You don't have ready access to the "index" of this list but instead of other 
things like (keyIndex, key, value, valueOffset, valueBase) and the like.
   
   I don't want to lead you astray but I would advise you to not throw your old 
implementation and do some perf testing to validate that the axillary approach 
works better ?
   
   Another way to solve this would be go with the spirit of your existing 
solution of putting the "matched bit" in the key-value payloads, but just put 
that as an unsafe field deep inside BinaryMap, so that you can avail of all the 
unsafe goodness. There will be a "Location.markMatched()" call to set this 
matched bit. Later on when you iterate these page entries in the BinaryMap, you 
can skip over those with the set matched bit. But this "unsafe" approach is 
more of an optimization and can cause other regressions.
   
   So I would be curious for perf numbers for the two approaches of out of band 
matched-bit vs within the hashed relation matched-bit (older approach).
   
   All the best and looking forward to how this new approach pans out !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to