Hi,

I have created a Spark job using the Dataset API. A chain of operations is 
performed, and the final result is written to HDFS.

But I also need to know how many records were read by each intermediate 
Dataset. Let's say I apply 5 operations to a Dataset (map, groupBy, etc.); 
I need to know how many records there were in each of the 5 intermediate 
Datasets. Can anybody suggest how this can be obtained at the Dataset level? 
I believe I can find this out at the task level (using listeners), but I'm 
not sure how to get it at the Dataset level.
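(For context, a common workaround in Spark is to thread a LongAccumulator 
through a pass-through map before each transformation, e.g. 
`ds.map(x => { acc.add(1); x })`, or to `cache()` each intermediate Dataset 
and call `count()` on it. Below is a minimal pure-Python sketch of that 
per-stage counting pattern, with no Spark dependency; the class and stage 
names here are illustrative inventions, not Spark APIs.)

```python
from typing import Callable, List


class CountingStage:
    """Wraps a transformation and counts how many records flow into it,
    the way a pass-through map incrementing a Spark LongAccumulator would."""

    def __init__(self, name: str, fn: Callable[[List], List]):
        self.name = name
        self.fn = fn
        self.records_in = 0  # analogous to a Spark LongAccumulator

    def __call__(self, records: List) -> List:
        self.records_in += len(records)  # count before transforming
        return self.fn(records)


# A small 5-stage chain, mirroring "5 operations on a dataset".
stages = [
    CountingStage("parse",   lambda rs: [r.strip() for r in rs]),
    CountingStage("filter",  lambda rs: [r for r in rs if r]),
    CountingStage("map",     lambda rs: [r.upper() for r in rs]),
    CountingStage("dedup",   lambda rs: list(dict.fromkeys(rs))),
    CountingStage("collect", lambda rs: rs),
]

data = [" a", "b ", "", "a", "b"]
for stage in stages:
    data = stage(data)

# Per-stage input counts are now available without a second pass.
for stage in stages:
    print(stage.name, stage.records_in)
```

In real Spark the same idea works because the accumulator is updated on the 
executors as records pass through, so you pay one extra no-op map per stage 
instead of a full recount of each intermediate Dataset.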

Thanks

