Hello everyone, I am running some analytics experiments in a spark shell on a 4-server standalone cluster, mostly groupBy and aggregations over a huge dataset.
I am picking 6 groupBy columns and returning various aggregated results in a DataFrame. The groupBy fields are of two types: most of them are StringType and the rest are LongType. The data source is a DataFrame built from a split JSON file. As long as the data stays persisted, the result is consistent. But if I unpersist the data from memory and reload it, the groupBy action returns different results, with some data missing. Could I be missing something? This is rather serious for my analytics, and I am not sure how to properly diagnose the situation.

Thanks,
Saif
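
To make the scenario concrete, here is a rough sketch of how two runs could be compared in the spark shell. It is only an illustration under assumptions: it presumes a Spark 2.x or later shell where `spark` is predefined (on 1.x, `sqlContext.read.json` would take its place), and the path and the column names `k1` to `k6` are hypothetical placeholders, not the real ones.

```scala
// Sketch for spark-shell; `spark` is assumed to be the predefined SparkSession.
// The path and all column names are hypothetical placeholders.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, lit}

val path = "/data/events/*.json"                    // hypothetical location of the split JSON
val keys = Seq("k1", "k2", "k3", "k4", "k5", "k6")  // hypothetical groupBy columns

// Read the JSON from disk and aggregate; only a row count per group is
// computed here so that the comparison is not affected by floating-point
// summation order.
def aggregate(): DataFrame =
  spark.read.json(path)
    .groupBy(keys.head, keys.tail: _*)
    .agg(count(lit(1)).as("n"))

val first = aggregate().persist()
first.count()               // materialize the first run in memory

val second = aggregate()    // second run, re-reading the JSON from disk

// Groups present in one run but not in the other.
// Both numbers are 0 when the two runs agree.
println(first.except(second).count())
println(second.except(first).count())

first.unpersist()
```

If either count is non-zero, calling `show()` on the corresponding `except` result lists the groups that differ between the two runs, which may help narrow down which input rows go missing.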