You can explore grouping sets in SQL and write a user-defined aggregate function that performs the element-wise array sum.
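As a minimal sketch of what such an aggregate has to do, here is the element-wise array sum in plain Python (no Spark; `elementwise_sum`, `grouped`, and the shortened two-element arrays are hypothetical stand-ins for the grouped data described further down the thread):

```python
from collections import defaultdict

def elementwise_sum(arrays):
    """Add equal-length arrays index by index, producing one array."""
    result = [0.0] * len(arrays[0])
    for arr in arrays:
        for i, v in enumerate(arr):
            result[i] += v
    return result

# Two arrays grouped under one key, mirroring the thread's 5_3 example.
grouped = defaultdict(list)
grouped["5_3"].append([0.2938933463658645, 0.0437040427073041,
                       0.23002681025029648, 0.18003221216680454])
grouped["5_3"].append([0.8803063581705307, 0.8101324740101096,
                       0.48523937757683544, 0.5897714618376072])

sums = {key: elementwise_sum(arrs) for key, arrs in grouped.items()}
```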
It will boil down to something like:

    SELECT attr1, attr2, ..., yourAgg(val)
    FROM t
    GROUP BY GROUPING SETS ((attr1, attr2), (attr1))

On 12 Nov 2016 04:57, "Anil Langote" <anillangote0...@gmail.com> wrote:

> Hi All,
>
> I have been working on one use case and couldn't think of a better
> solution; I have seen you are very active on the Spark user list, so
> please share your thoughts on the implementation. Below is the
> requirement.
>
> I have tried using a Dataset and splitting the double array column, but
> that fails when the array grows large. When I declare the double array
> in the schema, Spark doesn't let me sum the arrays because sum is
> supported only on numeric types. If I store one Parquet file per
> combination, there will be too many Parquet files.
>
> *Input:* The input file will look like the table below; in the real
> data the attributes will be *20* and the double array will have
> *50,000* elements.
>
> | Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray |
> |-------------|-------------|-------------|-------------|-------------|
> | 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454 |
> | 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575 |
> | 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072 |
> | 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034 |
> | 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504 |
>
> Now below are all the possible combinations for the above data set:
>
> 1. Attribute_0, Attribute_1
> 2. Attribute_0, Attribute_2
> 3. Attribute_0, Attribute_3
> 4. Attribute_1, Attribute_2
> 5. Attribute_2, Attribute_3
> 6. Attribute_1, Attribute_3
> 7. Attribute_0, Attribute_1, Attribute_2
> 8. Attribute_0, Attribute_1, Attribute_3
> 9. Attribute_0, Attribute_2, Attribute_3
> 10. Attribute_1, Attribute_2, Attribute_3
> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
>
> Now we have to process all these combinations over the input data,
> preferably in parallel, to get good performance.
>
> *Attribute_0, Attribute_1*
>
> In this iteration the other attributes (*Attribute_2, Attribute_3*) are
> not required; all we need is Attribute_0, Attribute_1 and the double
> array column. If you look at the data there are two qualifying keys,
> 5_3 and 3_2; we have to pick only those keys that occur at least twice,
> and in the real data we will get thousands of them.
>
> | Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray |
> |-------------|-------------|-------------|-------------|-------------|
> | 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454 |
> | 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575 |
> | 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072 |
> | 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034 |
> | 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504 |
>
> When we do a groupBy on the above data set with columns *Attribute_0,
> Attribute_1*, we get two records with keys 5_3 and 3_2, and each key
> has two double arrays.
>
> 5_3 ==> 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
>         & 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
>
> 3_2 ==> 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
>         & 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
>
> Now we have to add these double arrays index-wise to produce one array:
>
> 5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
>
> After adding, we have to compute the average, min, max, etc. over each
> vector and store the results against the key.
>
> The same process will be repeated for the next combinations.
>
> Thank you
> Anil Langote
> +1-425-633-9747
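Putting the thread's steps together, here is a single-machine sketch in plain Python (names such as `rows` and `process_combination` are hypothetical; in Spark each combination's groupBy and aggregation would instead run as a distributed job over the real 20-attribute, 50,000-element data):

```python
from itertools import combinations

# Toy rows mirroring the table in the thread: an attribute tuple + a double array.
rows = [
    ((5, 3, 5, 3), [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]),
    ((3, 2, 1, 3), [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]),
    ((5, 3, 5, 2), [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]),
    ((3, 2, 1, 3), [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]),
    ((2, 2, 0, 5), [0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504]),
]
NUM_ATTRS = 4

def process_combination(attr_idxs, rows):
    """Group rows by the chosen attribute columns, keep keys seen at least
    twice, sum the arrays element-wise, then compute avg/min/max per key."""
    groups = {}
    for attrs, arr in rows:
        key = "_".join(str(attrs[i]) for i in attr_idxs)
        groups.setdefault(key, []).append(arr)
    results = {}
    for key, arrs in groups.items():
        if len(arrs) < 2:          # keep only keys with >= 2 rows
            continue
        summed = [sum(vals) for vals in zip(*arrs)]
        results[key] = {
            "sum": summed,
            "avg": sum(summed) / len(summed),
            "min": min(summed),
            "max": max(summed),
        }
    return results

# All attribute combinations of size 2 through 4: the 11 listed in the thread.
all_results = {
    idxs: process_combination(idxs, rows)
    for size in range(2, NUM_ATTRS + 1)
    for idxs in combinations(range(NUM_ATTRS), size)
}
```

With the toy rows above, the `(Attribute_0, Attribute_1)` iteration yields exactly the keys 5_3 and 3_2 from the thread (the 2_2 key appears only once and is dropped), and their summed vectors match the arrays quoted in the email.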