You can explore grouping sets in SQL and write a user-defined aggregate function that performs the element-wise array sum.
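As a minimal sketch of what such an aggregate has to do, here is the element-wise array sum in plain Python (no Spark; `elementwise_sum`, `grouped`, and the shortened two-element arrays are hypothetical stand-ins for the grouped data described further down the thread):

```python
from collections import defaultdict

def elementwise_sum(arrays):
    """Add equal-length arrays index by index, producing one array."""
    result = [0.0] * len(arrays[0])
    for arr in arrays:
        for i, v in enumerate(arr):
            result[i] += v
    return result

# Two arrays grouped under one key, mirroring the thread's 5_3 example.
grouped = defaultdict(list)
grouped["5_3"].append([0.2938933463658645, 0.0437040427073041,
                       0.23002681025029648, 0.18003221216680454])
grouped["5_3"].append([0.8803063581705307, 0.8101324740101096,
                       0.48523937757683544, 0.5897714618376072])

sums = {key: elementwise_sum(arrs) for key, arrs in grouped.items()}
```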
It will boil down to something like:

    SELECT attr1, attr2, ..., yourAgg(val)
    FROM t
    GROUP BY GROUPING SETS ((attr1, attr2), (attr1))

On 12 Nov 2016 04:57, "Anil Langote" <anillangote0...@gmail.com> wrote:

> Hi All,
>
> I have been working on one use case and couldn't think of a better
> solution; I have seen you are very active on the Spark user list, so
> please share your thoughts on the implementation. Below is the
> requirement.
>
> I have tried using a Dataset and splitting the double array column, but
> that fails when the array grows large. When I declare the double array
> in the schema, Spark doesn't let me sum the arrays because sum is
> supported only on numeric types. If I store one Parquet file per
> combination, there will be too many Parquet files.
>
> *Input:* The input file will look like the table below; in the real
> data the attributes will be *20* and the double array will have
> *50,000* elements.
>
> | Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray |
> |-------------|-------------|-------------|-------------|-------------|
> | 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454 |
> | 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575 |
> | 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072 |
> | 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034 |
> | 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504 |
>
> Now below are all the possible combinations for the above data set:
>
> 1. Attribute_0, Attribute_1
> 2. Attribute_0, Attribute_2
> 3. Attribute_0, Attribute_3
> 4. Attribute_1, Attribute_2
> 5. Attribute_2, Attribute_3
> 6. Attribute_1, Attribute_3
> 7. Attribute_0, Attribute_1, Attribute_2
> 8. Attribute_0, Attribute_1, Attribute_3
> 9. Attribute_0, Attribute_2, Attribute_3
> 10. Attribute_1, Attribute_2, Attribute_3
> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
>
> Now we have to process all these combinations over the input data,
> preferably in parallel, to get good performance.
>
> *Attribute_0, Attribute_1*
>
> In this iteration the other attributes (*Attribute_2, Attribute_3*) are
> not required; all we need is Attribute_0, Attribute_1 and the double
> array column. If you look at the data there are two qualifying keys,
> 5_3 and 3_2; we have to pick only those keys that occur at least twice,
> and in the real data we will get thousands of them.
>
> | Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray |
> |-------------|-------------|-------------|-------------|-------------|
> | 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454 |
> | 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575 |
> | 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072 |
> | 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034 |
> | 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504 |
>
> When we do a groupBy on the above data set with columns *Attribute_0,
> Attribute_1*, we get two records with keys 5_3 and 3_2, and each key
> has two double arrays.
>
> 5_3 ==> 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
>         & 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
>
> 3_2 ==> 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
>         & 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
>
> Now we have to add these double arrays index-wise to produce one array:
>
> 5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
>
> After adding, we have to compute the average, min, max, etc. over each
> vector and store the results against the key.
>
> The same process will be repeated for the next combinations.
>
> Thank you
> Anil Langote
> +1-425-633-9747
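Putting the thread's steps together, here is a single-machine sketch in plain Python (names such as `rows` and `process_combination` are hypothetical; in Spark each combination's groupBy and aggregation would instead run as a distributed job over the real 20-attribute, 50,000-element data):

```python
from itertools import combinations

# Toy rows mirroring the table in the thread: an attribute tuple + a double array.
rows = [
    ((5, 3, 5, 3), [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]),
    ((3, 2, 1, 3), [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]),
    ((5, 3, 5, 2), [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]),
    ((3, 2, 1, 3), [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]),
    ((2, 2, 0, 5), [0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504]),
]
NUM_ATTRS = 4

def process_combination(attr_idxs, rows):
    """Group rows by the chosen attribute columns, keep keys seen at least
    twice, sum the arrays element-wise, then compute avg/min/max per key."""
    groups = {}
    for attrs, arr in rows:
        key = "_".join(str(attrs[i]) for i in attr_idxs)
        groups.setdefault(key, []).append(arr)
    results = {}
    for key, arrs in groups.items():
        if len(arrs) < 2:          # keep only keys with >= 2 rows
            continue
        summed = [sum(vals) for vals in zip(*arrs)]
        results[key] = {
            "sum": summed,
            "avg": sum(summed) / len(summed),
            "min": min(summed),
            "max": max(summed),
        }
    return results

# All attribute combinations of size 2 through 4: the 11 listed in the thread.
all_results = {
    idxs: process_combination(idxs, rows)
    for size in range(2, NUM_ATTRS + 1)
    for idxs in combinations(range(NUM_ATTRS), size)
}
```

With the toy rows above, the `(Attribute_0, Attribute_1)` iteration yields exactly the keys 5_3 and 3_2 from the thread (the 2_2 key appears only once and is dropped), and their summed vectors match the arrays quoted in the email.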