Hi,

Sending this to the dev list as per Matthias's suggestion.
Alok

----- Forwarded by Alok Singh/San Francisco/IBM on 05/23/2016 10:04 PM -----

From: Matthias Boehm/Almaden/IBM
To: Alok Singh/San Francisco/IBM@IBMUS
Cc: Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 09:02 PM
Subject: Re: Questions/query about recode / transform in systemML

Hi Alok,

would you mind posting this question on our dev mailing list such that other people also benefit from it? Thanks.

Regards,
Matthias

From: Alok Singh/San Francisco/IBM
To: Matthias Boehm/Almaden/IBM@IBMUS, Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 07:19 PM
Subject: Questions/query about recode / transform in systemML

Hi Matthias and Arvind,

I have some questions about the internals of the SystemML transform, specifically about how it scans the data.

Question 1

Let's consider an example dataframe as follows (the first line is the schema):

userID, county, state
=====================
1, sanJose, CA
2, santaClara, CA
3, sanJose, CA
4, alameda, CA
5, minnepolis, MN

We can see that the unique values for county are {sanJose, santaClara, alameda, minnepolis} and for state are {CA, MN}.

Following the example in the doc at http://apache.github.io/incubator-systemml/files/dml-language-reference/data.spec.json, suppose the user passes in the spec file as

"recode": ["county", "state"]

The question is how many passes SystemML will make over the dataframe. In general, the recode algorithm would be:

for column in columns:
    step 1) find the unique values for the column
    step 2) apply the recode values to the column

Does this mean we would need 2*count(columns) passes over the dataframe? If not, how does SystemML internally avoid needing that many passes?

Question 2

Let's consider another dataframe as follows (the first line is the schema):

random_string
=============
col1
dsfsdf
xcvxcv
sdf
etc
foo
Dummy

We can see that the number of unique values for this dataframe will be almost the same as the number of rows. What if the number of rows is 10 trillion and the number of unique values for the column random_string is also 10 trillion? In that case, the full set of unique values will not fit on one node. How does SystemML handle this?
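
To make Question 1 concrete, here is a minimal Python sketch of the two pass counts being compared. It is not SystemML code and says nothing about how SystemML actually works; the function names (recode_per_column, recode_two_pass) are hypothetical, and assigning 1-based codes in order of first appearance is an assumption made only for illustration. The first function follows the per-column loop from the question and scans the data 2*count(columns) times. The second collects the unique values of all recode columns in a single scan and applies them in a second scan, so it needs two passes regardless of the number of columns.

```python
from typing import Dict, List, Tuple

# Sample data from Question 1 (values copied as written in the question).
rows: List[Dict[str, object]] = [
    {"userID": 1, "county": "sanJose", "state": "CA"},
    {"userID": 2, "county": "santaClara", "state": "CA"},
    {"userID": 3, "county": "sanJose", "state": "CA"},
    {"userID": 4, "county": "alameda", "state": "CA"},
    {"userID": 5, "county": "minnepolis", "state": "MN"},
]


def recode_per_column(data: List[dict], columns: List[str]) -> Tuple[List[dict], int]:
    """Per-column loop from the question: two scans of the data per column."""
    out = [dict(r) for r in data]
    passes = 0
    for col in columns:
        codes: Dict[object, int] = {}
        for r in data:  # scan 1: find the unique values of this column
            codes.setdefault(r[col], len(codes) + 1)
        passes += 1
        for r in out:  # scan 2: replace each value with its code
            r[col] = codes[r[col]]
        passes += 1
    return out, passes


def recode_two_pass(data: List[dict], columns: List[str]) -> Tuple[List[dict], int]:
    """Hypothetical alternative: handle all columns within each scan."""
    codes: Dict[str, Dict[object, int]] = {col: {} for col in columns}
    for r in data:  # scan 1: build the code maps for every column at once
        for col in columns:
            codes[col].setdefault(r[col], len(codes[col]) + 1)
    out = []
    for r in data:  # scan 2: apply every code map to each row
        new_row = dict(r)
        for col in columns:
            new_row[col] = codes[col][r[col]]
        out.append(new_row)
    return out, 2


naive_rows, naive_passes = recode_per_column(rows, ["county", "state"])
fused_rows, fused_passes = recode_two_pass(rows, ["county", "state"])
print(naive_passes, fused_passes)
```

Both functions produce the same recoded rows on this sample; only the number of scans differs. The sketch keeps every code map in memory on one machine, so it does not address the situation in Question 2.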

Thanks for the inputs,
Alok