Thanks for the question, Alok. A couple of comments:

Q1) Independent of the number of columns, we always do two passes over the data: one to compute the recode maps (the distinct values of each recoded column) and one to apply the recode maps to your data.
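
To make the two-pass behavior concrete, here is a minimal single-node sketch in Python. It is not SystemML's implementation: the function names are hypothetical, and assigning 1-based codes in order of first appearance is an assumption made only for illustration. It shows that one scan can collect the distinct values of all recoded columns at once, so the number of passes does not grow with the number of columns.

```python
def build_recode_maps(rows, recode_cols):
    """Pass 1: a single scan collects distinct values for ALL recoded columns."""
    maps = {c: {} for c in recode_cols}
    for row in rows:
        for c in recode_cols:
            col_map = maps[c]
            if row[c] not in col_map:
                # Assumption: 1-based codes in order of first appearance.
                col_map[row[c]] = len(col_map) + 1
    return maps


def apply_recode_maps(rows, maps):
    """Pass 2: a single scan replaces each recoded value with its code."""
    return [
        {c: (maps[c][v] if c in maps else v) for c, v in row.items()}
        for row in rows
    ]


# Example data taken from Question 1 in this thread.
rows = [
    {"userID": 1, "county": "sanJose", "state": "CA"},
    {"userID": 2, "county": "santaClara", "state": "CA"},
    {"userID": 3, "county": "sanJose", "state": "CA"},
    {"userID": 4, "county": "alameda", "state": "CA"},
    {"userID": 5, "county": "minnepolis", "state": "MN"},
]

maps = build_recode_maps(rows, ["county", "state"])  # pass 1
encoded = apply_recode_maps(rows, maps)              # pass 2
```

Both functions touch every row exactly once, regardless of how many columns are listed under "recode".
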
Q2) For the distributed case, we currently have only broadcast-based transform apply operators. This means a job will run out of memory or fail if the recode maps do not fit into the memory of MR tasks or into Spark's broadcast buffers (limited to 2GB because recode maps are not partitioned). However, note that we're currently adding native support for frames (see SYSTEMML-554). As part of that work, we'll also change transform to exploit the distributed frame representation (SYSTEMML-569), which will already remove some of the existing restrictions. Beyond that, fully distributed transform operators are certainly possible too (via join-based plans).

Regards,
Matthias

From: Alok Singh/San Francisco/IBM@IBMUS
To: [email protected]
Date: 05/23/2016 10:32 PM
Subject: Fw: Questions/query about recode / transform in systemML

Hi,

Sending it to the dev list as per Matthias' suggestion.

Alok

----- Forwarded by Alok Singh/San Francisco/IBM on 05/23/2016 10:04 PM -----

From: Matthias Boehm/Almaden/IBM
To: Alok Singh/San Francisco/IBM@IBMUS
Cc: Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 09:02 PM
Subject: Re: Questions/query about recode / transform in systemML

Hi Alok,

would you mind posting this question on our dev mailing list such that other people also benefit from it? Thanks.

Regards,
Matthias

From: Alok Singh/San Francisco/IBM
To: Matthias Boehm/Almaden/IBM@IBMUS, Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 07:19 PM
Subject: Questions/query about recode / transform in systemML

Hi Matthias and Arvind,

I have some questions about the internals of the SystemML transform and how it scans the data.

Question 1

Let's consider an example data frame as follows (the first line is the schema):

userID, county, state
=====================
1, sanJose, CA
2, santaClara, CA
3, sanJose, CA
4, alameda, CA
5, minnepolis, MN

We can see that the unique values for county are {sanJose, santaClara, alameda, minnepolis} and for state are {CA, MN}.

Following the example in the doc at http://apache.github.io/incubator-systemml/files/dml-language-reference/data.spec.json, suppose the user passes in the spec file

"recode": ["county", "state"]

Then the question is how many passes SystemML will make over the data frame. In general, the recode algorithm would be:

for column in columns:
    step 1) find the unique values of the column
    step 2) apply the recode values to the column

Does this mean we would need 2*count(columns) passes over the data frame? If not, how does SystemML internally avoid making 2*count(columns) passes?

Question 2

Let's consider another data frame as follows (the first line is the schema):

random_string
=============
col1
dsfsdf
xcvxcv
sdf
etc
foo
Dummy

We can see that the number of unique values in this data frame will be almost the same as the number of rows. What if the number of rows is 10 trillion and the number of unique values in column random_string is also 10 trillion? In that case, the full set of unique values will not fit on one node. How does SystemML handle this?

Thanks for the inputs,
Alok
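
For the situation in Question 2, the reply at the top of this thread mentions join-based plans as a possible fully distributed alternative to broadcasting the recode map. The following Python sketch simulates that idea on one machine. It is an illustration under stated assumptions, not SystemML code: the function names, the CRC32-based partitioner, and the partition count are all hypothetical. Both the data and the recode map are partitioned by value, so each partition only needs its own slice of the map.

```python
import zlib
from collections import defaultdict


def partition_of(value, num_partitions):
    # Hypothetical partitioner: a stable hash of the string value.
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def shuffle(records, key_fn, num_partitions):
    """Group records by the partition of their key (simulates a shuffle)."""
    parts = defaultdict(list)
    for rec in records:
        parts[partition_of(key_fn(rec), num_partitions)].append(rec)
    return parts


def join_based_recode(values, recode_pairs, num_partitions=4):
    """Recode one column by joining (row_id, value) with (value, code)."""
    data_parts = shuffle(list(enumerate(values)), lambda r: r[1], num_partitions)
    map_parts = shuffle(recode_pairs, lambda r: r[0], num_partitions)

    codes_by_row = {}
    for p, recs in data_parts.items():
        # Only this partition's slice of the recode map is materialized here.
        local_map = dict(map_parts.get(p, []))
        for row_id, value in recs:
            codes_by_row[row_id] = local_map[value]

    # Restore the original row order.
    return [codes_by_row[i] for i in range(len(values))]


values = ["dsfsdf", "xcvxcv", "sdf", "foo", "sdf"]
recode_pairs = [("dsfsdf", 1), ("xcvxcv", 2), ("sdf", 3), ("foo", 4)]
recoded = join_based_recode(values, recode_pairs)
```

The trade-off compared with a broadcast is an extra shuffle of the data, in exchange for never holding the complete recode map in a single task.
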
