Hi
Sending this to the dev list, as per Matthias's suggestion.
Alok
----- Forwarded by Alok Singh/San Francisco/IBM on 05/23/2016 10:04 PM -----
From: Matthias Boehm/Almaden/IBM
To: Alok Singh/San Francisco/IBM@IBMUS
Cc: Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 09:02 PM
Subject: Re: Questions/query about recode / transform in systemML
Hi Alok,
would you mind posting this question on our dev mailing list so that
other people can also benefit from it? Thanks.
Regards,
Matthias
From: Alok Singh/San Francisco/IBM
To: Matthias Boehm/Almaden/IBM@IBMUS, Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 07:19 PM
Subject: Questions/query about recode / transform in systemML
Hi Matthias and Arvind,
I have some questions about the internals of the SystemML transform,
specifically about how the data is scanned.
Question 1
Let's consider an example dataframe as follows (the first line is the schema):
userID, county, state
=====================
1, sanJose, CA
2, santaClara, CA
3, sanJose, CA
4, alameda, CA
5, minnepolis, MN
We can see that the unique values for county are {sanJose, santaClara,
alameda, minnepolis} and for state are {CA, MN}.
Following the example in the doc at
http://apache.github.io/incubator-systemml/files/dml-language-reference/data.spec.json
suppose the user passes the following in the spec file:
"recode": ["county", "state"]
The question is then how many passes SystemML makes over the dataframe.
In general, the recode algorithm would be:
for column in columns:
    step 1) find the unique values of the column
    step 2) apply the recode values to the column
Does this mean we would need 2*count(columns) passes over the dataframe?
If not, how does SystemML internally avoid making 2*count(columns)
passes?
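
To make the question concrete, here is a minimal Python sketch of how the
per-column loop could be collapsed into two passes in total: one pass that
collects the distinct values of all recode columns at once, and one pass
that applies the codes. This is my own illustration, not SystemML code. The
function names and the rule of assigning 1-based codes in order of first
appearance are assumptions made for the example.

```python
# Toy rows from the example above (userID, county, state).
schema = ["userID", "county", "state"]
rows = [
    (1, "sanJose", "CA"),
    (2, "santaClara", "CA"),
    (3, "sanJose", "CA"),
    (4, "alameda", "CA"),
    (5, "minnepolis", "MN"),
]
recode_cols = ["county", "state"]


def build_recode_maps(rows, schema, recode_cols):
    """Pass 1: one scan that collects distinct values for all recode columns."""
    idx = {name: schema.index(name) for name in recode_cols}
    maps = {name: {} for name in recode_cols}
    for row in rows:
        for name, i in idx.items():
            # Assumption: codes are 1-based, assigned in order of first appearance.
            maps[name].setdefault(row[i], len(maps[name]) + 1)
    return maps


def apply_recode_maps(rows, schema, maps):
    """Pass 2: one scan that replaces each recoded column value by its code."""
    idx = {name: schema.index(name) for name in maps}
    out = []
    for row in rows:
        new_row = list(row)
        for name, i in idx.items():
            new_row[i] = maps[name][row[i]]
        out.append(tuple(new_row))
    return out


maps = build_recode_maps(rows, schema, recode_cols)
recoded = apply_recode_maps(rows, schema, maps)
```

With this structure the number of scans is 2 regardless of how many columns
are recoded, at the cost of holding one dictionary per recode column in
memory. Whether SystemML does something similar is exactly what I would like
to understand.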
Question 2
Let's consider another dataframe as follows (the first line is the schema):
random_string
===========
col1
dsfsdf
xcvxcv
sdf
etc
foo
Dummy
We can see that the number of unique values in this dataframe will be
almost the same as the number of rows.
What if the number of rows is 10 trillion and the number of unique values
in the column random_string is also 10 trillion?
In that case, the full set of unique values will not fit on one node. How
does SystemML handle this?
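
For reference, one generic way to avoid holding the whole dictionary on one
node is to hash-partition the distinct values, so that each partition owns
only a slice of the recode map, and then shift the local codes by prefix-sum
offsets to make them globally unique. The sketch below simulates that on a
single machine in Python. It is an assumption about how this could be done
in general, not a description of what SystemML does, and all function names
are hypothetical.

```python
import zlib


def partition_of(value, num_partitions):
    """Stable hash, so the same string always lands in the same partition."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def build_partitioned_maps(values, num_partitions):
    """Each partition keeps only the distinct values hashed to it (local 1-based codes)."""
    local_maps = [{} for _ in range(num_partitions)]
    for v in values:
        m = local_maps[partition_of(v, num_partitions)]
        m.setdefault(v, len(m) + 1)
    return local_maps


def partition_offsets(local_maps):
    """Prefix sums of partition sizes, used to turn local codes into global codes."""
    offsets, total = [], 0
    for m in local_maps:
        offsets.append(total)
        total += len(m)
    return offsets


def recode(value, local_maps, offsets):
    """Global code = offset of the owning partition + local code within it."""
    p = partition_of(value, len(local_maps))
    return offsets[p] + local_maps[p][value]


values = ["col1", "dsfsdf", "xcvxcv", "sdf", "etc", "foo", "Dummy", "foo"]
local_maps = build_partitioned_maps(values, num_partitions=3)
offsets = partition_offsets(local_maps)
codes = {v: recode(v, local_maps, offsets) for v in values}
```

In a real cluster each entry of local_maps would live on a different node,
and applying the codes would become a distributed join between the data and
the partitioned dictionary rather than a local lookup.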
Thanks for your input,
Alok