transform in systemML

Matthias Boehm Tue, 24 May 2016 00:19:49 -0700

Thanks for the question, Alok. A couple of comments:

Q1) Independent of the number of columns, we will always do two passes, one
to compute the recode maps (distinct values) and one to apply the recode
maps to your data.
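
To illustrate why two passes suffice regardless of the number of columns, here
is a minimal Python sketch using the small county/state table from the original
question quoted further down. This is not SystemML code: the function names
build_recode_maps and apply_recode_maps are hypothetical, and assigning codes
in first-seen order starting at 1 is an assumption made for the example.

```python
def build_recode_maps(rows, recode_cols):
    """Pass 1: a single scan collects distinct values for ALL recoded columns."""
    maps = {col: {} for col in recode_cols}
    for row in rows:
        for col in recode_cols:
            col_map = maps[col]
            if row[col] not in col_map:
                # Assumption: codes are 1-based and assigned in first-seen order.
                col_map[row[col]] = len(col_map) + 1
    return maps


def apply_recode_maps(rows, maps):
    """Pass 2: a single scan replaces each recoded value by its code."""
    return [
        {col: maps[col][val] if col in maps else val for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"userID": 1, "county": "sanJose", "state": "CA"},
    {"userID": 2, "county": "santaClara", "state": "CA"},
    {"userID": 3, "county": "sanJose", "state": "CA"},
    {"userID": 4, "county": "alameda", "state": "CA"},
    {"userID": 5, "county": "minnepolis", "state": "MN"},
]

maps = build_recode_maps(rows, ["county", "state"])  # scan 1
recoded = apply_recode_maps(rows, maps)              # scan 2
```

Each scan loops over the recoded columns inside the row loop, so adding more
columns increases the work per row but not the number of passes over the data.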


Q2) For the distributed case, we currently have only broadcast-based
transform apply operators. This means a job will run out of memory or fail
if the recode maps do not fit into MR tasks or Spark's broadcast buffers
(2GB, because recode maps are not partitioned). However, note that we're
currently in the process of adding native support for frames (see
SYSTEMML-554). As part of it, we'll also change transform to exploit the
distributed frame representations (SYSTEMML-569), which will already remove
some of the existing restrictions. Further, fully distributed transform
operators are certainly possible too (via join-based plans).
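
As a rough illustration of the difference between the two strategies, the
following Python sketch contrasts a broadcast-style apply with a join-style
apply. It is only a single-process model, not SystemML's implementation: the
function names are hypothetical, and reading "join-based plan" as a
hash-partitioned join on the column value is an assumption.

```python
def apply_broadcast(partitions, recode_map):
    """Every task receives the WHOLE recode map, so it must fit in one task."""
    return [[(rid, recode_map[val]) for rid, val in part] for part in partitions]


def apply_join(data, map_entries, num_parts):
    """Hash-partition data and map on the value; each task sees one map slice."""
    data_parts = [[] for _ in range(num_parts)]
    map_parts = [{} for _ in range(num_parts)]
    for val, code in map_entries:
        map_parts[hash(val) % num_parts][val] = code
    for rid, val in data:
        data_parts[hash(val) % num_parts].append((rid, val))
    out = []
    for data_part, map_part in zip(data_parts, map_parts):
        # Equal values hash to the same partition, so the local slice suffices.
        out.extend((rid, map_part[val]) for rid, val in data_part)
    return sorted(out)  # restore row order by row id


data = [(1, "sanJose"), (2, "santaClara"), (3, "sanJose"),
        (4, "alameda"), (5, "minnepolis")]
recode_map = {"sanJose": 1, "santaClara": 2, "alameda": 3, "minnepolis": 4}

broadcast_out = apply_broadcast([data[:3], data[3:]], recode_map)
join_out = apply_join(data, recode_map.items(), num_parts=3)
```

Both variants produce the same codes. The join variant never needs the full
map in a single task, at the price of shuffling the data by value.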

Regards,
Matthias



From:   Alok Singh/San Francisco/IBM@IBMUS
To:     [email protected]
Date:   05/23/2016 10:32 PM
Subject:        Fw: Questions/query about recode / transform in systemML



Hi

Sending it to the dev list as per Matthias's suggestion.

Alok

----- Forwarded by Alok Singh/San Francisco/IBM on 05/23/2016 10:04 PM
-----

From:   Matthias Boehm/Almaden/IBM
To:     Alok Singh/San Francisco/IBM@IBMUS
Cc:     Arvind Surve/San Jose/IBM@IBMUS
Date:   05/23/2016 09:02 PM
Subject:        Re: Questions/query about recode / transform in systemML


Hi Alok,

would you mind posting this question on our dev mailing list such that
other people also benefit from it? Thanks.


Regards,
Matthias



From:   Alok Singh/San Francisco/IBM
To:     Matthias Boehm/Almaden/IBM@IBMUS, Arvind Surve/San Jose/IBM@IBMUS
Date:   05/23/2016 07:19 PM
Subject:        Questions/query about recode / transform in systemML



Hi Matthias and Arvind.

I have some questions about the internals and how the scan happens in
SystemML transform.


Question 1

Let's consider an example dataframe as follows (the first line is the schema):

userID, county, state
=====================
1, sanJose, CA
2, santaClara, CA
3, sanJose, CA
4, alameda, CA
5, minnepolis, MN


We can see that the unique values for county are {sanJose, santaClara,
alameda, minnepolis} and for state are {CA, MN}.

So, following the example in the doc at
http://apache.github.io/incubator-systemml/files/dml-language-reference/data.spec.json


the user passes in a spec file with
"recode": ["county", "state"]

Then the question is how many passes SystemML will make over the dataframe,
i.e., in general the recode algorithm would be

for column in columns:
   step 1) find the unique values of the column
   step 2) apply the recode values to the column


So does that mean we would need 2*count(columns) passes over the dataframe?

If not, how does SystemML internally avoid making 2*count(columns) passes?

Question 2

Let's consider another dataframe as follows (the first line is the schema):

random_string
===========
col1
dsfsdf
xcvxcv
sdf
etc
foo
Dummy

We can see that the number of unique values for this dataframe will be almost
the same as the number of rows.
What if the number of rows is 10 trillion and the number of unique values for
column random_string is also 10 trillion?
In that case, the full set of unique values will not fit on one node. How
does SystemML handle that?


Thanks for the inputs
Alok
