I need to create a single table by selecting one column from each of thousands of 
files. The columns are all of the same type and have the same number of rows and 
the same row names. I am currently using join, and I get an OOM on a mega-mem 
cluster with 2.8 TB of memory.
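
For context, the shape of what I am doing now is roughly the PySpark sketch below 
(the file paths and column names are placeholders, not my real ones): select the 
row key plus the one column I need from each file, then fold a chain of joins over 
the list.

```
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- the real job has thousands of these files.
paths = ["data/sample_0001.parquet", "data/sample_0002.parquet"]

dfs = []
for i, path in enumerate(paths):
    df = (spark.read.parquet(path)
          .select("row_name", "value")                 # row key + the one column I need
          .withColumnRenamed("value", f"value_{i}"))   # make the column name unique
    dfs.append(df)

# Fold a join over all the per-file frames; this is the step that OOMs.
wide = reduce(lambda left, right: left.join(right, on="row_name"), dfs)
```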

Does Spark have something like R's cbind()? "Take a sequence of vector, matrix or 
data-frame arguments and combine by columns or rows, respectively."

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind

Digging through the Spark documentation, I found a UDF example:
https://spark.apache.org/docs/latest/sparkr.html#dapply

```
# Convert waiting time from hours to seconds.
# Note that we can apply UDF to DataFrame.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
head(collect(df1))
##  eruptions waiting waiting_secs
##1     3.600      79         4740
##2     1.800      54         3240
##3     3.333      74         4440
##4     2.283      62         3720
##5     4.533      85         5100
##6     2.883      55         3300
```

I wonder if this is just a wrapper around join? If so, it is probably not going 
to help me out.

Also, I would prefer to work in Python.
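
For what it's worth, the way I would write that dapply example in Python is 
something like the mapInPandas sketch below (assuming Spark 3.0+; the two rows 
stand in for the faithful dataset used in the docs example).

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A couple of rows standing in for the faithful dataset from the docs example.
df = spark.createDataFrame(
    [(3.600, 79.0), (1.800, 54.0)],
    schema="eruptions double, waiting double",
)

def add_waiting_secs(batches):
    # mapInPandas hands each partition to this function as an iterator of
    # pandas DataFrames; add the derived column and yield each batch back.
    for pdf in batches:
        pdf["waiting_secs"] = pdf["waiting"] * 60
        yield pdf

df1 = df.mapInPandas(
    add_waiting_secs,
    schema="eruptions double, waiting double, waiting_secs double",
)
df1.show()
```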

Any thoughts?

Kind regards

Andy
