I need to build a single table by selecting one column from each of thousands of files. The columns are all of the same type and have the same number of rows and the same row names. I am currently combining them with joins, and I get OOM errors even on a mega-mem cluster with 2.8 TB of memory.
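For reference, here is roughly what I am doing now, as a minimal PySpark sketch (the file paths, key column, and value column names below are placeholders, not my real schema):

```
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical layout: every file has a shared key column "id"
# plus one value column. Paths and names are made up.
paths = [f"/data/part_{i}.parquet" for i in range(5000)]

def one_column(path, i):
    # Keep the key and rename the value column so it stays unique after joining.
    return (spark.read.parquet(path)
                 .select("id", "value")
                 .withColumnRenamed("value", f"value_{i}"))

dfs = [one_column(p, i) for i, p in enumerate(paths)]

# Fold all the single-column frames together on the key column.
# Thousands of chained joins: this is the step that runs out of memory.
wide = reduce(lambda a, b: a.join(b, on="id"), dfs)
```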
Does Spark have something like R's cbind()? From the R documentation: “Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively.” https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind

Digging through the Spark documentation, I found a UDF example that uses cbind(): https://spark.apache.org/docs/latest/sparkr.html#dapply

```
# Convert waiting time from minutes to seconds.
# Note that we can apply a UDF to a DataFrame.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
head(collect(df1))
##  eruptions waiting waiting_secs
##1     3.600      79         4740
##2     1.800      54         3240
##3     3.333      74         4440
##4     2.283      62         3720
##5     4.533      85         5100
##6     2.883      55         3300
```

I wonder whether this is just a wrapper around join under the hood? If so, it is probably not going to help me. Also, I would prefer to work in Python.

Any thoughts?

Kind regards,
Andy
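P.S. For concreteness, this is the single-machine Python equivalent of the cbind behaviour I am after (plain pandas, so it only works when everything fits in one machine's memory, which it does not at my scale; the paths and column names are again placeholders):

```
import pandas as pd

# Hypothetical paths and column names, as above.
paths = [f"/data/part_{i}.parquet" for i in range(5000)]

# Read one value column per file and bind them side by side on the index.
# Because every file has the same row count and row names, this is a pure
# positional/label concat: no join and no shuffle involved.
cols = [
    pd.read_parquet(p, columns=["value"]).rename(columns={"value": f"value_{i}"})
    for i, p in enumerate(paths)
]
wide = pd.concat(cols, axis=1)
```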