Dear Sam,

You are assuming that the data fits in the memory of your local machine. You start from a dataframe, which can potentially be very large, and then collect the data into local lists. Keep in mind that the number of distinct elements in a column may also be very large (depending on the application). I suggest working on a solution that assumes the number of distinct values is large as well. Thus, you should keep your data in dataframes or RDDs, and store the results as files (CSV, Parquet, etc.).
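
As a rough sketch of what I mean (not a tested solution), the loop could keep each column's distinct values as a dataframe and write them out from the executors instead of calling collectAsList(). Since you also asked about JSON, the sketch writes JSON files. The output location "/pathToOutput", the class name and the outputDir helper are placeholders of mine, and I assume the column names contain no backticks:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DistinctPerColumn {

    // Hypothetical helper: build an output directory from a column name,
    // replacing characters that are awkward in paths with underscores.
    static String outputDir(String base, String column) {
        return base + "/" + column.replaceAll("[^A-Za-z0-9_-]", "_");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DistinctPerColumn")
                .getOrCreate();

        // Same input as in the original question.
        Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", "true")
                .load("/pathToCSV");

        for (String name : df.columns()) {
            // Backticks make Spark read the name literally, even if it contains dots.
            // Assumption: the name itself contains no backticks.
            String quoted = "`" + name + "`";

            // The distinct values stay distributed; nothing is pulled to the driver.
            Dataset<Row> distinctValues = df
                    .select(col(quoted))
                    .where(col(quoted).isNotNull())
                    .distinct();

            // One output directory per column; "/pathToOutput" is a placeholder.
            distinctValues.write()
                    .mode("overwrite")
                    .json(outputDir("/pathToOutput", name));
        }

        spark.stop();
    }
}
```

This still runs one Spark job per column, so it is not necessarily faster than your version, but it no longer depends on the distinct values fitting in the driver's memory.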

a.p.


On 10/2/23 23:40, sam smith wrote:
I want to get the distinct values of each column in a List (is it good practice to use a List here?) that contains the column name as its first element and the column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast):

    List<List<String>> finalList = new ArrayList<List<String>>();
    Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
    String[] columnNames = df.columns();
    for (int i = 0; i < columnNames.length; i++) {
        // First element is the column name, the rest are its distinct values.
        List<String> columnList = new ArrayList<String>();
        columnList.add(columnNames[i]);
        List<Row> columnValues = df
            .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
            .select(columnNames[i])
            .distinct()
            .collectAsList();
        for (int j = 0; j < columnValues.size(); j++)
            columnList.add(columnValues.get(j).apply(0).toString());
        finalList.add(columnList);
    }

How can I improve this?

Also, can I get the results in JSON format?

--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email:papad...@csd.auth.gr
twitter: @papadopoulos_ap
web:http://datalab.csd.auth.gr/~apostol
