Re: How to improve efficiency of this piece of code (returning distinct column values)

Apostolos N. Papadopoulos Fri, 10 Feb 2023 13:46:58 -0800

Dear Sam,

You are assuming that the data fits in the memory of your local machine. You are starting from a dataframe, which can potentially be very large, and then storing the data in local lists. Keep in mind that the number of distinct elements in a column may be very large (depending on the application). I suggest working on a solution that assumes the number of distinct values is also large. Thus, you should keep your data in dataframes or RDDs, and store them as CSV files, Parquet, etc.
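
As a rough illustration of the suggestion above, here is a minimal sketch that keeps the distinct values of each column in a dataframe and writes them to storage instead of collecting them on the driver. The class name, the method name, the outputDir parameter and the one-directory-per-column layout are my own assumptions for the example, not something from your code. It also assumes that the column names are plain identifiers (no dots or characters that are invalid in a path).

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;

// Hypothetical helper class, named only for this example.
final class DistinctColumnValues {

    // For every column of df, computes its distinct non-null values and
    // writes them under outputDir/<columnName> in the given format
    // (for example "parquet", "csv" or "json").
    // Nothing is collected to the driver, so the number of distinct
    // values is not limited by the memory of the local machine.
    static void writeDistinctValues(Dataset<Row> df, String outputDir, String format) {
        for (String name : df.columns()) {
            Dataset<Row> distinctValues = df
                    .select(col(name))
                    .where(col(name).isNotNull())
                    .distinct();

            distinctValues.write()
                    .mode("overwrite")   // replace the output of a previous run
                    .format(format)
                    .save(outputDir + "/" + name);
        }
    }
}
```

With the dataframe from your snippet, a call could look like DistinctColumnValues.writeDistinctValues(df, "/pathToOutput", "parquet"), where "/pathToOutput" is a placeholder. Passing "json" as the format makes Spark write one JSON object per line, which may be enough for the JSON question. Each column still triggers its own Spark job, so caching df before the loop might help if the CSV is expensive to read.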


a.p.


On 10/2/23 23:40, sam smith wrote:

I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and the column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast):

    List<List<String>> finalList = new ArrayList<List<String>>();
    Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
    String[] columnNames = df.columns();
    for (int i = 0; i < columnNames.length; i++) {
        List<String> columnList = new ArrayList<String>();
        columnList.add(columnNames[i]);
        List<Row> columnValues = df
            .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
            .select(columnNames[i])
            .distinct()
            .collectAsList();
        for (int j = 0; j < columnValues.size(); j++)
            columnList.add(columnValues.get(j).apply(0).toString());
        finalList.add(columnList);
    }

How to improve this?

Also, can I get the results in JSON format?


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email:papad...@csd.auth.gr
twitter: @papadopoulos_ap
web:http://datalab.csd.auth.gr/~apostol
