Wow! That was exactly what I was looking for. I hadn't even thought of File.length, and thanks to your tips the solution is now on a silver platter in front of me.
Thank you very much!

On Sun, 19 Jun 2022 at 12:03, marc nicole <mk1853...@gmail.com> wrote:

> Reasoning in terms of files (vs. datasets, as I first understood this
> question), I think this is more adequate in Spark:
>
>     org.apache.spark.util.Utils.getFileLength(new File("filePath"), null)
>
> It will yield the same result as
>
>     new File("filePath").length()
>
> On Sun, 19 Jun 2022 at 11:11, Enrico Minack <i...@enrico.minack.dev> wrote:
>
>> Maybe a
>>
>>     .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)
>>
>> might be faster than the
>>
>>     .distinct.as[String]
>>
>> Enrico
>>
>> On 19.06.22 at 08:59, Enrico Minack wrote:
>>
>> Given you already know your input files (input_file_name), why not get
>> their sizes and sum them up?
>>
>>     import java.io.File
>>     import java.net.URI
>>     import org.apache.spark.sql.functions.input_file_name
>>
>>     ds.select(input_file_name.as("filename"))
>>       .distinct.as[String]
>>       .map(filename => new File(new URI(filename).getPath).length)
>>       .select(sum($"value"))
>>       .show()
>>
>> Enrico
>>
>> On 19.06.22 at 03:16, Yong Walt wrote:
>>
>>     import java.io.File
>>     val someFile = new File("somefile.txt")
>>     val fileSize = someFile.length
>>
>> This one?
>>
>> On Sun, Jun 19, 2022 at 4:33 AM mbreuer <msbre...@gmail.com> wrote:
>>
>>> Hello Community,
>>>
>>> I am working on optimizations for file sizes and number of files. In the
>>> data frame there is a function input_file_name which returns the file
>>> name. I am missing a counterpart to get the size of the file. Just the
>>> size, like "ls -l" returns. Is there something like that?
>>>
>>> Kind regards,
>>> Markus
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
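For anyone landing on this thread later: the per-file lookup at the heart of Enrico's suggestion is plain `java.io.File#length` applied to the path part of the `file:` URI that `input_file_name` returns. A minimal plain-Scala sketch of just that step (no Spark required; the object and file names are illustrative, not from the thread):

```scala
import java.io.File
import java.net.URI
import java.nio.file.Files

object FileSizeSketch {
  // Turn a "file:/path/to/part-00000" URI (the form input_file_name
  // produces for local files) into a size in bytes. Note that
  // File.length returns 0 for a non-existent file rather than failing.
  def sizeOf(fileUri: String): Long =
    new File(new URI(fileUri).getPath).length

  def main(args: Array[String]): Unit = {
    // Two temp files with known contents stand in for input files.
    val f1 = Files.createTempFile("part-", ".txt")
    val f2 = Files.createTempFile("part-", ".txt")
    Files.write(f1, "hello".getBytes)  // 5 bytes
    Files.write(f2, "spark!".getBytes) // 6 bytes

    val uris  = Seq(f1.toUri.toString, f2.toUri.toString)
    val total = uris.map(sizeOf).sum
    println(total) // 11
  }
}
```

Inside a Spark job this is the function mapped over the distinct filenames; note it only works where the files are locally readable (e.g. a `file://` filesystem), since `java.io.File` knows nothing about HDFS or object stores.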