Re: input file size

2022-06-19 Thread Markus Breuer
Wow! That was exactly what I was looking for. I hadn't even thought of File.length, and thanks to your tips the solution is now on a silver platter in front of me. Thank you very much! On Sun., 19 June 2022 at 12:03, marc nicole wrote: > Reasoning in terms of files (vs. datasets, as I first thought

Re: input file size

2022-06-19 Thread marc nicole
Reasoning in terms of files (vs. datasets, as I first thought of this question), I think this is more appropriate in Spark: > org.apache.spark.util.Utils.getFileLength(new File("filePath"), null); It will yield the same result as > new File("filePath").length(); On Sun., 19 June 2022 at 11:11, Enrico Minack wrote

Re: input file size

2022-06-19 Thread Enrico Minack
Maybe a .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty) might be faster than the .distinct.as[String]. Enrico On 19.06.22 at 08:59, Enrico Minack wrote: Given you already know your input files (input_file_name), why not get their size and
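For context, a minimal sketch of the two variants, assuming ds is the dataset read from the input files and that spark.implicits._ is in scope (neither name appears in the thread):

    import org.apache.spark.sql.functions.input_file_name
    import spark.implicits._

    // Variant 1: shuffle-based deduplication of one file name per input row.
    val names1 = ds.select(input_file_name().as("filename")).distinct.as[String]

    // Variant 2: keep at most one file name per partition, then deduplicate
    // the much smaller result; a file split across partitions would otherwise
    // be counted more than once.
    val names2 = ds.select(input_file_name().as("filename"))
      .as[String]
      .mapPartitions(it => if (it.hasNext) Iterator(it.next()) else Iterator.empty)
      .distinct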

Re: input file size

2022-06-19 Thread Gourav Sengupta
Hi, Just so that we understand the intention: why do you need to know the file size? Are you not using a splittable file format? If you use Spark Streaming to read the files, just once, then you will be able to get the metadata of the files, I believe. Regards, Gourav Sengupta On Sun, Jun

Re: input file size

2022-06-19 Thread Enrico Minack
Given you already know your input files (input_file_name), why not get their size and sum this up?

    import java.io.File
    import java.net.URI
    import org.apache.spark.sql.functions.input_file_name

    ds.select(input_file_name.as("filename"))
      .distinct.as[String]
      .map(filename => new
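The preview cuts the snippet off at this point. A plausible completion of the idea, assuming local file: URIs so that java.io.File can resolve them (this completion is mine, not verbatim from the message; ds and spark.implicits._ are assumed as above):

    import java.io.File
    import java.net.URI
    import org.apache.spark.sql.functions.input_file_name
    import spark.implicits._

    // Map each distinct input file name to its on-disk size and sum the sizes.
    val totalBytes = ds.select(input_file_name().as("filename"))
      .distinct.as[String]
      .map(filename => new File(new URI(filename).getPath).length)
      .collect()
      .sum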

Re: input file size

2022-06-18 Thread Yong Walt
import java.io.File
val someFile = new File("somefile.txt")
val fileSize = someFile.length

This one? On Sun, Jun 19, 2022 at 4:33 AM mbreuer wrote: > Hello Community, > > I am working on optimizations for file sizes and number of files. In the > data frame there is a function input_file_name

Re: input file size

2022-06-18 Thread marc nicole
Hi, I found this (https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html), which may be helpful; I use Java: > org.apache.spark.util.SizeEstimator.estimate(dataset); On Sat., 18 June 2022 at 22:33, mbreuer wrote: > Hello Community, > > I am working on
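For completeness, a small sketch of what SizeEstimator does; note that it estimates the in-memory footprint of a JVM object graph, not the on-disk size of the input files, so it answers a slightly different question than the one asked below:

    import org.apache.spark.util.SizeEstimator

    // Estimated size, in bytes, of the given object graph as held in memory.
    // This is not the sum of the underlying input file sizes.
    val estimatedBytes: Long = SizeEstimator.estimate(ds)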

input file size

2022-06-18 Thread mbreuer
Hello Community, I am working on optimizations for file sizes and the number of files. In the data frame there is a function input_file_name which returns the file name. I am missing a counterpart to get the size of the file. Just the size, as "ls -l" returns it. Is there something like that? Kind
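There is no built-in size counterpart to input_file_name. One hedged sketch of looking the sizes up yourself uses the Hadoop FileSystem API (not taken from this thread), which also works for HDFS and other non-local paths; df and spark are assumed names here:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.functions.input_file_name
    import spark.implicits._

    // Collect the distinct input file names seen by the data frame.
    val fileNames = df.select(input_file_name().as("filename"))
      .distinct.as[String].collect()

    // Ask the Hadoop FileSystem (on the driver) for the length of each file.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    val totalBytes = fileNames.map { name =>
      val path = new Path(name)
      path.getFileSystem(hadoopConf).getFileStatus(path).getLen
    }.sum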

Spark 1.3.0: ExecutorLostFailure depending on input file size

2015-08-13 Thread Wyss Michael (wysm)
Hi, I've been at this problem for a few days now and haven't been able to solve it. I'm hoping that I'm missing something that you aren't! I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a