Wow! That was exactly what I was looking for. I hadn't even thought of
File.length and thanks to your tips, the solution is now on a silver
platter in front of me.
Thank you very much!
On Sun, Jun 19, 2022 at 12:03 PM, marc nicole wrote:
Reasoning in terms of files (vs. datasets, as I first thought of this
question), I think this is more appropriate in Spark:
> org.apache.spark.util.Utils.getFileLength(new File("filePath"),null);
It will yield the same result as
> new File("filePath").length();
On Sun, Jun 19, 2022 at 11:11 AM, Enrico Minack wrote:
Maybe a
.as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else
Iterator.empty)
might be faster than the
.distinct.as[String]
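A rough Scala sketch of how that substitution could slot into the full
pipeline (assuming a SparkSession named spark, a Dataset named ds, and
input files on a local filesystem so java.io.File can read their length):

  import java.io.File
  import java.net.URI
  import org.apache.spark.sql.functions.input_file_name
  import spark.implicits._

  // Take one file name per partition instead of a cluster-wide distinct;
  // the small collected array is still deduplicated on the driver, since
  // several partitions of a splittable file can report the same name.
  val totalBytes = ds
    .select(input_file_name.as("filename"))
    .as[String]
    .mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)
    .collect()
    .distinct
    .map(name => new File(new URI(name)).length)
    .sum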
Enrico
On 19.06.22 08:59, Enrico Minack wrote:
> Given you already know your input files (input_file_name), why not get
> their sizes and sum them up?
Hi,
Just so that we understand the intention: why do you need to know the
file size? Are you not using a splittable file format?
If you use Spark Streaming to read the files, running it just once, then
you will be able to get the metadata of the files, I believe.
Regards,
Gourav Sengupta
On Sun, Jun
Given you already know your input files (input_file_name), why not get
their sizes and sum them up?
import java.io.File
import java.net.URI
import org.apache.spark.sql.functions.input_file_name

ds.select(input_file_name.as("filename"))
  .distinct.as[String]
  .map(filename => new File(new URI(filename)).length)
  .reduce(_ + _)
import java.io.File
val someFile = new File("somefile.txt")
val fileSize = someFile.length
This one?
On Sun, Jun 19, 2022 at 4:33 AM mbreuer wrote:
> Hello Community,
>
> I am working on optimizations for file sizes and number of files. In the
> data frame there is a function input_file_name
Hi,
I found this (
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html)
which may be helpful; I use Java:
> org.apache.spark.util.SizeEstimator.estimate(dataset);
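A minimal Scala sketch of how that call might be used (the variable name
dataset is assumed from the line above); note that SizeEstimator reports
the estimated in-memory size of the object graph, not the on-disk size of
the underlying files:

  import org.apache.spark.util.SizeEstimator

  // Estimated JVM heap footprint of the driver-side object, in bytes.
  val estimatedBytes: Long = SizeEstimator.estimate(dataset)
  println(s"Estimated in-memory size: $estimatedBytes bytes")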
On Sat, Jun 18, 2022 at 10:33 PM, mbreuer wrote:
Hello Community,
I am working on optimizations for file sizes and number of files. In the
data frame there is a function input_file_name which returns the file
name. I am missing a counterpart to get the size of the file. Just the size,
like "ls -l" returns. Is there something like that?
Kind regards
Hi
I've been at this problem for a few days now and wasn't able to solve it.
I'm hoping that I'm missing something that you don't!
I'm trying to run a simple Python application on a 2-node cluster I set up
in standalone mode: a master and a worker, where the master also takes on
the role of a