Wow! That was exactly what I was looking for. I hadn't even thought of
File.length, and thanks to your tips the solution is now on a silver
platter in front of me.

Thank you very much!

On Sun., June 19, 2022 at 12:03, marc nicole <mk1853...@gmail.com> wrote:

> Reasoning in terms of files (rather than datasets, as I first read this
> question), I think this is more appropriate in Spark:
>
>> org.apache.spark.util.Utils.getFileLength(new File("filePath"), null);
>
> It will yield the same result as
>
>> new File("filePath").length();
>
>
> On Sun., June 19, 2022 at 11:11, Enrico Minack <i...@enrico.minack.dev>
> wrote:
>
>> Maybe a
>>
>>   .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)
>>
>> might be faster than the
>>
>>   .distinct.as[String]
>>
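>> For context, a minimal end-to-end sketch of that replacement (assuming
>> each partition reads from a single file, so one filename per partition
>> suffices; the distinct on the much smaller intermediate result still
>> guards against files that span several partitions):
>>
>>   import java.io.File
>>   import java.net.URI
>>   import org.apache.spark.sql.functions.{input_file_name, sum}
>>   import spark.implicits._  // spark = the active SparkSession
>>
>>   ds.select(input_file_name.as("filename"))
>>     .as[String]
>>     // keep at most one filename per partition, avoiding a full shuffle over all rows
>>     .mapPartitions(it => if (it.hasNext) Iterator(it.next()) else Iterator.empty)
>>     // a file read by several partitions can still appear more than once
>>     .distinct
>>     .map(filename => new File(new URI(filename).getPath).length)
>>     .select(sum($"value"))
>>     .show()
>>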
>>
>> Enrico
>>
>>
>> On 19.06.22 at 08:59, Enrico Minack wrote:
>>
>> Given that you already know your input files (input_file_name), why not
>> get their sizes and sum them up?
>>
>> import java.io.File
>> import java.net.URI
>> import org.apache.spark.sql.functions.{input_file_name, sum}
>> import spark.implicits._  // spark = the active SparkSession
>>
>> ds.select(input_file_name.as("filename"))
>>   .distinct.as[String]
>>   .map(filename => new File(new URI(filename).getPath).length)
>>   .select(sum($"value"))
>>   .show()
>>
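>> Note that java.io.File only works when the input files live on a
>> filesystem that is locally visible, since the URI is resolved to a local
>> path. For HDFS, S3 and other Hadoop-compatible stores, a sketch along
>> these lines, using the standard Hadoop FileSystem API and assuming the
>> paths are reachable through the session's Hadoop configuration, should
>> work instead (untested):
>>
>>   import org.apache.hadoop.fs.Path
>>
>>   val hadoopConf = spark.sparkContext.hadoopConfiguration
>>   val totalBytes = ds.select(input_file_name.as("filename"))
>>     .distinct.as[String]
>>     .collect()  // the set of distinct input files is typically small
>>     .map { filename =>
>>       val path = new Path(filename)
>>       // file size in bytes, as reported by the underlying filesystem
>>       path.getFileSystem(hadoopConf).getFileStatus(path).getLen
>>     }
>>     .sum
>>   println(totalBytes)
>>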
>>
>> Enrico
>>
>>
>> On 19.06.22 at 03:16, Yong Walt wrote:
>>
>> import java.io.File
>>
>> val someFile = new File("somefile.txt")
>> val fileSize = someFile.length
>>
>> This one?
>>
>>
>> On Sun, Jun 19, 2022 at 4:33 AM mbreuer <msbre...@gmail.com> wrote:
>>
>>> Hello Community,
>>>
>>> I am working on optimizing file sizes and the number of files. The data
>>> frame API has a function, input_file_name, which returns the file name,
>>> but I am missing a counterpart that returns the size of the file. Just
>>> the size, as "ls -l" reports it. Is there something like that?
>>>
>>> Kind regards,
>>> Markus
>>>
>>>
>>>
>>
>>
