Re: input file size

2022-06-19 Thread Gourav Sengupta
Hi,

Just so that we understand the intention: why do you need to know the
file size? Are you not using a splittable file format?

If you use Spark streaming to read the files, triggered just once, then I
believe you will be able to get the metadata of the files.
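
If I understand that idea correctly, a rough, untested sketch might look like
this (the schema, paths, and checkpoint location are placeholders):

import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

// Streaming file sources require a schema up front (placeholder here).
val schema = new StructType().add("value", StringType)

val query = spark.readStream
  .schema(schema)
  .json("s3://bucket/input")                      // placeholder path
  .withColumn("filename", input_file_name())      // tag rows with their source file
  .writeStream
  .format("parquet")
  .option("path", "s3://bucket/output")           // placeholder path
  .option("checkpointLocation", "s3://bucket/checkpoint")
  .trigger(Trigger.Once())                        // process what is there, then stop
  .start()

query.awaitTermination()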



Regards,
Gourav Sengupta

On Sun, Jun 19, 2022 at 8:00 AM Enrico Minack wrote:

> Given you already know your input files (input_file_name), why not get
> their sizes and sum them up?
>
> import java.io.File
> import java.net.URI
> import org.apache.spark.sql.functions.input_file_name
> ds.select(input_file_name.as("filename"))
>   .distinct.as[String]
>   .map(filename => new File(new URI(filename).getPath).length)
>   .select(sum($"value"))
>   .show()
>
>
> Enrico
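
A note on the snippet above: java.io.File only resolves local paths. For
input files on HDFS or S3, a Hadoop FileSystem variant of the same idea
might look like this (an untested sketch; the distinct names are collected
to the driver, since the Hadoop Configuration is not serializable):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.input_file_name

// assumes spark.implicits._ is in scope, as in the snippet above
val hadoopConf = spark.sparkContext.hadoopConfiguration
val totalBytes = ds.select(input_file_name.as("filename"))
  .distinct.as[String]
  .collect()
  .map { name =>
    val path = new Path(name)
    // look up each file's status on whatever filesystem its URI points to
    path.getFileSystem(hadoopConf).getFileStatus(path).getLen
  }
  .sum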
>
>
> On 19.06.22 at 03:16, Yong Walt wrote:
>
> import java.io.File
> val someFile = new File("somefile.txt")
> val fileSize = someFile.length
>
> This one?
>
>
> On Sun, Jun 19, 2022 at 4:33 AM mbreuer  wrote:
>
>> Hello Community,
>>
>> I am working on optimizations for file sizes and number of files. In the
>> data frame there is a function input_file_name which returns the file
>> name. I am missing a counterpart to get the size of the file. Just the size,
>> like "ls -l" returns. Is there something like that?
>>
>> Kind regards,
>> Markus
>>
>>
>>
>>
>


Re: input file size

2022-06-19 Thread Enrico Minack

Maybe a

  .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)

might be faster than the

  .distinct.as[String]
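
Spelled out in the full pipeline, that substitution might look like this (a
sketch, assuming each partition reads rows from a single file, so the first
row per partition is enough; the distinct over the per-partition names keeps
the sum correct when one file spans several partitions):

import java.io.File
import java.net.URI
import org.apache.spark.sql.functions.{input_file_name, sum}

ds.select(input_file_name.as("filename"))
  .as[String]
  // one candidate name per partition instead of a full distinct over all rows
  .mapPartitions(it => if (it.hasNext) Iterator(it.next()) else Iterator.empty)
  .distinct
  .map(filename => new File(new URI(filename).getPath).length)
  .select(sum($"value"))
  .show()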


Enrico


On 19.06.22 at 08:59, Enrico Minack wrote:
Given you already know your input files (input_file_name), why not get
their sizes and sum them up?


import java.io.File
import java.net.URI
import org.apache.spark.sql.functions.input_file_name

ds.select(input_file_name.as("filename"))
  .distinct.as[String]
  .map(filename => new File(new URI(filename).getPath).length)
  .select(sum($"value"))
  .show()

Enrico


On 19.06.22 at 03:16, Yong Walt wrote:
import java.io.File
val someFile = new File("somefile.txt")
val fileSize = someFile.length

This one?

On Sun, Jun 19, 2022 at 4:33 AM mbreuer  wrote:

Hello Community,

I am working on optimizations for file sizes and number of files. In the
data frame there is a function input_file_name which returns the file
name. I am missing a counterpart to get the size of the file. Just the
size, like "ls -l" returns. Is there something like that?

Kind regards,
Markus







Re: input file size

2022-06-19 Thread marc nicole
Reasoning in terms of files (vs. datasets, as I first read this question), I
think this is more appropriate in Spark:

> org.apache.spark.util.Utils.getFileLength(new File("filePath"),null);

it will yield the same result as

> new File("filePath").length();


On Sun, Jun 19, 2022 at 11:11 AM, Enrico Minack wrote:

> Maybe a
>
>   .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)
>
> might be faster than the
>
>   .distinct.as[String]
>
>
> Enrico
>
>
> On 19.06.22 at 08:59, Enrico Minack wrote:
>
> Given you already know your input files (input_file_name), why not get
> their sizes and sum them up?
>
> import java.io.File
> import java.net.URI
> import org.apache.spark.sql.functions.input_file_name
> ds.select(input_file_name.as("filename"))
>   .distinct.as[String]
>   .map(filename => new File(new URI(filename).getPath).length)
>   .select(sum($"value"))
>   .show()
>
>
> Enrico
>
>
> On 19.06.22 at 03:16, Yong Walt wrote:
>
> import java.io.File
> val someFile = new File("somefile.txt")
> val fileSize = someFile.length
>
> This one?
>
>
> On Sun, Jun 19, 2022 at 4:33 AM mbreuer  wrote:
>
>> Hello Community,
>>
>> I am working on optimizations for file sizes and number of files. In the
>> data frame there is a function input_file_name which returns the file
>> name. I am missing a counterpart to get the size of the file. Just the size,
>> like "ls -l" returns. Is there something like that?
>>
>> Kind regards,
>> Markus
>>
>>
>>
>>
>
>


How reading works?

2022-06-19 Thread Sid
Hi,

I already have a partitioned JSON dataset in S3, like the below:

edl_timestamp=2022090800

Now, the problem is that the earlier 10 days of data have a duplicate-column
issue, due to which we couldn't read them.

Now the latest 10 days of data are proper. So, I am trying to do
something like the below:

spark.read
  .option("multiline", "true")
  .json("path")
  .filter(col("edl_timestamp") > last_saved_timestamp)

but I am hitting the duplicate-column issue that was present in the old
dataset. So, I am trying to understand how Spark reads the data. Does it read
the full dataset and then filter on the basis of the last saved timestamp, or
does it read only what is required? If the second case is true, then it should
have read the data fine, since the latest data is correct.
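
For reference, a sketch of the second case — where an explicit schema avoids
sampling every file for inference and the filter prunes to the matching
edl_timestamp directories (the schema, path, and timestamp below are made-up
placeholders):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Placeholder schema; supplying it up front skips the eager inference
// pass that would otherwise sample every file under the path.
val schema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)
  .add("edl_timestamp", StringType)   // partition column from the directory layout

val df = spark.read
  .schema(schema)
  .option("multiline", "true")
  .json("s3://bucket/dataset")                  // root of the edl_timestamp=... dirs
  .filter(col("edl_timestamp") > "2022090800")  // pruned to matching partitions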

So just trying to understand. Could anyone help here?

Thanks,
Sid


Re: input file size

2022-06-19 Thread Markus Breuer
Wow! That was exactly what I was looking for. I hadn't even thought of
File.length, and thanks to your tips the solution is now on a silver
platter in front of me.

Thank you very much!

On Sun, Jun 19, 2022 at 12:03 PM, marc nicole wrote:

> Reasoning in terms of files (vs. datasets, as I first read this question), I
> think this is more appropriate in Spark:
>
>> org.apache.spark.util.Utils.getFileLength(new File("filePath"),null);
>
> it will yield the same result as
>
>> new File("filePath").length();
>
>
> On Sun, Jun 19, 2022 at 11:11 AM, Enrico Minack wrote:
>
>> Maybe a
>>
>>   .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)
>>
>> might be faster than the
>>
>>   .distinct.as[String]
>>
>>
>> Enrico
>>
>>
>> On 19.06.22 at 08:59, Enrico Minack wrote:
>>
>> Given you already know your input files (input_file_name), why not
>> get their sizes and sum them up?
>>
>> import java.io.File
>> import java.net.URI
>> import org.apache.spark.sql.functions.input_file_name
>> ds.select(input_file_name.as("filename"))
>>   .distinct.as[String]
>>   .map(filename => new File(new URI(filename).getPath).length)
>>   .select(sum($"value"))
>>   .show()
>>
>>
>> Enrico
>>
>>
>> On 19.06.22 at 03:16, Yong Walt wrote:
>>
>> import java.io.File
>> val someFile = new File("somefile.txt")
>> val fileSize = someFile.length
>>
>> This one?
>>
>>
>> On Sun, Jun 19, 2022 at 4:33 AM mbreuer  wrote:
>>
>>> Hello Community,
>>>
>>> I am working on optimizations for file sizes and number of files. In the
>>> data frame there is a function input_file_name which returns the file
>>> name. I am missing a counterpart to get the size of the file. Just the size,
>>> like "ls -l" returns. Is there something like that?
>>>
>>> Kind regards,
>>> Markus
>>>
>>>
>>>
>>>
>>
>>