Re: input file size
Hi,

Just so that we understand the intention: why do you need to know the file size? Are you not using a splittable file format? If you use Spark Streaming to read the files, with a run-once trigger, then you will be able to get the metadata of the files, I believe.

Regards,
Gourav Sengupta

On Sun, Jun 19, 2022 at 8:00 AM Enrico Minack wrote:
> Given you already know your input files (input_file_name), why not get
> their size and sum it up?
>
>     import java.io.File
>     import java.net.URI
>     import org.apache.spark.sql.functions.input_file_name
>
>     ds.select(input_file_name.as("filename"))
>       .distinct.as[String]
>       .map(filename => new File(new URI(filename).getPath).length)
>       .select(sum($"value"))
>       .show()
>
> Enrico
>
> On 19.06.22 at 03:16, Yong Walt wrote:
>> import java.io.File
>> val someFile = new File("somefile.txt")
>> val fileSize = someFile.length
>>
>> This one?
>>
>> On Sun, Jun 19, 2022 at 4:33 AM mbreuer wrote:
>>> Hello Community,
>>>
>>> I am working on optimizations for file sizes and number of files. In the
>>> data frame there is a function input_file_name which returns the file
>>> name. I am missing a counterpart to get the size of the file, just the
>>> size, like "ls -l" returns. Is there something like that?
>>>
>>> Kind regards,
>>> Markus
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
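[Editor's note: the snippet above resolves paths with java.io.File, which only works for local files. For inputs on S3 or HDFS, a sketch along the following lines, using the Hadoop FileSystem API instead, may be needed. Untested; it assumes a SparkSession `spark` and a Dataset `ds` are already defined, and that the filenames are small enough to collect to the driver.]

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.input_file_name

import spark.implicits._

// Sum input file sizes via the Hadoop FileSystem API, which also handles
// non-local schemes such as s3a:// or hdfs:// where java.io.File would not.
val conf = spark.sparkContext.hadoopConfiguration
val totalBytes = ds.select(input_file_name.as("filename"))
  .distinct.as[String]
  .collect()                                 // one row per input file
  .map { name =>
    val uri = new URI(name)
    FileSystem.get(uri, conf).getFileStatus(new Path(uri)).getLen
  }
  .sum
```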
Re: input file size
Maybe a

    .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else Iterator.empty)

might be faster than the

    .distinct.as[String]

Enrico

On 19.06.22 at 08:59, Enrico Minack wrote:
> [quoted text trimmed]
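[Editor's note: spelled out in context, the suggestion above might look like the following sketch (untested). A final .distinct is still kept, now over at most one name per partition, because a file that is split across several partitions would otherwise be counted more than once.]

```scala
import java.io.File
import java.net.URI
import org.apache.spark.sql.functions.{input_file_name, sum}

// Take one filename per partition instead of shuffling every row through
// distinct; the cheap trailing .distinct deduplicates files that span
// partitions.
ds.select(input_file_name.as("filename"))
  .as[String]
  .mapPartitions(it => if (it.hasNext) Iterator(it.next()) else Iterator.empty)
  .distinct
  .map(filename => new File(new URI(filename).getPath).length)
  .select(sum($"value"))
  .show()
```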
Re: input file size
Reasoning in files (vs. datasets, as I first thought of this question), I think this is more adequate in Spark:

    org.apache.spark.util.Utils.getFileLength(new File("filePath"), null)

It will yield the same result as:

    new File("filePath").length()

On Sun, Jun 19, 2022 at 11:11 AM, Enrico Minack wrote:
> [quoted text trimmed]
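[Editor's note: for local paths, the standard-library routes agree with each other. A minimal self-contained sketch, with an illustrative temporary file; the only behavioral difference is how a missing file is reported.]

```scala
import java.io.File
import java.nio.file.Files

// Create a small temporary file and read its size two ways.
val f = File.createTempFile("sizedemo", ".txt")
Files.write(f.toPath, "hello spark".getBytes("UTF-8"))

val viaIo  = f.length()           // java.io.File: returns 0 if the file does not exist
val viaNio = Files.size(f.toPath) // java.nio: throws NoSuchFileException instead

println(s"io=$viaIo nio=$viaNio") // both are 11 for this content
assert(viaIo == viaNio)
f.delete()
```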
How reading works?
Hi,

I already have a partitioned JSON dataset in S3, like the below:

    edl_timestamp=2022090800

Now, the problem is that in the earlier 10 days of data collection there was a duplicate-columns issue, due to which we couldn't read the data. The latest 10 days of data are proper. So I am trying to do something like the below:

    spark.read.option("multiline","true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)

but I am getting the duplicate-column issue that was present in the old dataset. So I am trying to understand how Spark reads the data. Does it read the full dataset and then filter on the basis of the last saved timestamp, or does it read only what is required? If the second case were true, it should have been able to read the data, since the latest data is correct. So I am just trying to understand. Could anyone help here?

Thanks,
Sid
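[Editor's note: a filter on a partition column such as edl_timestamp is normally pushed down, so Spark lists and scans only the matching directories. The catch with JSON is schema inference: when no schema is supplied, Spark samples the files to infer one before the filter takes effect, which can touch the corrupt older partitions and surface the duplicate-column error. A hedged sketch of the usual workaround; the schema and base path below are illustrative, not from the thread.]

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Supply an explicit schema so Spark never samples the corrupt older files,
// then let partition pruning skip their directories at planning time.
val schema = StructType(Seq(
  StructField("id", LongType),        // illustrative fields
  StructField("payload", StringType)
))

val df = spark.read
  .option("multiline", "true")
  .schema(schema)                     // disables JSON schema inference
  .json("s3://bucket/dataset/")       // base path of the partitioned data
  .filter(col("edl_timestamp") > last_saved_timestamp)
```

Whether pruning actually happened can be checked with df.explain(), which should list the filter under PartitionFilters in the file scan node.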
Re: input file size
Wow! That was exactly what I was looking for. I hadn't even thought of File.length, and thanks to your tips the solution is now on a silver platter in front of me. Thank you very much!

On Sun, Jun 19, 2022 at 12:03 PM, marc nicole wrote:
> [quoted text trimmed]