Re: Spark Small file issue
The sizes of all 800 files in a partition folder are in the byte range; together they sum to about 200 MB, which is each partition folder's input size. And I am using ORC format; I have never used Parquet.
Re: Spark Small file issue
So I should have done some back-of-the-napkin math before all of this. You are writing out 800 files, each < 128 MB. If they were all 128 MB, that would be about 100 GB of data being written. I'm not sure how much hardware you have, but the fact that you can shuffle roughly 100 GB to a single thread and write it out in 13 extra minutes actually feels really good for Spark. You are writing out roughly 130 MB/sec of compressed parquet data. It has been a little while since I benchmarked it, but that feels about the right order of magnitude.

I would suggest that you try repartitioning to 10 or 100 partitions instead of 1.

On Tue, Jun 23, 2020 at 4:54 PM Hichki wrote:
> Hello Team,
>
> I am new to the Spark environment. I have converted a Hive query to Spark Scala.
> Now I am loading data and doing performance testing. Below are details on
> loading 3 weeks of data. The cluster-level small-file average size is set to 128 MB.
>
> 1. The new temp table I am loading data into is ORC formatted, as the current
> Hive table is stored as ORC.
>
> 2. Each Hive table partition folder is 200 MB.
>
> 3. I am using repartition(1) in the Spark code so that it creates one 200 MB
> part file in each partition folder (to avoid the small file issue). With this,
> the job completes in 23 to 26 minutes.
>
> 4. If I don't use repartition(), the job completes in 12 to 13 minutes. But the
> problem with this approach is that it creates 800 part files (each < 128 MB)
> in each partition folder.
>
> I am not sure how to reduce processing time and avoid creating small files at
> the same time. Could anyone please help me with this situation?
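To make that suggestion concrete, here is a minimal sketch of the middle ground, assuming Spark with Hive support and an existing ORC-backed target table; the session setup and table names below are placeholders, not from this thread:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical session and source table; names are placeholders.
val spark = SparkSession.builder()
  .appName("orc-write-demo")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("staging.three_weeks_data")

// Instead of repartition(1) (one slow writer) or no repartition (800 tiny files),
// spread the final write over a small number of tasks, e.g. 10.
df.repartition(10)
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("target.orc_table")   // existing ORC-backed Hive table

With 10 write tasks each partition folder should end up with at most around 10 part files, which is usually an acceptable trade-off between write parallelism and file count.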
Re: Spark Small file issue
Hi, I am doing the repartition at the end, i.e., just before insert-overwriting the table. I can see that this last step (the repartition) is what takes the extra time.
Re: Spark Small file issue
I second that. We have gotten bitten too many times by coalesce impacting upstream stages in unintended ways, so I avoid coalesce on write altogether. I prefer to use repartition (and take the shuffle hit) before writing, especially if you are writing out partitioned data, or, if possible, to use adaptive query execution to avoid producing too many files in the first place.
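A hedged sketch of both options, assuming the same hypothetical df and target table as in the earlier sketch; the adaptive settings are the standard Spark 3.x keys, and "event_date" is a placeholder for the real Hive partition column:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Option 1: shuffle by the Hive partition column before the partitioned write,
// so all rows for a given partition folder land in the same task and produce
// one (larger) file per folder.
df.repartition(col("event_date"))
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("target.orc_table")

// Option 2 (Spark 3.x): enable adaptive query execution so Spark coalesces
// shuffle partitions to a target size and the final stage does not fan out
// into hundreds of tiny files. Set these before running the query.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")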
Re: Spark Small file issue
First, you need to be careful with coalesce. It impacts upstream processing, so if you are doing a lot of computation in the last stage before the repartition, coalesce will make the problem worse: all of that computation will happen in a single thread instead of being spread out.

My guess is that the extra time has something to do with writing your output files. Writing ORC and/or Parquet is not cheap; it does a lot of compression and statistics calculation. I am also not sure why, but from what I have seen they do not scale very linearly as more data is put into a single file. You might also be doing the repartition too early. There are statistics on the SQL page of the UI where you can see which stages took a long time; that should point you in the right direction.
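To illustrate the coalesce-versus-repartition point, here is a small sketch; expensiveTransform is a stand-in for whatever heavy work precedes the write, not anything from the original job, and df is the same hypothetical DataFrame as in the earlier sketches:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash}

// Stand-in for the expensive computation in the last stage before the write.
def expensiveTransform(in: DataFrame): DataFrame =
  in.withColumn("row_hash", hash(in.columns.map(col): _*))

// coalesce(1) removes the shuffle boundary, so expensiveTransform itself
// collapses into a single task along with the write.
df.transform(expensiveTransform).coalesce(1).write.orc("/tmp/out_coalesce")

// repartition(1) inserts a shuffle, so expensiveTransform still runs in
// parallel and only the final write happens in one task.
df.transform(expensiveTransform).repartition(1).write.orc("/tmp/out_repartition")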
Re: Spark Small file issue
Hi,

When reducing the number of partitions it is better to use coalesce, because it doesn't need to shuffle the data:

dataframe.coalesce(1)