Re: spark parquet too many small files ?
Hi Takeshi,

I can't use coalesce in the spark-sql shell, right? I know we can use coalesce in a Spark Scala application, but in this project we are not building a jar or using Python; we just execute a Hive query in the spark-sql shell and submit it to YARN in client mode. Example:

spark-sql --verbose \
  --queue default \
  --name wchargeback_event.sparksql.kali \
  --master yarn-client \
  --driver-memory 15g \
  --executor-memory 15g \
  --num-executors 10 \
  --executor-cores 2 \
  -f /x/home/pp_dt_fin_batch/users/srtummala/run-spark/sql/wtr_full.sql \
  --conf "spark.yarn.executor.memoryOverhead=8000" \
  --conf "spark.sql.shuffle.partitions=50" \
  --conf "spark.kyroserializer.buffer.max.mb=5g" \
  --conf "spark.driver.maxResultSize=20g" \
  --conf "spark.storage.memoryFraction=0.8" \
  --conf "spark.hadoopConfiguration=2560" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.service.enabled=false" \
  --conf "spark.executor.instances=10"

Thanks
Sri

On Sat, Jul 2, 2016 at 2:53 AM, Takeshi Yamamuro <linguin@gmail.com> wrote:
> Please also see https://issues.apache.org/jira/browse/SPARK-16188.
>
> // maropu

--
Thanks & Regards
Sri Tummala
Re: spark parquet too many small files ?
Please also see https://issues.apache.org/jira/browse/SPARK-16188.

// maropu

On Fri, Jul 1, 2016 at 7:39 PM, kali.tumm...@gmail.com <kali.tumm...@gmail.com> wrote:
> I found the JIRA for the issue. Will there be a fix in the future, or no fix?
>
> https://issues.apache.org/jira/browse/SPARK-6221

--
---
Takeshi Yamamuro
Re: spark parquet too many small files ?
I found the JIRA for the issue. Will there be a fix in the future, or no fix?

https://issues.apache.org/jira/browse/SPARK-6221

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27267.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark parquet too many small files ?
Hi Neelesh,

As I said in my earlier emails, it's not a Spark Scala application; I am working with Spark SQL only. I launch the spark-sql shell and run my Hive code inside it. The spark-sql shell accepts Spark SQL functions, but it doesn't accept functions like coalesce, which is a Spark Scala function. What I am trying to do is below:

FROM (SELECT * FROM source_table WHERE load_date = '2016-09-23') a
INSERT OVERWRITE TABLE target_table
SELECT *;

Thanks
Sri

Sent from my iPhone

> On 1 Jul 2016, at 17:35, nsalian [via Apache Spark User List] <ml-node+s1001560n27265...@n3.nabble.com> wrote:
>
> [quoted reply trimmed; see Neelesh's message below]

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27266.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
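[Editor's note: a pure-SQL way to bound the number of output files, staying entirely inside the spark-sql shell, is to cap the shuffle partition count and force a shuffle with DISTRIBUTE BY. This is only a sketch, reusing the illustrative table and column names from the message above:]

```sql
-- Each shuffle stage now produces at most 10 partitions,
-- so the INSERT writes at most ~10 files.
SET spark.sql.shuffle.partitions=10;

INSERT OVERWRITE TABLE target_table
SELECT *
FROM source_table
WHERE load_date = '2016-09-23'
DISTRIBUTE BY load_date;  -- introduces the shuffle the cap applies to
```

One caveat: distributing by a column that holds a single value (as after this date filter) funnels all rows through one task; distributing by a higher-cardinality expression spreads the write across tasks while still capping the file count.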
Re: spark parquet too many small files ?
Hi Sri,

Thanks for the question. You can simply start by doing this in the initial stage:

val sqlContext = new SQLContext(sc)
// using a JSON example here; args(0) is the path to the file(s)
val customerList = sqlContext.read.json(args(0)).coalesce(20)

This will reduce the partitions. You can proceed with repartitioning the data further on. The goal is to reduce the number of files in the end when you do a saveAsParquet.

Hope that helps.

-
Neelesh S. Salian
Cloudera

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
spark parquet too many small files ?
Hi All,

I am running Hive queries through spark-sql in yarn-client mode. The SQL is pretty simple: load dynamic partitions into a target Parquet table. I used Hive configuration parameters such as (set hive.merge.smallfiles.avgsize=25600; set hive.merge.size.per.task=256000;), which in Hive usually merge small files up to the 256 MB block size. Are these parameters supported in spark-sql? Is there another way to merge the many small Parquet files into larger ones?

If it were a Scala application I could use the coalesce() function or repartition, but we are not using a Spark Scala application here, just plain spark-sql.

Thanks
Sri

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
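[Editor's note: for a dynamic-partition load like the one described above, a commonly used pure-SQL pattern is to distribute by the partition column, so each partition value is handled by a single task and therefore written as a single file. This is a sketch; the table names (target_parquet, staging_table) and columns (col1, col2, load_date) are placeholders, not from the thread:]

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_parquet PARTITION (load_date)
SELECT col1, col2, load_date
FROM staging_table
DISTRIBUTE BY load_date;  -- rows for one partition value land on one task,
                          -- yielding roughly one file per partition
```

The trade-off is that a very large partition value then becomes one very large file written by a single task; for skewed data, distributing by the partition column plus a bucketing expression splits such partitions across a few tasks.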