Re: spark parquet too many small files ?
Hi Takeshi,

I can't use coalesce in the spark-sql shell, right? I know we can use coalesce in a Spark Scala application, but in this project we are not building a jar or using Python; we just execute a Hive query in the spark-sql shell and submit it to YARN in client mode. Example:

spark-sql --verbose \
  --queue default \
  --name wchargeback_event.sparksql.kali \
  --master yarn-client \
  --driver-memory 15g \
  --executor-memory 15g \
  --num-executors 10 \
  --executor-cores 2 \
  -f /x/home/pp_dt_fin_batch/users/srtummala/run-spark/sql/wtr_full.sql \
  --conf "spark.yarn.executor.memoryOverhead=8000" \
  --conf "spark.sql.shuffle.partitions=50" \
  --conf "spark.kyroserializer.buffer.max.mb=5g" \
  --conf "spark.driver.maxResultSize=20g" \
  --conf "spark.storage.memoryFraction=0.8" \
  --conf "spark.hadoopConfiguration=2560" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.service.enabled=false" \
  --conf "spark.executor.instances=10"

Thanks
Sri

On Sat, Jul 2, 2016 at 2:53 AM, Takeshi Yamamuro <linguin@gmail.com> wrote:
> Please also see https://issues.apache.org/jira/browse/SPARK-16188.
>
> // maropu

--
Thanks & Regards
Sri Tummala
Re: spark parquet too many small files ?
Please also see https://issues.apache.org/jira/browse/SPARK-16188.

// maropu

On Fri, Jul 1, 2016 at 7:39 PM, kali.tumm...@gmail.com <kali.tumm...@gmail.com> wrote:
> I found the JIRA for the issue. Will there be a fix in the future, or no fix?
>
> https://issues.apache.org/jira/browse/SPARK-6221

--
---
Takeshi Yamamuro
Re: spark parquet too many small files ?
I found the JIRA for the issue. Will there be a fix in the future, or no fix?

https://issues.apache.org/jira/browse/SPARK-6221

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27267.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark parquet too many small files ?
Hi Neelesh,

As I said in my earlier emails, it's not a Spark Scala application; I am working with Spark SQL only. I launch the spark-sql shell and run my Hive code inside it. The spark-sql shell accepts Spark SQL functions, but it doesn't accept functions like coalesce, which is a Spark Scala function. What I am trying to do is below:

FROM (SELECT * FROM source_table WHERE load_date = '2016-09-23') a
INSERT OVERWRITE TABLE target_table
SELECT *;

Thanks
Sri

Sent from my iPhone

> On 1 Jul 2016, at 17:35, nsalian [via Apache Spark User List] <ml-node+s1001560n27265...@n3.nabble.com> wrote:
>
> [quoted reply trimmed; see Neelesh's message below]

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27266.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
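[Editor's note: a pure-SQL way to bound the number of output files, staying entirely inside the spark-sql shell, is to cap the shuffle partition count and force a shuffle with DISTRIBUTE BY. This is only a sketch, reusing the illustrative table and column names from the message above:]

```sql
-- Each shuffle stage now produces at most 10 partitions,
-- so the INSERT writes at most ~10 files.
SET spark.sql.shuffle.partitions=10;

INSERT OVERWRITE TABLE target_table
SELECT *
FROM source_table
WHERE load_date = '2016-09-23'
DISTRIBUTE BY load_date;  -- introduces the shuffle the cap applies to
```

One caveat: distributing by a column that holds a single value (as after this date filter) funnels all rows through one task; distributing by a higher-cardinality expression spreads the write across tasks while still capping the file count.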
Re: spark parquet too many small files ?
Hi Sri,

Thanks for the question. You can simply start by doing this in the initial stage:

val sqlContext = new SQLContext(sc)
// using a JSON example here; args(0) is the path to the file(s)
val customerList = sqlContext.read.json(args(0)).coalesce(20)

This will reduce the partitions. You can proceed with repartitioning the data further on. The goal is to reduce the number of files in the end when you do a saveAsParquet.

Hope that helps.

-
Neelesh S. Salian
Cloudera

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
spark parquet too many small files ?
Hi All,

I am running Hive queries through spark-sql in yarn-client mode. The SQL is pretty simple: load dynamic partitions into a target Parquet table. I used Hive configuration parameters such as (set hive.merge.smallfiles.avgsize=25600; set hive.merge.size.per.task=256000;), which in Hive usually merge small files up to the 256 MB block size. Are these parameters supported in spark-sql? Is there another way to merge the many small Parquet files into larger ones?

If it were a Scala application I could use the coalesce() function or repartition, but we are not using a Spark Scala application here, just plain spark-sql.

Thanks
Sri

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
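[Editor's note: for a dynamic-partition load like the one described above, a commonly used pure-SQL pattern is to distribute by the partition column, so each partition value is handled by a single task and therefore written as a single file. This is a sketch; the table names (target_parquet, staging_table) and columns (col1, col2, load_date) are placeholders, not from the thread:]

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_parquet PARTITION (load_date)
SELECT col1, col2, load_date
FROM staging_table
DISTRIBUTE BY load_date;  -- rows for one partition value land on one task,
                          -- yielding roughly one file per partition
```

The trade-off is that a very large partition value then becomes one very large file written by a single task; for skewed data, distributing by the partition column plus a bucketing expression splits such partitions across a few tasks.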