Re: How do we control output part files created by Spark job?

2015-07-11 Thread Umesh Kacha
Hi Srikanth, thanks very much. It worked when I set
spark.sql.shuffle.partitions=10. Will reducing the shuffle partitions slow down
my hiveContext group-by query, or will it have no effect? Please guide.
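
For reference, a minimal sketch of how this setting can be applied in Spark 1.4
(assuming a HiveContext named hiveContext; the table, query, and path are
hypothetical):

// lower the number of shuffle partitions before running the group-by
hiveContext.setConf("spark.sql.shuffle.partitions", "10")
// subsequent shuffles (groupBy, join) now produce 10 partitions, and hence
// roughly 10 part files when the result is written out
val result = hiveContext.sql("SELECT key, count(*) FROM my_table GROUP BY key")
result.write.format("orc").save("/path/in/hdfs")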

On Sat, Jul 11, 2015 at 7:41 AM, Srikanth srikanth...@gmail.com wrote:

 Is there a join involved in your SQL?
 Have a look at spark.sql.shuffle.partitions.

 Srikanth

 On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha umesh.ka...@gmail.com wrote:

 Hi Srikanth, thanks for the response. I have the following code:

 hiveContext.sql("insert into ...").coalesce(6)

 The above code does not create 6 part files; it creates around 200 small
 files.

 Please guide. Thanks.
 On Jul 8, 2015 4:07 AM, Srikanth srikanth...@gmail.com wrote:

 Did you do

 yourRdd.coalesce(6).saveAsTextFile()

 or

 yourRdd.coalesce(6)
 yourRdd.saveAsTextFile()
 ?

 Srikanth

 On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com
 wrote:

 Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and neither
 reduces the number of part files. Even after calling the above methods I still
 see around 200 small part files of about 20 MB each, again as ORC files.


 On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Try the coalesce function to limit the number of part files.
 On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.










Re: How do we control output part files created by Spark job?

2015-07-11 Thread Srikanth
Reducing the number of partitions may impact memory consumption, especially
if there is an uneven distribution of the keys used in groupBy. It depends on
your dataset.
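
A minimal sketch of one way to keep the shuffle wide (so a skewed key does not
overload a single task) while still limiting output files, assuming the Spark
1.4 DataFrame API (table, column, and path names are hypothetical):

// keep plenty of shuffle partitions for the skew-prone aggregation
hiveContext.setConf("spark.sql.shuffle.partitions", "200")
val grouped = hiveContext.sql("SELECT key, count(*) AS cnt FROM my_table GROUP BY key")
// shrink only at write time; coalesce merges partitions without a full shuffle
grouped.coalesce(10).write.format("orc").save("/path/in/hdfs")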

On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha umesh.ka...@gmail.com wrote:

 Hi Srikanth, thanks very much. It worked when I set
 spark.sql.shuffle.partitions=10. Will reducing the shuffle partitions slow
 down my hiveContext group-by query, or will it have no effect? Please guide.

 On Sat, Jul 11, 2015 at 7:41 AM, Srikanth srikanth...@gmail.com wrote:

 Is there a join involved in your SQL?
 Have a look at spark.sql.shuffle.partitions.

 Srikanth

 On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha umesh.ka...@gmail.com
 wrote:

 Hi Srikanth, thanks for the response. I have the following code:

 hiveContext.sql("insert into ...").coalesce(6)

 The above code does not create 6 part files; it creates around 200 small
 files.

 Please guide. Thanks.
 On Jul 8, 2015 4:07 AM, Srikanth srikanth...@gmail.com wrote:

 Did you do

 yourRdd.coalesce(6).saveAsTextFile()

 or

 yourRdd.coalesce(6)
 yourRdd.saveAsTextFile()
 ?

 Srikanth

 On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com
 wrote:

 Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and
 neither reduces the number of part files. Even after calling the above
 methods I still see around 200 small part files of about 20 MB each, again
 as ORC files.


 On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Try the coalesce function to limit the number of part files.
 On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.











Re: How do we control output part files created by Spark job?

2015-07-10 Thread Srikanth
Is there a join involved in your SQL?
Have a look at spark.sql.shuffle.partitions.

Srikanth

On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha umesh.ka...@gmail.com wrote:

 Hi Srikanth, thanks for the response. I have the following code:

 hiveContext.sql("insert into ...").coalesce(6)

 The above code does not create 6 part files; it creates around 200 small files.

 Please guide. Thanks.
 On Jul 8, 2015 4:07 AM, Srikanth srikanth...@gmail.com wrote:

 Did you do

 yourRdd.coalesce(6).saveAsTextFile()

 or

 yourRdd.coalesce(6)
 yourRdd.saveAsTextFile()
 ?

 Srikanth

 On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com
 wrote:

 Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and neither
 reduces the number of part files. Even after calling the above methods I still
 see around 200 small part files of about 20 MB each, again as ORC files.


 On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Try the coalesce function to limit the number of part files.
 On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.









Re: How do we control output part files created by Spark job?

2015-07-07 Thread Umesh Kacha
Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and neither
reduces the number of part files. Even after calling the above methods I still
see around 200 small part files of about 20 MB each, again as ORC files.

On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu 
vsathishkuma...@gmail.com wrote:

 Try the coalesce function to limit the number of part files.
 On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.







Re: How do we control output part files created by Spark job?

2015-07-07 Thread Gylfi
Hi. 

I am just wondering if the RDD was actually modified.
Did you test it by printing rdd.partitions.length before and after?
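
A minimal sketch of such a check (yourRdd and the paths are hypothetical):

val before = yourRdd.partitions.length     // e.g. 200
val coalesced = yourRdd.coalesce(6)        // coalesce returns a NEW RDD
val after = coalesced.partitions.length    // should now be 6
println(s"partitions before=$before, after=$after")
coalesced.saveAsTextFile("/path/in/hdfs")  // one part file per partition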

Regards,
Gylfi. 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649p23705.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do we control output part files created by Spark job?

2015-07-07 Thread ponkin
Hi,
Did you try reducing the number of executors and cores? Usually num-executors *
executor-cores = the number of parallel tasks, so you can reduce the number of
parallel tasks on the command line like this:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
For more details see
https://spark.apache.org/docs/1.2.0/running-on-yarn.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649p23706.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do we control output part files created by Spark job?

2015-07-07 Thread Srikanth
Did you do

yourRdd.coalesce(6).saveAsTextFile()

or

yourRdd.coalesce(6)
yourRdd.saveAsTextFile()
?
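
The distinction matters because coalesce is a transformation that returns a new
RDD; it does not modify yourRdd in place. A short sketch (names and paths
hypothetical):

val coalesced = yourRdd.coalesce(6)       // new RDD with 6 partitions
coalesced.saveAsTextFile("/out/six")      // writes ~6 part files
yourRdd.saveAsTextFile("/out/original")   // still one part file per original partition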

Srikanth

On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com wrote:

 Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and neither
 reduces the number of part files. Even after calling the above methods I still
 see around 200 small part files of about 20 MB each, again as ORC files.


 On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Try the coalesce function to limit the number of part files.
 On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.








Re: How do we control output part files created by Spark job?

2015-07-07 Thread Umesh Kacha
Hi Srikanth, thanks for the response. I have the following code:

hiveContext.sql("insert into ...").coalesce(6)

The above code does not create 6 part files; it creates around 200 small files.

Please guide. Thanks.
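
One likely explanation, sketched below: hiveContext.sql runs the INSERT
eagerly, so the ~200 files are already written by the time coalesce is called
on the returned DataFrame; the coalesce has to happen on the data before it is
written. A hedged sketch against the Spark 1.4 DataFrame API (source and target
table names are hypothetical):

// read the data as a DataFrame instead of issuing the INSERT directly
val data = hiveContext.sql("SELECT * FROM source_table")
// coalesce BEFORE writing, then insert through the DataFrame writer
data.coalesce(6).write.insertInto("target_table")  // ~6 part files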
On Jul 8, 2015 4:07 AM, Srikanth srikanth...@gmail.com wrote:

 Did you do

 yourRdd.coalesce(6).saveAsTextFile()

 or

 yourRdd.coalesce(6)
 yourRdd.saveAsTextFile()
 ?

 Srikanth

 On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com
 wrote:

 Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), and neither
 reduces the number of part files. Even after calling the above methods I still
 see around 200 small part files of about 20 MB each, again as ORC files.


 On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Try the coalesce function to limit the number of part files.
 On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.









How do we control output part files created by Spark job?

2015-07-06 Thread kachau
Hi, I have a couple of Spark jobs which process thousands of files every
day. File sizes may vary from MBs to GBs. After finishing a job I usually save
using the following code:

finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as
of Spark 1.4

The Spark job creates plenty of small part files in the final output directory.
As far as I understand, Spark creates one part file per partition/task; please
correct me if I am wrong. How do we control the number of part files Spark
creates? Finally, I would like to create a Hive table using these Parquet/ORC
directories, and I have heard Hive is slow when we have a large number of small
files. Please guide me; I am new to Spark. Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do we control output part files created by Spark job?

2015-07-06 Thread Sathish Kumaran Vairavelu
Try the coalesce function to limit the number of part files.
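
A minimal sketch of that, assuming the Spark 1.4 DataFrame API (the DataFrame
and path are hypothetical):

// coalesce merges existing partitions without a full shuffle, so it is
// cheaper than repartition when you only need to reduce the partition count
dataFrame.coalesce(6).write.format("orc").save("/path/in/hdfs")  // ~6 part files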
On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:

 Hi, I have a couple of Spark jobs which process thousands of files every
 day. File sizes may vary from MBs to GBs. After finishing a job I usually
 save using the following code:

 finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
 dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
 as of Spark 1.4

 The Spark job creates plenty of small part files in the final output
 directory. As far as I understand, Spark creates one part file per
 partition/task; please correct me if I am wrong. How do we control the
 number of part files Spark creates? Finally, I would like to create a Hive
 table using these Parquet/ORC directories, and I have heard Hive is slow
 when we have a large number of small files. Please guide me; I am new to
 Spark. Thanks in advance.







RE: How do we control output part files created by Spark job?

2015-07-06 Thread Mohammed Guller
You could repartition the DataFrame before saving it. However, that would
impact the parallelism of the next jobs that read these files from HDFS.
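
A short sketch of that approach (Spark 1.4 DataFrame API; the path is
hypothetical):

// repartition performs a full shuffle into exactly 6 partitions;
// the save then produces one part file per partition
dataFrame.repartition(6).write.format("orc").save("/path/in/hdfs")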

Mohammed


-Original Message-
From: kachau [mailto:umesh.ka...@gmail.com] 
Sent: Monday, July 6, 2015 10:23 AM
To: user@spark.apache.org
Subject: How do we control output part files created by Spark job?

Hi, I have a couple of Spark jobs which process thousands of files every
day. File sizes may vary from MBs to GBs. After finishing a job I usually save
using the following code:

finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as of
Spark 1.4

The Spark job creates plenty of small part files in the final output directory.
As far as I understand, Spark creates one part file per partition/task; please
correct me if I am wrong. How do we control the number of part files Spark
creates? Finally, I would like to create a Hive table using these Parquet/ORC
directories, and I have heard Hive is slow when we have a large number of small
files. Please guide me; I am new to Spark. Thanks in advance.





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do we control output part files created by Spark job?

2015-07-06 Thread Gylfi
Hi. 

Have you tried repartitioning the finalRDD before saving?
This link might help:
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html
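
A minimal sketch along those lines (finalRDD and the path are hypothetical):

// repartition shuffles the data into the requested number of partitions;
// saveAsTextFile then writes one part file per partition
finalRDD.repartition(6).saveAsTextFile("/path/in/hdfs")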

Regards,
Gylfi.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649p23660.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org