Re: How do we control output part files created by Spark job?
Hi Srikanth, thanks much. It worked when I set spark.sql.shuffle.partitions=10. Will reducing the shuffle partitions slow down my hiveContext group-by query, or will it have no effect? Please guide.
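For reference, the fix is just a configuration change before the query runs. A minimal sketch, assuming a HiveContext named hiveContext and a hypothetical table and column (my_table, some_key):

    // Shuffle partitions determine how many part files an aggregation
    // writes; the default is 200, which is where the ~200 files came from.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")
    val result = hiveContext.sql(
      "SELECT some_key, count(*) AS cnt FROM my_table GROUP BY some_key")
    result.write.format("orc").save("/path/in/hdfs")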
Re: How do we control output part files created by Spark job?
Reducing the number of partitions may have an impact on memory consumption, especially if the distribution of the key used in the groupBy is uneven. It depends on your dataset.
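To gauge that skew before dialing partitions down, one quick sketch, assuming a DataFrame df and a hypothetical grouping column some_key:

    import org.apache.spark.sql.functions.desc

    // If a handful of keys hold most of the rows, fewer partitions will
    // concentrate those rows (and their memory cost) into fewer tasks.
    df.groupBy("some_key").count().orderBy(desc("count")).show(20)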
Re: How do we control output part files created by Spark job?
Is there a join involved in your SQL? Have a look at spark.sql.shuffle.partitions.

Srikanth
Re: How do we control output part files created by Spark job?
Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), but neither reduces the number of part files. Even after calling the above methods I still see around 200 small part files, about 20 MB each, again as ORC files.
Re: How do we control output part files created by Spark job?
Hi. I am just wondering if the RDD was actually modified. Did you test it by printing rdd.partitions.length before and after?

Regards, Gylfi.
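Concretely, something along these lines (an untested sketch, assuming an existing RDD named rdd and a placeholder output path):

    println(s"before: ${rdd.partitions.length} partitions")
    val coalesced = rdd.coalesce(6)
    println(s"after: ${coalesced.partitions.length} partitions")

    // Save the coalesced RDD, not the original.
    coalesced.saveAsTextFile("/path/in/hdfs")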
Re: How do we control output part files created by Spark job?
Hi, did you try reducing the number of executors and cores? Usually num-executors * executor-cores = the number of parallel tasks, so you can reduce the number of parallel tasks on the command line, for example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10

For more details see https://spark.apache.org/docs/1.2.0/running-on-yarn.html
Re: How do we control output part files created by Spark job?
Did you do

    yourRdd.coalesce(6).saveAsTextFile()

or

    yourRdd.coalesce(6)
    yourRdd.saveAsTextFile()

?

Srikanth
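The distinction matters because coalesce is a transformation: it returns a new RDD and leaves the receiver untouched, so the second pattern saves with the original partition count. A sketch of the working version, with a hypothetical output path:

    // Save the RDD that coalesce returns; calling coalesce without using
    // its result has no effect on the subsequent save.
    val merged = yourRdd.coalesce(6)
    merged.saveAsTextFile("/path/out")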
Re: How do we control output part files created by Spark job?
Hi Srikanth, thanks for the response. I have the following code:

    hiveContext.sql("insert into... ").coalesce(6)

The above code does not create 6 part files; it creates around 200 small files. Please guide. Thanks.
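One likely reason, offered as an assumption since the full statement is elided above: an INSERT executes its write as part of the sql() call itself, so coalescing the DataFrame that sql() returns happens after the files already exist, and the file count falls back to spark.sql.shuffle.partitions (200 by default). A sketch of coalescing before the write instead, with a hypothetical SELECT standing in for the real query:

    // Produce the rows with a plain query, shrink to 6 partitions,
    // then let the write itself create the 6 part files.
    val rows = hiveContext.sql("SELECT * FROM my_table")  // hypothetical query
    rows.coalesce(6).write.format("orc").save("/path/in/hdfs")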
How do we control output part files created by Spark job?
Hi, I have a couple of Spark jobs which process thousands of files every day. File sizes may vary from MBs to GBs. After a job finishes I usually save using the following code:

    finalJavaRDD.saveAsParquetFile("/path/in/hdfs");
    // OR, storing as an ORC file as of Spark 1.4:
    dataFrame.write.format("orc").save("/path/in/hdfs")

The Spark job creates plenty of small part files in the final output directory. As far as I understand, Spark creates a part file for each partition/task; please correct me if I am wrong. How do we control the number of part files Spark creates? Finally, I would like to create a Hive table over these Parquet/ORC directories, and I have heard Hive is slow when we have a large number of small files. Please guide; I am new to Spark. Thanks in advance.
Re: How do we control output part files created by Spark job?
Try the coalesce function to limit the number of part files.
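For instance, a sketch assuming an RDD named finalRdd; the target of 6 files is arbitrary and the path is a placeholder:

    // coalesce(n) merges partitions without a full shuffle: cheap, but the
    // merged partitions can end up unevenly sized.
    finalRdd.coalesce(6).saveAsTextFile("/path/in/hdfs")

    // repartition(n) does a full shuffle: more expensive, but it balances
    // partition (and hence file) sizes.
    finalRdd.repartition(6).saveAsTextFile("/path/in/hdfs")

Each partition becomes one part file, so either call caps the file count at 6.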
RE: How do we control output part files created by Spark job?
You could repartition the dataframe before saving it. However, that would impact the parallelism of the next jobs that read these files from HDFS.

Mohammed
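Roughly, the trade-off looks like this (a sketch assuming a DataFrame df and a HiveContext named hiveContext; the partition counts are arbitrary):

    // Fewer output files now...
    df.repartition(6).write.format("orc").save("/path/in/hdfs")

    // ...but the next job starts from ~6 input splits. It can repartition
    // after reading to regain parallelism, at the cost of a shuffle.
    val reread = hiveContext.read.format("orc").load("/path/in/hdfs").repartition(48)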
Re: How do we control output part files created by Spark job?
Hi. Have you tried repartitioning the finalRDD before saving? This link might help: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html

Regards, Gylfi.
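The pattern there is roughly the following (a sketch, assuming an RDD named finalRDD and a placeholder path):

    // Collapse everything into a single partition to get exactly one
    // output file. Only sensible when the data fits in one task.
    finalRDD.repartition(1).saveAsTextFile("/path/in/hdfs")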