Hmm, that’s odd. 

You can always use repartition(n) to increase the number of partitions, but 
that will trigger a shuffle. How large is your ORC file? Have you used the 
NameNode UI to check how many HDFS blocks each ORC file has?
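
For example, a minimal sketch against the Spark 1.x Java API (the hiveContext 
and path are taken from your snippet; the target count of 100 is only 
illustrative, pick it based on your data size and cluster):

    import org.apache.spark.sql.DataFrame;

    // Reading ORC yields one partition per HDFS block by default.
    DataFrame df =
        hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
    // repartition(n) redistributes the rows into n partitions via a full shuffle.
    DataFrame repartitioned = df.repartition(100);

(On the command line, hdfs fsck /hdfs/path/to/orc/files/ -files -blocks will 
also list the blocks per file.)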

Lan


> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
> 
> Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the 
> Spark UI that it has only 12 partitions because of the 12 HDFS blocks; the 
> Hive ORC stripe size is 33554432 (32 MB).
> 
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
> The number of partitions should match the number of HDFS blocks rather than 
> the number of files. Did you confirm from the Spark UI that only 12 partitions 
> were created? What is your orc.stripe.size?
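> 
> For instance, a quick check (a sketch against the Spark 1.x Java API, reusing 
> the hiveContext from your snippet):
> 
>     DataFrame df =
>         hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>     // Number of partitions Spark actually created for this read:
>     int numPartitions = df.rdd().partitions().length;
>     System.out.println("partitions = " + numPartitions);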
> 
> Lan
> 
> 
> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> >
> > Hi, I have the following code that reads ORC files from HDFS; it loads a
> > directory containing 12 ORC files, so Spark creates 12 partitions by
> > default. The directory is huge: decompressed, the ORC data comes to around
> > 10 GB. How do I increase the number of partitions for the code below so that
> > my Spark job runs faster and does not hang for a long time shuffling 10 GB
> > through only 12 partitions? Please guide.
> >
> > DataFrame df =
> >     hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > // select(...) then groupBy(...) triggers a shuffle over the 12 partitions.
> > df.select(...).groupBy(...);
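> >
> > (The groupBy above shuffles into spark.sql.shuffle.partitions partitions,
> > which defaults to 200 and is independent of the 12 read partitions; a
> > sketch of raising it, with 100 as an illustrative value:
> >
> >     hiveContext.setConf("spark.sql.shuffle.partitions", "100");
> >
> > )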
> >
> >
> >
> >
> >
> 
> 
