The Avro files were 500-600 KB each, and the folder contained around 1,200
files, for a total of around 600 MB. Will try repartition. Thank you.
>
> On Oct 28, 2016 at 2:24 AM, Michael Armbrust <mich...@databricks.com> wrote:
>
> How big are your avro files? We collapse many small files into a single
> partition to eliminate scheduler overhead. If you need explicit parallelism
> you can also repartition.
>
> On Thu, Oct 27, 2016 at 5:19 AM, Prithish <prith...@gmail.com> wrote:
>
> > I am trying to read a bunch of AVRO files from an S3 folder using Spark 2.0.
> > No matter how many executors I use or what configuration changes I make,
> > the cluster doesn't seem to use all the executors. I am using the
> > com.databricks.spark.avro library from Databricks to read the AVRO.
> >
> > However, if I try the same on CSV files (same S3 folder, same configuration
> > and cluster), it does use all executors.
> >
> > Is there something that I need to do to enable parallelism when using the
> > Databricks AVRO library?
> >
> > Thanks for your help.