Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Chanh Le
Hi Gene, Thanks for your support. I agree that the number of executors explains the many part files, but having that many parquet files hurts read performance, so I need a way to improve it. The workaround I use is df.coalesce(1).write.mode(SaveMode.Overwrite).partitionBy("network_id")
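[A minimal, self-contained sketch of this coalesce(1) workaround, assuming the Spark 2.x SparkSession API. The input path, application name, and Alluxio master port are placeholders, not taken from the thread; the mail above truncates the write call after partitionBy.]

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object CoalesceWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("CoalesceWrite").getOrCreate()

        // Hypothetical input location; the thread does not show where the source data lives.
        val df = spark.read.parquet("alluxio://master1:19998/RAW_HOURLY")

        // coalesce(1) funnels the write through a single task, so each
        // network_id directory ends up with one parquet part file instead
        // of one file per executor/task.
        df.coalesce(1)
          .write
          .mode(SaveMode.Overwrite)
          .partitionBy("network_id")
          .parquet("alluxio://master1:19998/FACT_ADMIN_HOURLY") // 19998 is Alluxio's default master port, assumed here
      }
    }

The trade-off is that a single task does all the writing for the job, which can be slow for large outputs.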

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Gene Pang
Hi Chanh, You should be able to set the Alluxio block size with: sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb") I think you have many parquet files because you have many Spark executors, each writing out its own partition of the data. Hope that helps, Gene
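[A short sketch of applying Gene's setting before any write. The property name and the "256mb" value come from his reply; the SparkConf setup around it is an assumption.]

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("alluxio-block-size") // assumed app name
    val sc   = new SparkContext(conf)

    // Default block size Alluxio uses for files written by this client.
    // Larger blocks mean fewer Alluxio blocks per parquet part file,
    // but this alone does not reduce the number of part files produced by Spark.
    sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")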

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-03 Thread Chanh Le
Hi Gene, Could you give some suggestions on that? > On Jul 1, 2016, at 5:31 PM, Ted Yu wrote: > > The comment from zhangxiongfei was from a year ago. > > Maybe something has changed since then?

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Ted Yu
Looking under the Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is in use. FYI

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Deepak Sharma
Ok. I came across this issue. Not sure if you already assessed it: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921 The workaround mentioned there may work for you. Thanks Deepak

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Chanh Le
Hi Deepak, Thanks for replying. The way I write into Alluxio is df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet("alluxio://master1:1/FACT_ADMIN_HOURLY") I partition by 2 columns and store. I just want it to automatically write files of a proper size for what I
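[For reference, the write described above as a self-contained sketch. Here df stands for the DataFrame being saved, and the Alluxio master port is a placeholder, since the address in the original mail was truncated.]

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writeHourlyFact(df: DataFrame): Unit = {
      df.write
        .mode(SaveMode.Append)              // keep previously written hours
        .partitionBy("network_id", "time")  // two-level directory layout: network_id=/time=
        .parquet("alluxio://master1:19998/FACT_ADMIN_HOURLY") // port assumed; original address was truncated
    }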