Hi Gene,

Thanks for your support. I agree that the many Parquet files come from the number of executors, but that many small files hurts read performance, so I need a way to improve it. My workaround is:

    df.coalesce(1)
      .write.mode(SaveMode.Overwrite).partitionBy("network_id")
      .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")

I know this is not ideal because funneling the whole write through a single task costs time, but reads improve a lot. Right now, I am using this method to partition my data.
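For reference, here is roughly the whole job (a sketch only: the SparkContext/SQLContext setup, the input path, and the values of alluxioURL, outFolderName, dailyFormat and jobRunTime are stand-ins for my real configuration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.joda.time.DateTime
    import org.joda.time.format.DateTimeFormat

    object WriteFactAdminHourly {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("fact-admin-hourly"))
        val sqlContext = new SQLContext(sc)

        // Stand-in values; the real ones come from my job config.
        val alluxioURL    = "alluxio://master1:19999"
        val outFolderName = "FACT_ADMIN_HOURLY"
        val dailyFormat   = DateTimeFormat.forPattern("yyyy-MM-dd")
        val jobRunTime    = DateTime.now()

        // Alluxio client block size, as Gene suggested below.
        sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "512mb")

        // Illustrative input path.
        val df = sqlContext.read.parquet(s"$alluxioURL/staging/admin_hourly")

        // coalesce(1) avoids a full shuffle but funnels the write through a
        // single task, so each network_id directory ends up with one part
        // file instead of one per executor.
        df.coalesce(1)
          .write.mode(SaveMode.Overwrite)
          .partitionBy("network_id")
          .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")

        sc.stop()
      }
    }

The trade-off is one slow writer task per run in exchange for one large part file per partition directory.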
Regards,
Chanh


> On Jul 8, 2016, at 8:33 PM, Gene Pang <gene.p...@gmail.com> wrote:
>
> Hi Chanh,
>
> You should be able to set the Alluxio block size with:
>
> sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")
>
> I think you have many parquet files because you have many Spark executors
> writing out their partitions of the files.
>
> Hope that helps,
> Gene
>
> On Sun, Jul 3, 2016 at 8:02 PM, Chanh Le <giaosu...@gmail.com> wrote:
> Hi Gene,
> Could you give some suggestions on that?
>
>
>> On Jul 1, 2016, at 5:31 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> The comment from zhangxiongfei was from a year ago.
>>
>> Maybe something has changed since then?
>>
>> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le <giaosu...@gmail.com> wrote:
>> Hi Ted,
>> I set
>>   sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
>>   sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>> but it does not seem to work.
>>
>> <Screen_Shot_2016-07-01_at_2_06_27_PM.png>
>>
>>
>>> On Jul 1, 2016, at 11:38 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> Looking under the Alluxio source, it seems only "fs.hdfs.impl.disable.cache"
>>> is in use.
>>>
>>> FYI
>>>
>>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> OK, I came across this issue. Not sure if you have already assessed it:
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>>> The workaround mentioned there may work for you.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosu...@gmail.com> wrote:
>>> Hi Deepak,
>>> Thanks for replying. The way I write into Alluxio is:
>>>
>>>   df.write.mode(SaveMode.Append).partitionBy("network_id", "time")
>>>     .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
>>>
>>> I partition by two columns and store. I just want the writes to automatically
>>> produce file sizes matching what I already set in Alluxio: 512MB per block.
>>>
>>>
>>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>
>>>> Before writing, coalesce your RDD to 1.
>>>> That will create only one output file.
>>>> Multiple part files happen because each of your executors writes its
>>>> partition to a separate part file.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosu...@gmail.com> wrote:
>>>> Hi everyone,
>>>> I am using Alluxio for storage, but I am a little confused: I set the
>>>> Alluxio block size to 512MB, yet my file parts are only a few KB each and
>>>> there are too many of them.
>>>> Is that normal? I want reads to be fast; do that many parts affect read
>>>> performance?
>>>> How can I set the size of the file parts?
>>>>
>>>> Thanks.
>>>> Chanh
>>>>
>>>>
>>>> <Screen_Shot_2016-07-01_at_9_24_55_AM.png>