Hi Gene,

Thanks for your support. I agree that the many Parquet files come from the number of executors, but that many small files hurts read performance, so I need a way to improve it. My workaround is:

    df.coalesce(1)
      .write.mode(SaveMode.Overwrite).partitionBy("network_id")
      .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")

I know this is not ideal because funneling the whole write through a single task costs time, but reads improve a lot. Right now, I am using this method to partition my data.
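For reference, here is roughly the whole job (a sketch only: the SparkContext/SQLContext setup, the input path, and the values of alluxioURL, outFolderName, dailyFormat and jobRunTime are stand-ins for my real configuration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.joda.time.DateTime
    import org.joda.time.format.DateTimeFormat

    object WriteFactAdminHourly {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("fact-admin-hourly"))
        val sqlContext = new SQLContext(sc)

        // Stand-in values; the real ones come from my job config.
        val alluxioURL    = "alluxio://master1:19999"
        val outFolderName = "FACT_ADMIN_HOURLY"
        val dailyFormat   = DateTimeFormat.forPattern("yyyy-MM-dd")
        val jobRunTime    = DateTime.now()

        // Alluxio client block size, as Gene suggested below.
        sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "512mb")

        // Illustrative input path.
        val df = sqlContext.read.parquet(s"$alluxioURL/staging/admin_hourly")

        // coalesce(1) avoids a full shuffle but funnels the write through a
        // single task, so each network_id directory ends up with one part
        // file instead of one per executor.
        df.coalesce(1)
          .write.mode(SaveMode.Overwrite)
          .partitionBy("network_id")
          .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")

        sc.stop()
      }
    }

The trade-off is one slow writer task per run in exchange for one large part file per partition directory.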
Regards,
Chanh


> On Jul 8, 2016, at 8:33 PM, Gene Pang <gene.p...@gmail.com> wrote:
>
> Hi Chanh,
>
> You should be able to set the Alluxio block size with:
>
> sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")
>
> I think you have many parquet files because you have many Spark executors
> writing out their partitions of the files.
>
> Hope that helps,
> Gene
>
> On Sun, Jul 3, 2016 at 8:02 PM, Chanh Le <giaosu...@gmail.com> wrote:
> Hi Gene,
> Could you give some suggestions on that?
>
>
>> On Jul 1, 2016, at 5:31 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> The comment from zhangxiongfei was from a year ago.
>>
>> Maybe something has changed since then?
>>
>> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le <giaosu...@gmail.com> wrote:
>> Hi Ted,
>> I set
>>   sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
>>   sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>> but it does not seem to work.
>>
>> <Screen_Shot_2016-07-01_at_2_06_27_PM.png>
>>
>>
>>> On Jul 1, 2016, at 11:38 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> Looking under the Alluxio source, it seems only "fs.hdfs.impl.disable.cache"
>>> is in use.
>>>
>>> FYI
>>>
>>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> OK, I came across this issue. Not sure if you have already assessed it:
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>>> The workaround mentioned there may work for you.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosu...@gmail.com> wrote:
>>> Hi Deepak,
>>> Thanks for replying. The way I write into Alluxio is:
>>>
>>>   df.write.mode(SaveMode.Append).partitionBy("network_id", "time")
>>>     .parquet("alluxio://master1:19999/FACT_ADMIN_HOURLY")
>>>
>>> I partition by two columns and store. I just want the writes to automatically
>>> produce file sizes matching what I already set in Alluxio: 512MB per block.
>>>
>>>
>>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>
>>>> Before writing, coalesce your RDD to 1.
>>>> That will create only one output file.
>>>> Multiple part files happen because each of your executors writes its
>>>> partition to a separate part file.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosu...@gmail.com> wrote:
>>>> Hi everyone,
>>>> I am using Alluxio for storage, but I am a little confused: I set the
>>>> Alluxio block size to 512MB, yet my file parts are only a few KB each and
>>>> there are too many of them.
>>>> Is that normal? I want reads to be fast; do that many parts affect read
>>>> performance?
>>>> How can I set the size of the file parts?
>>>>
>>>> Thanks.
>>>> Chanh
>>>>
>>>>
>>>> <Screen_Shot_2016-07-01_at_9_24_55_AM.png>