[jira] [Commented] (PARQUET-460) Parquet files concat tool

2016-02-14 Thread flykobe cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147015#comment-15147015
 ] 

flykobe cheng commented on PARQUET-460:
---

I have submitted a pull request for the merge tool; could you help review it? 
Thank you very much.

> Parquet files concat tool
> -
>
> Key: PARQUET-460
> URL: https://issues.apache.org/jira/browse/PARQUET-460
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.7.0, 1.8.0
>Reporter: flykobe cheng
>Assignee: flykobe cheng
>
> Currently, parquet file generation is time consuming; most of the time is 
> spent on serialization and compression. It takes about 10 minutes to generate 
> a ~100MB parquet file in our scenario. We want to improve write performance 
> without generating too many small files, which would hurt read performance.
> We propose to:
> 1. generate several small parquet files concurrently
> 2. merge the small files into one file: concatenate the parquet blocks in 
> binary (without SerDe), merge the footers, and rewrite the path and offset 
> metadata.
> We created the ParquetFilesConcat class for step 2. It can be invoked via 
> parquet.tools.command.ConcatCommand. If this feature is approved by the 
> parquet community, we will integrate it into Spark.
> It will impact compression and introduce more dictionary pages, but this can 
> be mitigated by adjusting the concurrency of step 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-460) Parquet files concat tool

2016-02-03 Thread flykobe cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130055#comment-15130055
 ] 

flykobe cheng commented on PARQUET-460:
---

Thank you for your reply, Ryan Blue! That's exactly what we want! 
Do you mind if I wrap a MergeCommand in parquet-tools, or in Spark? 



[jira] [Comment Edited] (PARQUET-460) Parquet files concat tool

2016-02-03 Thread flykobe cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130055#comment-15130055
 ] 

flykobe cheng edited comment on PARQUET-460 at 2/3/16 8:56 AM:
---

Thank you for your reply, @Ryan Blue! That's exactly what we want! 
Do you mind if I wrap a MergeCommand in parquet-tools, or in Spark? 


was (Author: flykobe):
Thank you for your reply, Ryan Blue! That's exactly what we want! 
Do you mind if I wrap a MergeCommand in parquet-tools, or in Spark? 



[jira] [Comment Edited] (PARQUET-460) Parquet files concat tool

2016-02-03 Thread flykobe cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130055#comment-15130055
 ] 

flykobe cheng edited comment on PARQUET-460 at 2/3/16 8:58 AM:
---

Thank you for your reply, [~rdblue]! That's exactly what we want! 
Do you mind if I wrap a MergeCommand in parquet-tools, or in Spark? 


was (Author: flykobe):
Thank you for your reply, @Ryan Blue! That's exactly what we want! 
Do you mind if I wrap a MergeCommand in parquet-tools, or in Spark? 



[jira] [Assigned] (PARQUET-460) Parquet files concat tool

2016-02-01 Thread flykobe cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

flykobe cheng reassigned PARQUET-460:
-

Assignee: flykobe cheng



[jira] [Created] (PARQUET-460) Parquet files concat tool

2016-01-23 Thread flykobe cheng (JIRA)
flykobe cheng created PARQUET-460:
-

 Summary: Parquet files concat tool
 Key: PARQUET-460
 URL: https://issues.apache.org/jira/browse/PARQUET-460
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.8.0, 1.7.0
Reporter: flykobe cheng


Currently, parquet file generation is time consuming; most of the time is spent 
on serialization and compression. It takes about 10 minutes to generate a 
~100MB parquet file in our scenario. We want to improve write performance 
without generating too many small files, which would hurt read performance.

We propose to:
1. generate several small parquet files concurrently
2. merge the small files into one file: concatenate the parquet blocks in 
binary (without SerDe), merge the footers, and rewrite the path and offset 
metadata.
We created the ParquetFilesConcat class for step 2. It can be invoked via 
parquet.tools.command.ConcatCommand. If this feature is approved by the 
parquet community, we will integrate it into Spark.

It will impact compression and introduce more dictionary pages, but this can 
be mitigated by adjusting the concurrency of step 1.
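The merge step described above can be sketched with a toy model. This is
illustrative only: `Block` and `ToyFile` are simplified stand-ins, not the real
parquet-mr metadata classes. The point it demonstrates is that block payloads
are copied verbatim (no re-serialization or re-compression), and only each
block's recorded offset in the merged footer is shifted by the number of bytes
already written.

```python
# Toy sketch of binary block concatenation with footer offset rewriting.
# NOT real Parquet internals; Block/ToyFile are hypothetical stand-ins.

from dataclasses import dataclass


@dataclass
class Block:
    offset: int  # byte position of the block within its file
    length: int  # size of the serialized block in bytes


@dataclass
class ToyFile:
    data: bytes        # concatenated block payloads
    blocks: list       # footer metadata: one Block entry per row group


def concat(files):
    """Merge files by appending payloads and rewriting block offsets."""
    out_data = bytearray()
    out_blocks = []
    for f in files:
        base = len(out_data)      # bytes already written to the merged file
        out_data += f.data        # copy payload verbatim (no SerDe)
        for b in f.blocks:
            # length is unchanged; offset is shifted into the merged file
            out_blocks.append(Block(offset=base + b.offset, length=b.length))
    return ToyFile(data=bytes(out_data), blocks=out_blocks)


a = ToyFile(data=b"AAAABB", blocks=[Block(0, 4), Block(4, 2)])
b = ToyFile(data=b"CCC", blocks=[Block(0, 3)])
merged = concat([a, b])
```

After the merge, `merged.blocks` records offsets 0, 4, and 6: the third block's
offset moved from 0 to 6 because six payload bytes precede it, which is the
same bookkeeping the proposal applies to row-group offsets in the real footer.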



