[jira] [Commented] (SPARK-20107) Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073951#comment-16073951 ]

Steve Loughran commented on SPARK-20107:

If you are curious, I've just written out the v1 and v2 commit algorithms with cost/complexity estimates: https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer_architecture.md

> Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Trivial
> Fix For: 2.2.0
>
> Setting {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}} can speed up [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] for jobs with many output files.
> It saved {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
>     substr(ds, 3, 2),
>     substr(ds, 6, 2),
>     substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
>   AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> We should add this option to [configuration.md|http://spark.apache.org/docs/latest/configuration.html].
> Cloudera's Hadoop 2.6.0-cdh5.4.0 and higher (see [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] and [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) and Apache Hadoop 2.7.0 and higher support this improvement.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
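The option discussed above is passed to Hadoop through Spark's {{spark.hadoop.*}} prefix; a minimal per-job sketch (the application name is a placeholder, not from the issue):

```shell
# Enable the v2 FileOutputCommitter algorithm for a single job.
# Requires Apache Hadoop 2.7.0+ (or a CDH build carrying the backport);
# older Hadoop releases do not recognize the property and fall back to v1.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your_job.py   # placeholder application
```

The same key can equally be set in spark-defaults.conf to apply cluster-wide.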
[jira] [Commented] (SPARK-20107) Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981552#comment-15981552 ]

Steve Loughran commented on SPARK-20107:

This does not solve the problem you think it does, not with S3. Both commit algorithms assume the filesystem is consistent: that a listFiles will return the list of files they need to rename. S3 isn't a consistent FS, so newly created files aren't guaranteed to get picked up. It may appear to work well in development, but in production there is a non-zero chance data will get missed. You need the HADOOP-13786 committer underneath Spark for high-performance, fault-tolerant renames.
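The HADOOP-13786 work referenced in the comment above later shipped as the S3A committers in Hadoop 3.1. A hedged sketch of enabling one from Spark, assuming Hadoop 3.1+ with the hadoop-aws module on the classpath (these property names postdate this comment, and the application name is a placeholder):

```shell
# Sketch: route commits through the S3A "magic" committer instead of
# rename-based FileOutputCommitter, avoiding the list-then-rename step
# that is unsafe on an eventually consistent store.
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  your_job.py   # placeholder application
```

Consult the hadoop-aws committer documentation for your Hadoop version before relying on this; the "directory" and "partitioned" staging committers are alternatives with different trade-offs.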