[ https://issues.apache.org/jira/browse/KYLIN-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094228#comment-17094228 ]
ASF subversion and git services commented on KYLIN-4320:
---------------------------------------------------------

Commit 78afb52b57736bf0bfd10a0299ac9b44f1119400 in kylin's branch refs/heads/master from Shao Feng Shi
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=78afb52 ]

Revert "KYLIN-4320 number of replicas of Cuboid files cannot be configured for Spark engine"

This reverts commit 926515bfc217167fe570c0cf21a39f54e5b5d1ff.

> number of replicas of Cuboid files cannot be configured for Spark engine
> ------------------------------------------------------------------------
>
>                 Key: KYLIN-4320
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4320
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v3.0.1
>            Reporter: Congling Xia
>            Assignee: Yaqian Zhang
>            Priority: Major
>             Fix For: v3.1.0
>
>         Attachments: cuboid_replications.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Hi, team. I tried to change `dfs.replication` to 3 by adding the following config override:
> {code:java}
> kylin.engine.spark-conf.spark.hadoop.dfs.replication=3
> {code}
> I then got a strange result: the number of replicas of the cuboid files varies, even though the files are at the same level.
> !cuboid_replications.png!
> I suspect this is caused by the conflicting settings hard-coded in SparkUtil:
> {code:java}
> public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
>     sc.hadoopConfiguration().set("dfs.replication", "2"); // cuboid intermediate files, replication=2
>     sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
>     sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
>     sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec",
>             "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
> }
> {code}
> This may be a Spark property precedence problem. After checking the [Spark documentation|https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties], it seems that some programmatically set properties may not take effect, and setting properties in code is not the recommended way to configure a Spark job.
>
> In any case, cuboid files may survive for weeks until they expire or are merged, so the configuration rewrite in `org.apache.kylin.engine.spark.SparkUtil#modifySparkHadoopConfiguration` makes those files less reliable.
> Is there any way to force cuboid files to keep 3 replicas? Or should we remove the code in SparkUtil so that `kylin.engine.spark-conf.spark.hadoop.dfs.replication` works properly?
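> For illustration, below is a minimal sketch of the second option: a guarded version of `modifySparkHadoopConfiguration` that only applies the hard-coded default when no `spark.hadoop.dfs.replication` override is present. This is an assumption about a possible fix, not the actual patch, and the wrapper class name `SparkUtilSketch` is hypothetical:
> {code:java}
> import org.apache.spark.SparkContext;
>
> public class SparkUtilSketch {
>     public static void modifySparkHadoopConfiguration(SparkContext sc) {
>         // spark.hadoop.* properties are copied into sc.hadoopConfiguration() when
>         // the SparkContext is created, so unconditionally calling set() afterwards
>         // would silently override a user-supplied value.
>         if (!sc.getConf().contains("spark.hadoop.dfs.replication")) {
>             // No user override: fall back to the old default of 2 replicas
>             // for cuboid intermediate files.
>             sc.hadoopConfiguration().set("dfs.replication", "2");
>         }
>         sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
>         sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
>         sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec",
>                 "org.apache.hadoop.io.compress.DefaultCodec");
>     }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)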