Congling Xia created KYLIN-4320:
-----------------------------------

             Summary: number of replicas of Cuboid files cannot be configured for Spark engine
                 Key: KYLIN-4320
                 URL: https://issues.apache.org/jira/browse/KYLIN-4320
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
            Reporter: Congling Xia
         Attachments: cuboid_replications.png
Hi, team. I tried to change `dfs.replication` to 3 by adding the following config override:

{code:java}
kylin.engine.spark-conf.spark.hadoop.dfs.replication=3
{code}

Then I got a strange result - the number of replicas of cuboid files varies even though the files are in the same level.

!cuboid_replications.png!

I guess it is due to the conflicting settings in SparkUtil:

{code:java}
public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
    sc.hadoopConfiguration().set("dfs.replication", "2"); // cuboid intermediate files, replication=2
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
}
{code}

It may be an issue of Spark property precedence. After checking the [Spark documentation|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties], it seems that properties set programmatically at runtime may not take effect, and this is not the recommended way to configure a Spark job.

In any case, cuboid files may survive for weeks until they expire or are merged, so the configuration rewrite in `org.apache.kylin.engine.spark.SparkUtil#modifySparkHadoopConfiguration` makes those files less reliable. Is there any way to force cuboid files to keep 3 replicas? Or shall we remove the code in SparkUtil so that `kylin.engine.spark-conf.spark.hadoop.dfs.replication` works properly?
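For discussion, one possible fix - a minimal sketch, not a tested patch - would be to treat the hardcoded replication factor as a default rather than a hard override. Spark copies `spark.hadoop.*` entries into `sc.hadoopConfiguration()` when the context is created, so using Hadoop's `Configuration#setIfUnset` should preserve a user-supplied `kylin.engine.spark-conf.spark.hadoop.dfs.replication` value while keeping the current behavior of 2 replicas when nothing is configured:

{code:java}
// Sketch only: apply replication=2 as a default instead of an unconditional set.
// setIfUnset() leaves "dfs.replication" alone if Spark has already copied a
// user-supplied spark.hadoop.dfs.replication value into this Configuration.
public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
    sc.hadoopConfiguration().setIfUnset("dfs.replication", "2"); // cuboid intermediate files
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec",
            "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
}
{code}

One caveat with this sketch: a cluster-wide `dfs.replication` from an hdfs-site.xml on the classpath would also take precedence over the hardcoded 2, which may or may not be the desired behavior.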