Congling Xia created KYLIN-4320:
-----------------------------------
Summary: number of replicas of Cuboid files cannot be configured
for Spark engine
Key: KYLIN-4320
URL: https://issues.apache.org/jira/browse/KYLIN-4320
Project: Kylin
Issue Type: Bug
Components: Job Engine
Reporter: Congling Xia
Attachments: cuboid_replications.png
Hi, team. I tried to change `dfs.replication` to 3 by adding the following config override:
{code:java}
kylin.engine.spark-conf.spark.hadoop.dfs.replication=3
{code}
Then I got a strange result: the numbers of replicas of the cuboid files vary even though the files are at the same level.
!cuboid_replications.png!
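For reference, the replication counts shown in the screenshot can also be checked directly through the HDFS API. A minimal sketch follows; the cuboid path is hypothetical and depends on the HDFS working dir and the job being built:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class CheckCuboidReplication {
    public static void main(String[] args) throws Exception {
        // Hypothetical cuboid output directory; the real path depends on
        // kylin.env.hdfs-working-dir and the cube/segment being built.
        Path cuboidDir = new Path("/kylin/kylin_metadata/kylin-<job_id>/<cube_name>/cuboid");
        FileSystem fs = FileSystem.get(new Configuration());
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(cuboidDir, true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            // print each cuboid file with its current replication factor
            System.out.println(status.getPath() + " -> replication=" + status.getReplication());
        }
    }
}
{code}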
I guess it is caused by the conflicting setting in SparkUtil:
{code:java}
public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
    sc.hadoopConfiguration().set("dfs.replication", "2"); // cuboid intermediate files, replication=2
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
    sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.DefaultCodec"); // or org.apache.hadoop.io.compress.SnappyCodec
}
{code}
It may be a matter of Spark property precedence. After checking the [Spark documentation|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties], it seems that some programmatically set properties may not take effect, and this is not a recommended way to configure Spark jobs.
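For context, here is a minimal sketch (not Kylin code) of the interaction I suspect: `spark.hadoop.*` properties set before the context is created are copied into the Hadoop configuration, but a later programmatic `set()` such as the one in SparkUtil silently wins:
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class ReplicationPrecedenceDemo {
    public static void main(String[] args) {
        // spark.hadoop.* properties are copied into sc.hadoopConfiguration()
        // when the context is created
        SparkConf conf = new SparkConf()
                .setAppName("replication-precedence-demo")
                .setMaster("local[*]")
                .set("spark.hadoop.dfs.replication", "3");
        SparkContext sc = new SparkContext(conf);
        System.out.println(sc.hadoopConfiguration().get("dfs.replication")); // 3

        // a later programmatic set(), like the one in SparkUtil, overrides the user value
        sc.hadoopConfiguration().set("dfs.replication", "2");
        System.out.println(sc.hadoopConfiguration().get("dfs.replication")); // 2

        sc.stop();
    }
}
{code}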
Anyway, cuboid files may survive for weeks until they expire or are merged, so the configuration override in `org.apache.kylin.engine.spark.SparkUtil#modifySparkHadoopConfiguration` makes those files less reliable.
Is there any way to force cuboid files to keep 3 replicas? Or shall we remove the hard-coded setting in SparkUtil so that `kylin.engine.spark-conf.spark.hadoop.dfs.replication` works properly?
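If removing the line entirely is too aggressive, one possible middle ground is sketched below. This is only an illustration, assuming the user's override arrives as `spark.hadoop.dfs.replication` on the SparkConf: keep the default of 2 replicas only when no explicit override was given.
{code:java}
import org.apache.spark.SparkContext;

public class SparkUtilSketch {
    // Hypothetical variant of modifySparkHadoopConfiguration (not the actual fix):
    // force replication=2 only when no explicit override was passed via
    // kylin.engine.spark-conf.spark.hadoop.dfs.replication.
    public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
        if (!sc.getConf().contains("spark.hadoop.dfs.replication")) {
            sc.hadoopConfiguration().set("dfs.replication", "2"); // Kylin's current default
        }
        sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
        sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
        sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.DefaultCodec");
    }
}
{code}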
--
This message was sent by Atlassian Jira
(v8.3.4#803005)