Andras Istvan Nagy created KYLIN-4299:
-----------------------------------------
Summary: Issue with building real-time segment cache into HBase
when using S3 as working dir
Key: KYLIN-4299
URL: https://issues.apache.org/jira/browse/KYLIN-4299
Project: Kylin
Issue Type: Bug
Components: Real-time Streaming
Affects Versions: v3.0.0-alpha2
Reporter: Andras Istvan Nagy
We have a problem using S3 as the working dir for Kylin together with real-time
streaming. Our goal is to keep no state in HDFS, so that the runtime environment
running Kylin is stateless.

Our HBase data is already on S3, but {{kylin.env.hdfs-working-dir}} also holds
persistent data (cube dictionaries), so it needs to be on S3 as well. That would
let us fail over to a new cluster without having to rebuild all cubes.

With the real-time streaming feature, Kylin persists segment caches hourly, and
a MapReduce job merges those hourly segments into HBase. These MapReduce jobs
fail with the following exception:
{code:java}
Error: java.lang.IllegalArgumentException: Wrong FS: s3://kylin-XXXXX/kylin-dev/hdfs-rootdir/kylin_metadata/stream/tops_jaywalks/20191206010000_20191206020000/1/1, expected: hdfs://ip-24-0-3-243.us-west-2.compute.internal:8020
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:669)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:897)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:964)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:961)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:971)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1551)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1577)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1625)
    at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:1808)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1807)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1785)
    at org.apache.hadoop.fs.FileSystem$6.<init>(FileSystem.java:1887)
    at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1885)
    at org.apache.kylin.engine.mr.streaming.ColumnarFilesReader.checkPath(ColumnarFilesReader.java:46)
    at org.apache.kylin.engine.mr.streaming.ColumnarFilesReader.<init>(ColumnarFilesReader.java:41)
    at org.apache.kylin.engine.mr.streaming.DictsReader.<init>(DictsReader.java:43)
    at org.apache.kylin.engine.mr.streaming.ColumnarSplitDictReader.init(ColumnarSplitDictReader.java:65)
    at org.apache.kylin.engine.mr.streaming.ColumnarSplitDictReader.<init>(ColumnarSplitDictReader.java:52)
    at org.apache.kylin.engine.mr.streaming.ColumnarSplitDictInputFormat.createRecordReader(ColumnarSplitDictInputFormat.java:32)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:524)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
    at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:173)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179)
    at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179)
    at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
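
The trace shows {{FileSystem.checkPath}} rejecting an {{s3://}} path on a {{DistributedFileSystem}} instance, which suggests the reader is listing the segment directory through the cluster's default filesystem rather than the filesystem that owns the path. The following is a minimal, stdlib-only sketch of that kind of scheme/authority check. It is a simplified model, not Hadoop's or Kylin's actual code, and the class name {{WrongFsSketch}}, its methods, and the host and bucket names are all hypothetical.

```java
import java.net.URI;

// Simplified model of a "does this path belong to this filesystem?" check.
// All names here are illustrative; this is not Hadoop or Kylin source code.
public class WrongFsSketch {

    // True if the path can be served by the filesystem identified by fsUri.
    public static boolean belongsTo(URI fsUri, URI path) {
        String scheme = path.getScheme();
        if (scheme == null) {
            return true; // unqualified path: resolved against this filesystem
        }
        if (!scheme.equalsIgnoreCase(fsUri.getScheme())) {
            return false; // e.g. s3 path handed to an hdfs filesystem
        }
        String authority = path.getAuthority();
        return authority == null || authority.equalsIgnoreCase(fsUri.getAuthority());
    }

    // Throws in the same style as the exception in the report when the path is foreign.
    public static void checkPath(URI fsUri, URI path) {
        if (!belongsTo(fsUri, path)) {
            throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    // Derives the filesystem URI from the path itself (scheme plus authority).
    public static URI fileSystemUriFor(URI path) {
        return URI.create(path.getScheme() + "://" + path.getAuthority());
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode.example:8020");          // hypothetical default FS
        URI segment = URI.create("s3://example-bucket/kylin/stream/1/1");    // hypothetical segment dir

        try {
            checkPath(defaultFs, segment); // path checked against the default FS
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }

        // Checking against the filesystem derived from the path itself succeeds.
        checkPath(fileSystemUriFor(segment), segment);
        System.out.println("accepted by " + fileSystemUriFor(segment));
    }
}
```

In Hadoop terms, the first call corresponds roughly to using {{FileSystem.get(conf)}} and the second to using {{path.getFileSystem(conf)}}. Whether {{ColumnarFilesReader.checkPath}} actually obtains its filesystem the first way is an assumption based on the trace and has not been verified against the Kylin source.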