[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22847
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r230248773

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The threshold of source code length without comment of a single Java function by " +
+      "codegen to be split. When the generated Java function source code exceeds this threshold" +
+      ", it will be split into multiple small functions. We can't know how many bytecode will " +
+      "be generated, so use the code length as metric. A function's bytecode should not go " +
+      "beyond 8KB, otherwise it will not be JITted; it also should not be too small, otherwise " +
+      "there will be many function calls.")
+    .intConf
--- End diff --

@rednaxelafx the wide table benchmark I used has 400 columns; whole-stage codegen is disabled by default.
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229943260

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

Oh I see, you're using the column name... that's not the right place to put the "prefix". Column names are almost never carried over to the generated code in the current framework (the only exception is the lambda variable name).
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229942325

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

The "freshNamePrefix" prefix is only applied in whole-stage codegen:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L87
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L169

It doesn't take effect in non-whole-stage codegen. If you intend to stress test expression codegen but don't see the prefix being prepended, you're probably not adding it in the right place. Where did you add it?
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229919857

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

Seems like long alias names have no influence.

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6
[info] Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
[info] projection on wide table:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------
[info] split threshold 10                   6512 / 6736          0.2        6210.4       1.0X
[info] split threshold 100                  5730 / 6329          0.2        5464.9       1.1X
[info] split threshold 1024                 3119 / 3184          0.3        2974.6       2.1X
[info] split threshold 2048                 2981 / 3100          0.4        2842.9       2.2X
[info] split threshold 4096                 3289 / 3379          0.3        3136.6       2.0X
[info] split threshold 8196                 4307 / 4338          0.2        4108.0       1.5X
[info] split threshold 65536              29147 / 30212          0.0       27797.0       0.2X
```

No `averylongprefixrepeatedmultipletimes` in the **expression codegen**:

```
/* 047 */   private void createExternalRow_0_8(InternalRow i, Object[] values_0) {
/* 048 */
/* 049 */     // input[80, bigint, false]
/* 050 */     long value_81 = i.getLong(80);
/* 051 */     if (false) {
/* 052 */       values_0[80] = null;
/* 053 */     } else {
/* 054 */       values_0[80] = value_81;
/* 055 */     }
/* 056 */
/* 057 */     // input[81, bigint, false]
/* 058 */     long value_82 = i.getLong(81);
/* 059 */     if (false) {
/* 060 */       values_0[81] = null;
/* 061 */     } else {
/* 062 */       values_0[81] = value_82;
/* 063 */     }
/* 064 */
/* 065 */     // input[82, bigint, false]
/* 066 */     long value_83 = i.getLong(82);
/* 067 */     if (false) {
/* 068 */       values_0[82] = null;
/* 069 */     } else {
/* 070 */       values_0[82] = value_83;
/* 071 */     }
/* 072 */     ...
```

My benchmark:

```scala
object WideTableBenchmark extends SqlBasedBenchmark {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("projection on wide table") {
      val N = 1 << 20
      val df = spark.range(N)
      val columns = (0 until 400).map { i => s"id as averylongprefixrepeatedmultipletimes_id$i" }
      val benchmark = new Benchmark("projection on wide table", N, output = output)
      Seq("10", "100", "1024", "2048", "4096", "8196", "65536").foreach { n =>
        benchmark.addCase(s"split threshold $n", numIters = 5) { iter =>
          withSQLConf("spark.testing.codegen.splitThreshold" -> n) {
            df.selectExpr(columns: _*).foreach(identity(_))
          }
        }
      }
      benchmark.run()
    }
  }
}
```

Will keep benchmarking for the complex expressions.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229879855

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

Could you try some very long alias names or complex expressions? You will get different numbers, right?
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229638917

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

@kiszk agree, `1000` might not be the best; see my benchmark for the wide table, `2048` is better.

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
projection on wide table:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------
split threshold 10                   8464 / 8737          0.1        8072.0       1.0X
split threshold 100                  5959 / 6251          0.2        5683.4       1.4X
split threshold 1024                 3202 / 3248          0.3        3053.2       2.6X
split threshold 2048                 3009 / 3097          0.3        2869.2       2.8X
split threshold 4096                 3414 / 3458          0.3        3256.1       2.5X
split threshold 8196                 4095 / 4112          0.3        3905.5       2.1X
split threshold 65536              28800 / 29705          0.0       27465.8       0.3X
```
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229577559

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,18 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The threshold of source-code splitting in the codegen. When the number of characters " +
+      "in a single JAVA function (without comment) exceeds the threshold, the function will be " +
--- End diff --

nit: `JAVA` -> `Java`
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229577345

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

1000 is conservative. But there is no recommendation, since the bytecode size depends on the content (e.g. loading the constant `0` takes one byte of bytecode, while `9` takes two).
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229572426

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,18 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The threshold of source code length without comment of a single Java function by " +
+      "codegen to be split. When the generated Java function source code exceeds this threshold" +
+      ", it will be split into multiple small functions. We cannot know how many bytecode will " +
--- End diff --

`The threshold of source-code splitting in the codegen. When the number of characters in a single JAVA function (without comment) exceeds the threshold, the function will be automatically split to multiple smaller ones.`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229538148

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The threshold of source code length without comment of a single Java function by " +
+      "codegen to be split. When the generated Java function source code exceeds this threshold" +
+      ", it will be split into multiple small functions. We cannot know how many bytecode will " +
+      "be generated, so use the code length as metric. When running on HotSpot, a function's " +
+      "bytecode should not go beyond 8KB, otherwise it will not be JITted; it also should not " +
+      "be too small, otherwise there will be many function calls.")
+    .intConf
+    .createWithDefault(1024)
--- End diff --

let's add a `checkValue` to make sure the value is positive. We can figure out a lower and upper bound later.
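A minimal sketch of what the suggested positivity check might look like, following the builder pattern other `SQLConf` entries use to chain `checkValue`. The doc string is abbreviated and the error message is illustrative, not necessarily what the PR finally merged:

```scala
// Hypothetical sketch, assuming the surrounding SQLConf/ConfigBuilder context:
// reject non-positive thresholds at set time instead of failing later in codegen.
val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
  .internal()
  .doc("The threshold of source-code splitting in the codegen (doc abbreviated).")
  .intConf
  .checkValue(threshold => threshold > 0, "The method split threshold must be positive.")
  .createWithDefault(1024)
```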
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r229478558

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

Yep. That could be `max`.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228789484

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

Not a big deal, but I would avoid abbreviation in documentation. `can't` -> `cannot`
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228788123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

To be more accurate, I think I should add `When running on HotSpot, a function's bytecode should not go beyond 8KB...`.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228755710

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

According to the description, it seems that we had better have `checkValue` here. Could you recommend the reasonable min/max values, @kiszk ?
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228600757

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
     .intConf
     .createWithDefault(65535)

+  val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold")
+    .internal()
+    .doc("The maximum source code length of a single Java function by codegen. When the " +
+      "generated Java function source code exceeds this threshold, it will be split into " +
+      "multiple small functions, each function length is spark.sql.codegen.methodSplitThreshold." +
+      " A function's bytecode should not go beyond 8KB, otherwise it will not be JITted, should " +
+      "also not be too small, or we will have many function calls. We can't know how many " +
--- End diff --

it will not be JITted; it also should not be too small, otherwise there will be many function calls.
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228598058

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

IMHO, `spark.sql.codegen.methodSplitThreshold` can be used, but the description should be changed. For example: `The threshold of source code length without comment of a single Java function by codegen to be split. When the generated Java function source code exceeds this threshold, it will be split into multiple small functions. ...`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228483780

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (same hunk as quoted above) ---

`each function length is spark.sql.codegen.methodSplitThreshold` is not true; the method size is always larger than the threshold.

cc @kiszk any idea about the naming and description of this config?
GitHub user yucai opened a pull request: https://github.com/apache/spark/pull/22847

[SPARK-25850][SQL] Make the split threshold for the code generated method configurable

## What changes were proposed in this pull request?

As per the [discussion](https://github.com/apache/spark/pull/22823/files#r228400706), add a new configuration to make the split threshold for the code generated method configurable. When the generated Java function source code exceeds the split threshold, it will be split into multiple small functions, each function length is spark.sql.codegen.methodSplitThreshold.

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yucai/spark splitThreshold

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22847.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22847

commit 188a9476e3504c151ebe27b362a080469c262674
Author: yucai
Date: 2018-10-26T08:00:24Z

    [SPARK-25850][SQL] Make the split threshold for the code generated method configurable
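For readers wanting to exercise the new knob, a sketch of how it could be tried from a spark-shell or local application follows. Note the config is registered as `internal()`, so it does not appear in user-facing documentation but can still be set at runtime; the threshold value, column count, and app name here are arbitrary examples, not anything prescribed by the PR:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch: lower the split threshold, then run a wide
// projection so the generated function is split into smaller methods.
object SplitThresholdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("method-split-threshold-demo")
      .getOrCreate()

    // Internal config, but settable at runtime like any other SQL conf.
    spark.conf.set("spark.sql.codegen.methodSplitThreshold", "1024")

    // 400 aliased columns, mirroring the wide-table benchmark in the review thread.
    val wideColumns = (0 until 400).map(i => s"id as c$i")
    spark.range(1 << 20).selectExpr(wideColumns: _*).foreach(_ => ())

    spark.stop()
  }
}
```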