[GitHub] [druid] clintropolis merged pull request #9203: [Backport] Web console: fix refresh button in segments view
clintropolis merged pull request #9203: [Backport] Web console: fix refresh button in segments view URL: https://github.com/apache/druid/pull/9203 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[druid] branch 0.17.0 updated (7c7fffc -> 6874194)
This is an automated email from the ASF dual-hosted git repository. cwylie pushed a change to branch 0.17.0 in repository https://gitbox.apache.org/repos/asf/druid.git. from 7c7fffc Update Kinesis resharding information about task failures (#9104) (#9201) add 6874194 fix refresh button (#9195) (#9203) No new revisions were added by this update. Summary of changes: .../src/views/segments-view/segments-view.tsx | 29 +++--- 1 file changed, 14 insertions(+), 15 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[GitHub] [druid] clintropolis merged pull request #9201: [Backport] Update Kinesis resharding information about task failures (#9104)
clintropolis merged pull request #9201: [Backport] Update Kinesis resharding information about task failures (#9104) URL: https://github.com/apache/druid/pull/9201
[druid] branch 0.17.0 updated (e6246c9 -> 7c7fffc)
This is an automated email from the ASF dual-hosted git repository. cwylie pushed a change to branch 0.17.0 in repository https://gitbox.apache.org/repos/asf/druid.git. from e6246c9 Fix deserialization of maxBytesInMemory (#9092) (#9170) add 7c7fffc Update Kinesis resharding information about task failures (#9104) (#9201) No new revisions were added by this update. Summary of changes: docs/development/extensions-core/kinesis-ingestion.md | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-)
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367775037

## File path: docs/development/extensions-core/hdfs.md

@@ -94,7 +94,7 @@ For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/d

 Configuration for Google Cloud Storage

-To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
+To use the Google Cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly.

Review comment: Thanks, I made changes based on the suggestions. But I would still want to keep the example properties for GCS, since they are pretty much mandatory. A similar pattern is applied to [S3 configuration](https://github.com/apache/druid/pull/9171/files#diff-51abd0f049462a98772db4c6ea063be3R66-R93).
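For context, the "example properties" the comment refers to would resemble the following sketch of a GCS deep-storage setup via the `druid-hdfs-storage` extension. The bucket name and path below are placeholders for illustration, not values taken from the PR:

```properties
# Hypothetical common.runtime.properties fragment for GCS deep storage.
# GCS is accessed through the HDFS deep-storage extension plus the gcs-connector jar.
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://example-bucket/druid/segments
```

The `gs://` scheme is resolved by the Hadoop GCS connector, which must be on the classpath and configured in `core-site.xml`.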
[GitHub] [druid] clintropolis merged pull request #9198: Web console: fix bug where arrays can not be emptied out in the coordinator dialog
clintropolis merged pull request #9198: Web console: fix bug where arrays can not be emptied out in the coordinator dialog URL: https://github.com/apache/druid/pull/9198
[GitHub] [druid] clintropolis opened a new pull request #9206: [Backport] Web console: fix bug where arrays can not be emptied out in the coordinator dialog
clintropolis opened a new pull request #9206: [Backport] Web console: fix bug where arrays can not be emptied out in the coordinator dialog URL: https://github.com/apache/druid/pull/9206 Backport of #9198 to 0.17.0.
[druid] branch master updated: allow empty values to be set in the auto form (#9198)
This is an automated email from the ASF dual-hosted git repository. cwylie pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/druid.git

The following commit(s) were added to refs/heads/master by this push:
     new ab26725  allow empty values to be set in the auto form (#9198)

ab26725 is described below

commit ab2672514b306243b8b72d64e7419fd8e8a18fe4
Author: Vadim Ogievetsky
AuthorDate: Thu Jan 16 21:06:51 2020 -0800

    allow empty values to be set in the auto form (#9198)
---
 web-console/src/components/auto-form/auto-form.tsx | 15 +++++++++++----
 .../coordinator-dynamic-config-dialog.tsx          |  3 +++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/web-console/src/components/auto-form/auto-form.tsx b/web-console/src/components/auto-form/auto-form.tsx
index 110bf49..66dffde 100644
--- a/web-console/src/components/auto-form/auto-form.tsx
+++ b/web-console/src/components/auto-form/auto-form.tsx
@@ -45,6 +45,7 @@ export interface Field {
     | 'json'
     | 'interval';
   defaultValue?: any;
+  emptyValue?: any;
   suggestions?: Functor;
   placeholder?: string;
   min?: number;
@@ -99,10 +100,16 @@ export class AutoForm<T extends Record<string, any>> extends React.PureComponent
     const { model } = this.props;
     if (!model) return;

-    const newModel =
-      typeof newValue === 'undefined'
-        ? deepDelete(model, field.name)
-        : deepSet(model, field.name, newValue);
+    let newModel: T;
+    if (typeof newValue === 'undefined') {
+      if (typeof field.emptyValue === 'undefined') {
+        newModel = deepDelete(model, field.name);
+      } else {
+        newModel = deepSet(model, field.name, field.emptyValue);
+      }
+    } else {
+      newModel = deepSet(model, field.name, newValue);
+    }

     this.modelChange(newModel);
   };

diff --git a/web-console/src/dialogs/coordinator-dynamic-config-dialog/coordinator-dynamic-config-dialog.tsx b/web-console/src/dialogs/coordinator-dynamic-config-dialog/coordinator-dynamic-config-dialog.tsx
index 8d82c0c..044e7ea 100644
--- a/web-console/src/dialogs/coordinator-dynamic-config-dialog/coordinator-dynamic-config-dialog.tsx
+++ b/web-console/src/dialogs/coordinator-dynamic-config-dialog/coordinator-dynamic-config-dialog.tsx
@@ -180,6 +180,7 @@ export class CoordinatorDynamicConfigDialog extends React.PureComponent<
     {
       name: 'killDataSourceWhitelist',
       type: 'string-array',
+      emptyValue: [],
       info: (
         <>
           List of dataSources for which kill tasks are sent if property{' '}
@@ -191,6 +192,7 @@ export class CoordinatorDynamicConfigDialog extends React.PureComponent<
     {
       name: 'killPendingSegmentsSkipList',
       type: 'string-array',
+      emptyValue: [],
       info: (
         <>
           List of dataSources for which pendingSegments are NOT cleaned up if property{' '}
@@ -259,6 +261,7 @@ export class CoordinatorDynamicConfigDialog extends React.PureComponent<
     {
       name: 'decommissioningNodes',
       type: 'string-array',
+      emptyValue: [],
       info: (
         <>
           List of historical services to 'decommission'. Coordinator will not assign new
[druid] branch master updated (448da78 -> 68ed2a2)
This is an automated email from the ASF dual-hosted git repository. cwylie pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/druid.git. from 448da78 Speed up String first/last aggregators when folding isn't needed. (#9181) add 68ed2a2 Fix LATEST / EARLIEST Buffer Aggregator does not work on String column (#9197) No new revisions were added by this update. Summary of changes: .../aggregation/first/StringFirstLastUtils.java| 2 +- .../first/StringFirstLastUtilsTest.java| 59 + .../apache/druid/sql/calcite/CalciteQueryTest.java | 147 - 3 files changed, 202 insertions(+), 6 deletions(-) create mode 100644 processing/src/test/java/org/apache/druid/query/aggregation/first/StringFirstLastUtilsTest.java
[GitHub] [druid] clintropolis merged pull request #9197: Fix LATEST / EARLIEST Buffer Aggregator does not work on String column
clintropolis merged pull request #9197: Fix LATEST / EARLIEST Buffer Aggregator does not work on String column URL: https://github.com/apache/druid/pull/9197
[druid] branch master updated (486c0fd -> 448da78)
This is an automated email from the ASF dual-hosted git repository. cwylie pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/druid.git. from 486c0fd Bump Apache Parquet to 1.11.0 (#9129) add 448da78 Speed up String first/last aggregators when folding isn't needed. (#9181) No new revisions were added by this update. Summary of changes: .../apache/druid/java/util/common/StringUtils.java | 17 ++- .../druid/java/util/common/StringUtilsTest.java| 28 +++ .../aggregation/first/StringFirstAggregator.java | 44 +++--- .../first/StringFirstAggregatorFactory.java| 13 -- .../first/StringFirstBufferAggregator.java | 54 -- .../aggregation/first/StringFirstLastUtils.java| 29 +++- .../aggregation/last/StringLastAggregator.java | 44 +++--- .../last/StringLastAggregatorFactory.java | 14 -- .../last/StringLastBufferAggregator.java | 54 -- .../first/StringFirstAggregationTest.java | 8 +++- .../first/StringFirstBufferAggregatorTest.java | 46 -- .../last/StringLastAggregationTest.java| 5 ++ .../last/StringLastBufferAggregatorTest.java | 50 ++-- 13 files changed, 321 insertions(+), 85 deletions(-)
[GitHub] [druid] clintropolis merged pull request #9181: Speed up String first/last aggregators when folding isn't needed.
clintropolis merged pull request #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181
[GitHub] [druid] jon-wei commented on issue #9199: Fix TSV bugs
jon-wei commented on issue #9199: Fix TSV bugs URL: https://github.com/apache/druid/pull/9199#issuecomment-575449429 @jihoonson thanks, latest update lgtm
[GitHub] [druid] jon-wei commented on a change in pull request #9183: fix topn aggregation on numeric columns with null values
jon-wei commented on a change in pull request #9183: fix topn aggregation on numeric columns with null values
URL: https://github.com/apache/druid/pull/9183#discussion_r367748951

## File path: processing/src/main/java/org/apache/druid/query/topn/types/NullableNumericTopNColumnAggregatesProcessor.java

@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.topn.types;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.aggregation.Aggregator;
+import org.apache.druid.query.topn.BaseTopNAlgorithm;
+import org.apache.druid.query.topn.TopNParams;
+import org.apache.druid.query.topn.TopNQuery;
+import org.apache.druid.query.topn.TopNResultBuilder;
+import org.apache.druid.segment.BaseNullableColumnValueSelector;
+import org.apache.druid.segment.Cursor;
+import org.apache.druid.segment.StorageAdapter;
+
+import java.util.Map;
+import java.util.function.Function;
+
+public abstract class NullableNumericTopNColumnAggregatesProcessor<Selector extends BaseNullableColumnValueSelector>
+    implements TopNColumnAggregatesProcessor<Selector>
+{
+  private final boolean hasNulls = !NullHandling.replaceWithDefault();
+  final Function<Object, Comparable<?>> converter;
+  Aggregator[] nullValueAggregates;
+
+  protected NullableNumericTopNColumnAggregatesProcessor(Function<Object, Comparable<?>> converter)
+  {
+    this.converter = converter;
+  }
+
+  abstract Aggregator[] getValueAggregators(TopNQuery query, Selector selector, Cursor cursor);

Review comment: Can you add javadocs for the abstract methods?
[GitHub] [druid] jon-wei commented on a change in pull request #9183: fix topn aggregation on numeric columns with null values
jon-wei commented on a change in pull request #9183: fix topn aggregation on numeric columns with null values
URL: https://github.com/apache/druid/pull/9183#discussion_r367748677

## File path: processing/src/main/java/org/apache/druid/query/topn/types/NullableNumericTopNColumnAggregatesProcessor.java

@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.topn.types;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.aggregation.Aggregator;
+import org.apache.druid.query.topn.BaseTopNAlgorithm;
+import org.apache.druid.query.topn.TopNParams;
+import org.apache.druid.query.topn.TopNQuery;
+import org.apache.druid.query.topn.TopNResultBuilder;
+import org.apache.druid.segment.BaseNullableColumnValueSelector;
+import org.apache.druid.segment.Cursor;
+import org.apache.druid.segment.StorageAdapter;
+
+import java.util.Map;
+import java.util.function.Function;
+
+public abstract class NullableNumericTopNColumnAggregatesProcessor<Selector extends BaseNullableColumnValueSelector>
+    implements TopNColumnAggregatesProcessor<Selector>
+{
+  private final boolean hasNulls = !NullHandling.replaceWithDefault();
+  final Function<Object, Comparable<?>> converter;
+  Aggregator[] nullValueAggregates;
+
+  protected NullableNumericTopNColumnAggregatesProcessor(Function<Object, Comparable<?>> converter)
+  {
+    this.converter = converter;
+  }
+
+  abstract Aggregator[] getValueAggregators(TopNQuery query, Selector selector, Cursor cursor);
+
+  abstract Map getAggregatesStore();
+
+  abstract Comparable convertAggregatorStoreKeyToColumnValue(Object aggregatorStoreKey);
+
+  @Override
+  public int getCardinality(Selector selector)
+  {
+    return TopNParams.CARDINALITY_UNKNOWN;
+  }
+
+  @Override
+  public Aggregator[][] getRowSelector(TopNQuery query, TopNParams params, StorageAdapter storageAdapter)
+  {
+    return null;
+  }
+
+  @Override
+  public long scanAndAggregate(
+      TopNQuery query,
+      Selector selector,
+      Cursor cursor,
+      Aggregator[][] rowSelector
+  )
+  {
+    initAggregateStore();

Review comment: I think the `initAggregateStore` call could be moved into `HeapBasedTopNAlgorithm.scanAndAggregate` since both impls call it as the first step
[GitHub] [druid] lgtm-com[bot] commented on issue #9181: Speed up String first/last aggregators when folding isn't needed.
lgtm-com[bot] commented on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575439630 This pull request **fixes 1 alert** when merging de0697cb1834f77a2fafc57e5d56673a558c5e83 into 486c0fd149d9837a64550ecb9e85d9b6cd4beb24 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-cbc76f7454ddad92381f9db32c521dcbd504afb8) **fixed alerts:** * 1 for Useless null check
[GitHub] [druid] jihoonson commented on issue #9199: Fix TSV bugs
jihoonson commented on issue #9199: Fix TSV bugs URL: https://github.com/apache/druid/pull/9199#issuecomment-575437622 @jon-wei @clintropolis thanks for the review. I needed to delete one test and modify another which was added in https://github.com/apache/druid/pull/8915 because the delimited input format doesn't support those functionalities (recognizing quotes).
[GitHub] [druid] suneet-s opened a new pull request #9205: [0.17.0] Tutorials use new ingestion spec where possible (#9155)
suneet-s opened a new pull request #9205: [0.17.0] Tutorials use new ingestion spec where possible (#9155) URL: https://github.com/apache/druid/pull/9205 Backports the following commits to 0.17.0: - Tutorials use new ingestion spec where possible (#9155)
[GitHub] [druid] suneet-s opened a new pull request #9204: [0.17.0] Link javaOpts to middlemanager runtime.properties docs (#9101)
suneet-s opened a new pull request #9204: [0.17.0] Link javaOpts to middlemanager runtime.properties docs (#9101) URL: https://github.com/apache/druid/pull/9204 Backports the following commits to 0.17.0: - Link javaOpts to middlemanager runtime.properties docs (#9101)
[GitHub] [druid] clintropolis commented on a change in pull request #9181: Speed up String first/last aggregators when folding isn't needed.
clintropolis commented on a change in pull request #9181: Speed up String first/last aggregators when folding isn't needed.
URL: https://github.com/apache/druid/pull/9181#discussion_r367736165

## File path: core/src/test/java/org/apache/druid/java/util/common/StringUtilsTest.java

@@ -246,4 +246,32 @@ public void testRpad()
     Assert.assertEquals(s5, null);
   }

+  @Test
+  public void testChop()
+  {
+    Assert.assertEquals("foo", StringUtils.chop("foo", 5));
+    Assert.assertEquals("fo", StringUtils.chop("foo", 2));
+    Assert.assertEquals("", StringUtils.chop("foo", 0));
+    Assert.assertEquals("smile for", StringUtils.chop("smile for the camera", 14));

Review comment:
[GitHub] [druid] clintropolis opened a new pull request #9203: [Backport] Web console: fix refresh button in segments view
clintropolis opened a new pull request #9203: [Backport] Web console: fix refresh button in segments view URL: https://github.com/apache/druid/pull/9203 Backport of #9195 to 0.17.0.
[GitHub] [druid] clintropolis opened a new pull request #9202: [Backport] fix null handling for arithmetic post aggregator comparator
clintropolis opened a new pull request #9202: [Backport] fix null handling for arithmetic post aggregator comparator URL: https://github.com/apache/druid/pull/9202 Backport of #9159 to 0.17.0.
[GitHub] [druid] gianm commented on issue #9181: Speed up String first/last aggregators when folding isn't needed.
gianm commented on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575427555 > There's a TC error about an unresolved reference to the chop method That was from a javadoc for `fastLooseChop`. It looks like `chop` was moved to StringUtils, so I moved `fastLooseChop` to the same place. And added unit tests for good measure.
[GitHub] [druid] jon-wei opened a new pull request #9201: [Backport] Update Kinesis resharding information about task failures (#9104)
jon-wei opened a new pull request #9201: [Backport] Update Kinesis resharding information about task failures (#9104) URL: https://github.com/apache/druid/pull/9201 Backport #9104 to 0.17.0
[GitHub] [druid] vogievetsky commented on issue #9190: Docs: move search to the left
vogievetsky commented on issue #9190: Docs: move search to the left URL: https://github.com/apache/druid/pull/9190#issuecomment-575421536 @fjy the [docusaurus](https://docusaurus.io/docs/en/search) template forces you to have a search in the header. Putting it in the ToC would be a lot more work. Do you think this position is better than before?
[GitHub] [druid] jon-wei commented on issue #9181: Speed up String first/last aggregators when folding isn't needed.
jon-wei commented on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575418815 There's a TC error about an resolved reference to the chop method
[GitHub] [druid] jon-wei edited a comment on issue #9181: Speed up String first/last aggregators when folding isn't needed.
jon-wei edited a comment on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575418815 There's a TC error about an unresolved reference to the chop method
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367716973

## File path: docs/ingestion/hadoop.md

@@ -149,11 +149,12 @@ For example, using the static input paths:
 ```

 You can also read from cloud storage such as AWS S3 or Google Cloud Storage.
-To do so, you need to install the necessary library under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_.
+To do so, you need to install the necessary library under Druid's classpath in _all MiddleManager or Indexer processes_.
 For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).

 ```bash
 java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/

Review comment: This should go before the java command
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367720309 ## File path: docs/development/extensions-core/hdfs.md ## @@ -94,7 +94,7 @@ For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/d Configuration for Google Cloud Storage -To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. +To use the Google Cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. Review comment: For the installation section below, I think we could point to https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md and say the following, and remove the parts where we duplicate their setup instructions: > Please follow the instructions at https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md for configuring your `core-site.xml` with the filesystem and authentication properties needed for GCS. We can also add the following (it took me a while to find a download link for the connector): > The GCS connector library is available at https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters The line below: "Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2." can be updated to "Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2."
[GitHub] [druid] gianm commented on a change in pull request #9200: Optimize JoinCondition matching
gianm commented on a change in pull request #9200: Optimize JoinCondition matching URL: https://github.com/apache/druid/pull/9200#discussion_r367719384 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java ## @@ -133,26 +142,23 @@ public String getOriginalExpression() */ public boolean isAlwaysFalse() { -return nonEquiConditions.stream() -.anyMatch(expr -> expr.isLiteral() && !expr.eval(ExprUtils.nilBindings()).asBoolean()); +return anyFalseLiteralNonEquiConditions; } /** * Return whether this condition is a constant that is always true. */ public boolean isAlwaysTrue() { -return equiConditions.isEmpty() && - nonEquiConditions.stream() -.allMatch(expr -> expr.isLiteral() && expr.eval(ExprUtils.nilBindings()).asBoolean()); +return equiConditions.isEmpty() && allTrueLiteralNonEquiConditions; Review comment: It seems like `allTrueLiteralNonEquiConditions` is only used here; how about caching `isAlwaysTrue` directly?
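The caching gianm suggests — evaluate the literal conditions once at construction time instead of re-streaming over them on every call — can be sketched roughly as follows. This is a simplified, self-contained stand-in (class and field names are invented, and every non-equi condition is modeled as a literal boolean), not the actual `JoinConditionAnalysis` code:

```java
import java.util.List;

public class ConditionAnalysisSketch {
  // Stand-in for Druid's Expr: here every non-equi condition is a literal with
  // a known boolean value, which keeps the sketch self-contained.
  static final class LiteralExpr {
    final boolean value;
    LiteralExpr(boolean value) { this.value = value; }
  }

  private final boolean isAlwaysFalse; // computed once, at construction
  private final boolean isAlwaysTrue;  // computed once, at construction

  ConditionAnalysisSketch(List<LiteralExpr> equiConditions, List<LiteralExpr> nonEquiConditions) {
    // Any literal non-equi condition that evaluates to false makes the whole
    // join condition always false.
    this.isAlwaysFalse = nonEquiConditions.stream().anyMatch(expr -> !expr.value);
    // Always true only when there are no equi conditions and every literal
    // non-equi condition evaluates to true.
    this.isAlwaysTrue = equiConditions.isEmpty()
        && nonEquiConditions.stream().allMatch(expr -> expr.value);
  }

  public boolean isAlwaysFalse() { return isAlwaysFalse; }
  public boolean isAlwaysTrue() { return isAlwaysTrue; }

  public static void main(String[] args) {
    ConditionAnalysisSketch analysis =
        new ConditionAnalysisSketch(List.of(), List.of(new LiteralExpr(true)));
    System.out.println(analysis.isAlwaysTrue());  // true
    System.out.println(analysis.isAlwaysFalse()); // false
  }
}
```

Caching the final booleans rather than intermediate flags like `allTrueLiteralNonEquiConditions` keeps each field named after exactly what the matching accessor returns, which is the naming point raised in the next comment as well.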
[GitHub] [druid] gianm commented on a change in pull request #9200: Optimize JoinCondition matching
gianm commented on a change in pull request #9200: Optimize JoinCondition matching URL: https://github.com/apache/druid/pull/9200#discussion_r367719499 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java ## @@ -133,26 +142,23 @@ public String getOriginalExpression() */ public boolean isAlwaysFalse() { -return nonEquiConditions.stream() -.anyMatch(expr -> expr.isLiteral() && !expr.eval(ExprUtils.nilBindings()).asBoolean()); +return anyFalseLiteralNonEquiConditions; Review comment: Why not call this `isAlwaysFalse`? (It looks like it isn't used anywhere else, and it seems to me to be easier to understand the meaning of the field if it's named after what we want it to mean.)
[GitHub] [druid] clintropolis merged pull request #9129: Bump Apache Parquet to 1.11.0
clintropolis merged pull request #9129: Bump Apache Parquet to 1.11.0 URL: https://github.com/apache/druid/pull/9129
[druid] branch master updated (bd49ec0 -> 486c0fd)
This is an automated email from the ASF dual-hosted git repository. cwylie pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/druid.git. from bd49ec0 Move result-to-array logic from SQL layer into QueryToolChests. (#9130) add 486c0fd Bump Apache Parquet to 1.11.0 (#9129) No new revisions were added by this update. Summary of changes: extensions-core/parquet-extensions/pom.xml | 2 +- licenses.yaml | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-)
[GitHub] [druid] jihoonson commented on a change in pull request #9199: Fix TSV bugs
jihoonson commented on a change in pull request #9199: Fix TSV bugs URL: https://github.com/apache/druid/pull/9199#discussion_r367709959 ## File path: core/src/main/java/org/apache/druid/data/input/impl/CSVParser.java ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.data.input.impl; + +import com.opencsv.RFC4180Parser; +import com.opencsv.RFC4180ParserBuilder; +import com.opencsv.enums.CSVReaderNullFieldIndicator; +import org.apache.druid.common.config.NullHandling; +import org.apache.druid.data.input.impl.DelimitedValueReader.DelimitedValueParser; + +import java.io.IOException; +import java.util.Arrays; +import java.util.List; + +public class CSVParser implements DelimitedValueParser +{ + private static final char SEPERATOR = ','; Review comment: Thanks, fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [druid] jihoonson commented on a change in pull request #9199: Fix TSV bugs
jihoonson commented on a change in pull request #9199: Fix TSV bugs URL: https://github.com/apache/druid/pull/9199#discussion_r367710195 ## File path: core/src/main/java/org/apache/druid/data/input/impl/FlatTextInputFormat.java ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.druid.data.input.impl; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableList; +import org.apache.druid.data.input.InputFormat; +import org.apache.druid.indexer.Checks; +import org.apache.druid.indexer.Property; + +import javax.annotation.Nullable; +import java.util.Collections; +import java.util.List; +import java.util.Objects; + +public abstract class FlatTextInputFormat implements InputFormat +{ + private final List columns; + private final String listDelimiter; + private final String delimiter; + private final boolean findColumnsFromHeader; + private final int skipHeaderRows; + + FlatTextInputFormat( + @Nullable List columns, + @Nullable String listDelimiter, + String delimiter, + @Nullable Boolean hasHeaderRow, + @Nullable Boolean findColumnsFromHeader, + int skipHeaderRows + ) + { +this.columns = columns == null ? Collections.emptyList() : columns; +this.listDelimiter = listDelimiter; +this.delimiter = Preconditions.checkNotNull(delimiter, "delimiter"); +//noinspection ConstantConditions +if (columns == null || columns.isEmpty()) { + this.findColumnsFromHeader = Checks.checkOneNotNullOrEmpty( + ImmutableList.of( + new Property<>("hasHeaderRow", hasHeaderRow), + new Property<>("findColumnsFromHeader", findColumnsFromHeader) + ) + ).getValue(); +} else { + this.findColumnsFromHeader = false; +} +this.skipHeaderRows = skipHeaderRows; +Preconditions.checkArgument( +!delimiter.equals(listDelimiter), +"Cannot have same delimiter and list delimiter of [%s]", +delimiter +); +if (!this.columns.isEmpty()) { + for (String column : this.columns) { +Preconditions.checkArgument( Review comment: Hmm, I'm not sure why we do this check.. I guess it wouldn't harm anything if the column name contains the delimiter. Maybe we can remove this check later. This is an automated message from the Apache Git Service. 
[GitHub] [druid] jon-wei commented on a change in pull request #9199: Fix TSV bugs
jon-wei commented on a change in pull request #9199: Fix TSV bugs URL: https://github.com/apache/druid/pull/9199#discussion_r367707521 ## File path: core/src/main/java/org/apache/druid/data/input/impl/FlatTextInputFormat.java ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.druid.data.input.impl; + +import com.fasterxml.jackson.annotation.JsonProperty; +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableList; +import org.apache.druid.data.input.InputFormat; +import org.apache.druid.indexer.Checks; +import org.apache.druid.indexer.Property; + +import javax.annotation.Nullable; +import java.util.Collections; +import java.util.List; +import java.util.Objects; + +public abstract class FlatTextInputFormat implements InputFormat +{ + private final List columns; + private final String listDelimiter; + private final String delimiter; + private final boolean findColumnsFromHeader; + private final int skipHeaderRows; + + FlatTextInputFormat( + @Nullable List columns, + @Nullable String listDelimiter, + String delimiter, + @Nullable Boolean hasHeaderRow, + @Nullable Boolean findColumnsFromHeader, + int skipHeaderRows + ) + { +this.columns = columns == null ? Collections.emptyList() : columns; +this.listDelimiter = listDelimiter; +this.delimiter = Preconditions.checkNotNull(delimiter, "delimiter"); +//noinspection ConstantConditions +if (columns == null || columns.isEmpty()) { + this.findColumnsFromHeader = Checks.checkOneNotNullOrEmpty( + ImmutableList.of( + new Property<>("hasHeaderRow", hasHeaderRow), + new Property<>("findColumnsFromHeader", findColumnsFromHeader) + ) + ).getValue(); +} else { + this.findColumnsFromHeader = false; +} +this.skipHeaderRows = skipHeaderRows; +Preconditions.checkArgument( +!delimiter.equals(listDelimiter), +"Cannot have same delimiter and list delimiter of [%s]", +delimiter +); +if (!this.columns.isEmpty()) { + for (String column : this.columns) { +Preconditions.checkArgument( Review comment: Does this need to check for `listDelimiter` in the column names as well? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
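The validation jon-wei asks about — rejecting column names that contain the field delimiter, extended to cover the list delimiter as well — might look something like the following. This is a minimal stdlib sketch with invented names, not Druid's actual `FlatTextInputFormat` code:

```java
import java.util.List;

public class ColumnNameValidation {
  // Reject any column name containing either the field delimiter or the
  // (optional) list delimiter, since such a name could never round-trip
  // through a delimited text row.
  static void validateColumns(List<String> columns, String delimiter, String listDelimiter) {
    for (String column : columns) {
      if (column.contains(delimiter)) {
        throw new IllegalArgumentException(
            "Column[" + column + "] cannot contain the delimiter[" + delimiter + "]");
      }
      if (listDelimiter != null && column.contains(listDelimiter)) {
        throw new IllegalArgumentException(
            "Column[" + column + "] cannot contain the list delimiter[" + listDelimiter + "]");
      }
    }
  }

  public static void main(String[] args) {
    validateColumns(List.of("ts", "page"), "\t", "|"); // passes silently
    try {
      validateColumns(List.of("a|b"), "\t", "|");      // rejected: contains list delimiter
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```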
[GitHub] [druid] jon-wei commented on a change in pull request #9199: Fix TSV bugs
jon-wei commented on a change in pull request #9199: Fix TSV bugs URL: https://github.com/apache/druid/pull/9199#discussion_r367704911 ## File path: core/src/main/java/org/apache/druid/data/input/impl/CSVParser.java ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.data.input.impl; + +import com.opencsv.RFC4180Parser; +import com.opencsv.RFC4180ParserBuilder; +import com.opencsv.enums.CSVReaderNullFieldIndicator; +import org.apache.druid.common.config.NullHandling; +import org.apache.druid.data.input.impl.DelimitedValueReader.DelimitedValueParser; + +import java.io.IOException; +import java.util.Arrays; +import java.util.List; + +public class CSVParser implements DelimitedValueParser +{ + private static final char SEPERATOR = ','; Review comment: SEPERATOR -> SEPARATOR This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367707446 ## File path: docs/development/extensions-core/hdfs.md ## @@ -36,49 +36,110 @@ To use this Apache Druid extension, make sure to [include](../../development/ext |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty| |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty| -If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work. +Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`) +in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`. + +If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work. If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically. -### Configuration for Google Cloud Storage +### Configuration for Cloud Storage + +You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS. + + Configuration for AWS S3 -The HDFS extension can also be used for GCS as deep storage. +To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly. 
|Property|Possible Values|Description|Default| ||---|---|---| -|`druid.storage.type`|hdfs||Must be set.| -|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.| +|`druid.storage.type`|hdfs| |Must be set.| +|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.| -All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in /lib/ and /extensions/druid-hdfs-storage/ +You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath. +Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes. -Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2. - - +```bash +java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"; +cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/ +``` -## Native batch ingestion +Finally, you need to add the below properties in the `core-site.xml`. +For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). + +```xml + + fs.s3a.impl + org.apache.hadoop.fs.s3a.S3AFileSystem + The implementation class of the S3A Filesystem + + + + fs.AbstractFileSystem.s3a.impl + org.apache.hadoop.fs.s3a.S3A + The implementation class of the S3A AbstractFileSystem. + + + + fs.s3a.access.key + AWS access key ID. Omit for IAM role-based or provider-based authentication. + your access key + + + + fs.s3a.secret.key + AWS secret key. 
Omit for IAM role-based or provider-based authentication. + your secret key + +``` -This firehose ingests events from a predefined list of files from a Hadoop filesystem. -This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task). -Since each split represents an HDFS file, each worker task of `index_parallel` will read an object. + Configuration for Google Cloud Storage -Sample spec: +To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. Review comment: Thanks, fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367707404 ## File path: docs/development/modules.md ## @@ -148,29 +150,43 @@ To start a segment killing task, you need to access the old Coordinator console After the killing task ends, `index.zip` (`partitionNum_index.zip` for HDFS data storage) file should be deleted from the data storage. -### Adding a new Firehose +### Adding support for a new input source -There is an example of this in the `s3-extensions` module with the StaticS3FirehoseFactory. +Adding support for a new input source requires to implement three interfaces, i.e., `InputSource`, `InputEntity`, and `InputSourceReader`. +`InputSource` is to define where the input data is stored. `InputEntity` is to define how data can be read in parallel +in [native parallel indexing](../ingestion/native-batch.md). +`InputSourceReader` defines how to read your new input source and you can simply use the provided `InputEntityIteratingReader` in most cases. -Adding a Firehose is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation +There is an example of this in the `druid-s3-extensions` module with the `S3InputSource` and `S3Entity`. + +Adding an InputSource is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation ``` java @Override public List getJacksonModules() { return ImmutableList.of( - new SimpleModule().registerSubtypes(new NamedType(StaticS3FirehoseFactory.class, "static-s3")) + new SimpleModule().registerSubtypes(new NamedType(S3InputSource.class, "s3")) ); } ``` -This is registering the FirehoseFactory with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"firehose": { "type": "static-s3", ... 
}` in your realtime config, then the system will load this FirehoseFactory for your firehose. +This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"inputSource": { "type": "s3", ... }` in your IO config, then the system will load this InputSource for your `InputSource` implementation. + +Note that inside of Druid, we have made the @JacksonInject annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a @JacksonInject annotation on a setter and it will get set on instantiation. Review comment: Added. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
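The polymorphic registration described above can be demonstrated outside Druid. The following is a hypothetical, minimal Jackson example — the classes `MyInputSource` and `MyS3InputSource` are invented for illustration, not Druid's actual `InputSource` hierarchy — showing how `registerSubtypes` binds a `"type"` name to a concrete class:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;

public class PolymorphicRegistrationSketch {
  // Base type: the "type" field in the JSON selects the concrete subtype,
  // mirroring how Druid's InputSource hierarchy is declared.
  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
  interface MyInputSource {}

  static class MyS3InputSource implements MyInputSource {
    final String bucket;

    @JsonCreator
    MyS3InputSource(@JsonProperty("bucket") String bucket) {
      this.bucket = bucket;
    }
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // The equivalent of an extension's getJacksonModules(): bind the
    // name "s3" to the concrete class.
    mapper.registerModule(
        new SimpleModule().registerSubtypes(new NamedType(MyS3InputSource.class, "s3"))
    );
    MyInputSource source =
        mapper.readValue("{\"type\": \"s3\", \"bucket\": \"my-bucket\"}", MyInputSource.class);
    System.out.println(source.getClass().getSimpleName());
  }
}
```

With the module registered, Jackson resolves `"type": "s3"` to `MyS3InputSource` without the base interface ever naming its subtypes, which is what lets extensions plug in new input sources.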
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367707398 ## File path: docs/development/modules.md ## @@ -148,29 +150,43 @@ To start a segment killing task, you need to access the old Coordinator console After the killing task ends, `index.zip` (`partitionNum_index.zip` for HDFS data storage) file should be deleted from the data storage. -### Adding a new Firehose +### Adding support for a new input source -There is an example of this in the `s3-extensions` module with the StaticS3FirehoseFactory. +Adding support for a new input source requires to implement three interfaces, i.e., `InputSource`, `InputEntity`, and `InputSourceReader`. +`InputSource` is to define where the input data is stored. `InputEntity` is to define how data can be read in parallel +in [native parallel indexing](../ingestion/native-batch.md). +`InputSourceReader` defines how to read your new input source and you can simply use the provided `InputEntityIteratingReader` in most cases. -Adding a Firehose is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation +There is an example of this in the `druid-s3-extensions` module with the `S3InputSource` and `S3Entity`. + +Adding an InputSource is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation ``` java @Override public List getJacksonModules() { return ImmutableList.of( - new SimpleModule().registerSubtypes(new NamedType(StaticS3FirehoseFactory.class, "static-s3")) + new SimpleModule().registerSubtypes(new NamedType(S3InputSource.class, "s3")) ); } ``` -This is registering the FirehoseFactory with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"firehose": { "type": "static-s3", ... 
}` in your realtime config, then the system will load this FirehoseFactory for your firehose. +This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"inputSource": { "type": "s3", ... }` in your IO config, then the system will load this InputSource for your `InputSource` implementation. + +Note that inside of Druid, we have made the @JacksonInject annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a @JacksonInject annotation on a setter and it will get set on instantiation. + +### Adding support for a new data format + +Adding support for a new data format requires to implement two interfaces, i.e., `InputFormat` and `InputEntityReader`. Review comment: Fixed, thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367707455

## File path: docs/development/extensions-core/hdfs.md

## @@ -36,49 +36,110 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
 
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 ||---|---|---|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in /lib/ and /extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-
+```bash
+java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage

Review comment: I added `google.cloud.auth.service.account.enable` property. Haven't checked how it works, but just copied from https://github.com/GoogleCloudDataproc/bigdata-interop/blob/master/gcs/INSTALL.md.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[GitHub] [druid] lgtm-com[bot] commented on issue #9181: Speed up String first/last aggregators when folding isn't needed.
lgtm-com[bot] commented on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575400116 This pull request **fixes 1 alert** when merging c56d895caf30f0b3171ea5cc09615e551adeeae4 into 42359c93dd53f16e52ed79dcd8b63829f4bf2f7b - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-60ba177285eee825f576c2665f6a7661b4aff17a) **fixed alerts:** * 1 for Useless null check
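The "Useless null check" alert mentioned above fires when a value is null-checked after it has already been dereferenced, so the check can never matter. A minimal illustration of the pattern (hypothetical code, not the actual Druid aggregator change):

```java
public class NullCheckExample {
  // Flagged pattern: 's' is dereferenced first, so the null check below is useless.
  public static int flaggedLength(String s) {
    int n = s.length();   // would already throw NullPointerException if s were null...
    if (s == null) {      // ...so this branch is unreachable and the analyzer flags it
      return -1;
    }
    return n;
  }

  // Fixed pattern: check before the first dereference.
  public static int fixedLength(String s) {
    if (s == null) {
      return -1;
    }
    return s.length();
  }
}
```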
[druid] branch master updated (bfcb30e -> bd49ec0)
This is an automated email from the ASF dual-hosted git repository.

gian pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git.

 from bfcb30e  Add javadocs and small improvements to join code. (#9196)
  add bd49ec0  Move result-to-array logic from SQL layer into QueryToolChests. (#9130)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/druid/query/BaseQuery.java     |   1 +
 .../main/java/org/apache/druid/query/Query.java    |   1 +
 .../org/apache/druid/query/QueryToolChest.java     |  49 +++-
 .../query/groupby/GroupByQueryQueryToolChest.java  |  11 +
 .../apache/druid/query/scan/ScanQueryEngine.java   |   6 +-
 .../druid/query/scan/ScanQueryQueryToolChest.java  |  75 ++
 .../timeseries/TimeseriesQueryQueryToolChest.java  |  43
 .../druid/query/topn/TopNQueryQueryToolChest.java  |  49
 .../druid/query/QueryToolChestTestHelper.java}     |  18 +-
 .../groupby/GroupByQueryQueryToolChestTest.java    | 109 +
 .../query/scan/ScanQueryQueryToolChestTest.java    | 205 +
 .../TimeseriesQueryQueryToolChestTest.java         |  64 +-
 .../query/topn/TopNQueryQueryToolChestTest.java    |  72 ++
 .../org/apache/druid/server/QueryLifecycle.java    |   1 +
 .../sql/calcite/expression/SimpleExtraction.java   |  28 ++-
 .../apache/druid/sql/calcite/rel/QueryMaker.java   | 254 +++--
 16 files changed, 789 insertions(+), 197 deletions(-)
 copy processing/src/{main/java/org/apache/druid/query/NoopQueryRunner.java => test/java/org/apache/druid/query/QueryToolChestTestHelper.java} (65%)
 create mode 100644 processing/src/test/java/org/apache/druid/query/scan/ScanQueryQueryToolChestTest.java
[GitHub] [druid] gianm merged pull request #9130: Move result-to-array logic from SQL layer into QueryToolChests.
gianm merged pull request #9130: Move result-to-array logic from SQL layer into QueryToolChests. URL: https://github.com/apache/druid/pull/9130
[GitHub] [druid] suneet-s opened a new pull request #9200: Optimize JoinCondition matching
suneet-s opened a new pull request #9200: Optimize JoinCondition matching
URL: https://github.com/apache/druid/pull/9200

### Description

The LookupJoinMatcher needs to check whether a condition is always true or always false multiple times. This can be pre-computed to speed up the match checking.

This change reduces the time it takes to perform a join on a long key from ~36 ms/op to ~23 ms/op.

![Screen Shot 2020-01-16 at 3 34 16 PM](https://user-images.githubusercontent.com/44787917/72571945-e6a31d00-3875-11ea-8f88-6cecc8a9ee1b.png)

This PR has:
- [ ] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
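The optimization described in the PR above — computing once, up front, whether a join condition is always true or always false instead of re-deriving it on every match call — can be sketched as follows. These are hypothetical classes for illustration, not Druid's actual `LookupJoinMatcher` or `JoinConditionAnalysis`:

```java
import java.util.List;

public class PrecomputedConditionSketch {
  /** A join condition over a row key; constant conditions do not depend on the key. */
  public interface Condition {
    boolean isConstant();
    boolean constantValue(); // meaningful only when isConstant() is true
    boolean test(long key);
  }

  public static final class Matcher {
    private final List<Condition> conditions;
    private final boolean alwaysTrue;  // precomputed once in the constructor...
    private final boolean alwaysFalse; // ...instead of on every matches() call

    public Matcher(List<Condition> conditions) {
      this.conditions = conditions;
      this.alwaysTrue = conditions.stream().allMatch(c -> c.isConstant() && c.constantValue());
      this.alwaysFalse = conditions.stream().anyMatch(c -> c.isConstant() && !c.constantValue());
    }

    public boolean matches(long key) {
      // Fast paths: no per-row condition analysis needed.
      if (alwaysTrue) {
        return true;
      }
      if (alwaysFalse) {
        return false;
      }
      for (Condition c : conditions) {
        if (!c.test(key)) {
          return false;
        }
      }
      return true;
    }
  }

  /** A simple non-constant equality condition on the key. */
  public static Condition keyEquals(long expected) {
    return new Condition() {
      @Override public boolean isConstant() { return false; }
      @Override public boolean constantValue() { throw new IllegalStateException("not constant"); }
      @Override public boolean test(long key) { return key == expected; }
    };
  }

  /** A condition that is constant-true or constant-false (e.g. "1 = 1"). */
  public static Condition constant(boolean value) {
    return new Condition() {
      @Override public boolean isConstant() { return true; }
      @Override public boolean constantValue() { return value; }
      @Override public boolean test(long key) { return value; }
    };
  }
}
```

The design point is simply that the constant-ness of a condition cannot change between rows, so hoisting the check out of the per-row loop trades two booleans of state for work on every match call.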
[GitHub] [druid] jihoonson opened a new pull request #9199: Fix TSV bugs
jihoonson opened a new pull request #9199: Fix TSV bugs
URL: https://github.com/apache/druid/pull/9199

Fixes https://github.com/apache/druid/issues/9156, #9177, and #9188.

This PR has:
- [x] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367702186

## File path: docs/ingestion/hadoop.md

## @@ -145,7 +145,51 @@ A type of inputSpec where a static path to the data files is provided.
 
 For example, using the static input paths:
 
 ```
-"paths" : "s3n://billy-bucket/the/data/is/here/data.gz,s3n://billy-bucket/the/data/is/here/moredata.gz,s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
+"paths" : "hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz"
+```
+
+You can also read from cloud storage such as AWS S3 or Google Cloud Storage.
+To do so, you need to install the necessary library under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_.

Review comment: Noting here that `${DRUID_HOME}/hadoop-dependencies` doesn't work for this since the HDFS extension needs these libraries on the peon startup
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367697988

## File path: docs/development/extensions-core/hdfs.md

## @@ -36,49 +36,110 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
 
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 ||---|---|---|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in /lib/ and /extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-
+```bash
+java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage
 
-Sample spec:
+To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly.

Review comment: Google cloud Storage -> Google Cloud Storage
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367700282

## File path: docs/development/modules.md

## @@ -148,29 +150,43 @@ To start a segment killing task, you need to access the old Coordinator console
 
 After the killing task ends, `index.zip` (`partitionNum_index.zip` for HDFS data storage) file should be deleted from the data storage.
 
-### Adding a new Firehose
+### Adding support for a new input source
 
-There is an example of this in the `s3-extensions` module with the StaticS3FirehoseFactory.
+Adding support for a new input source requires to implement three interfaces, i.e., `InputSource`, `InputEntity`, and `InputSourceReader`.
+`InputSource` is to define where the input data is stored. `InputEntity` is to define how data can be read in parallel
+in [native parallel indexing](../ingestion/native-batch.md).
+`InputSourceReader` defines how to read your new input source and you can simply use the provided `InputEntityIteratingReader` in most cases.
 
-Adding a Firehose is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation
+There is an example of this in the `druid-s3-extensions` module with the `S3InputSource` and `S3Entity`.
+
+Adding an InputSource is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation
 
 ``` java
 @Override
 public List<? extends Module> getJacksonModules()
 {
   return ImmutableList.of(
-      new SimpleModule().registerSubtypes(new NamedType(StaticS3FirehoseFactory.class, "static-s3"))
+      new SimpleModule().registerSubtypes(new NamedType(S3InputSource.class, "s3"))
   );
 }
 ```
 
-This is registering the FirehoseFactory with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"firehose": { "type": "static-s3", ... }` in your realtime config, then the system will load this FirehoseFactory for your firehose.
+This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"inputSource": { "type": "s3", ... }` in your IO config, then the system will load this InputSource for your `InputSource` implementation.
+
+Note that inside of Druid, we have made the @JacksonInject annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a @JacksonInject annotation on a setter and it will get set on instantiation.

Review comment: suggest putting backticks around `@JacksonInject`
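The registration quoted in the diff above maps a `"type"` string in JSON onto a concrete class. Druid delegates that dispatch to Jackson's `NamedType` machinery; the same idea can be sketched without Jackson as a plain type registry (all names here — `InputSource`, `S3Source` — are illustrative stand-ins, not the real Druid classes):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class InputSourceRegistrySketch {
  /** Stand-in for the InputSource interface. */
  public interface InputSource {
    String describe();
  }

  /** Stand-in for S3InputSource; holds the URI it would read from. */
  public static final class S3Source implements InputSource {
    private final String uri;

    public S3Source(String uri) {
      this.uri = uri;
    }

    @Override
    public String describe() {
      return "s3 source reading " + uri;
    }
  }

  // type name -> factory; analogous to registerSubtypes(new NamedType(S3InputSource.class, "s3"))
  private final Map<String, Function<String, InputSource>> registry = new HashMap<>();

  public void registerSubtype(String typeName, Function<String, InputSource> factory) {
    registry.put(typeName, factory);
  }

  /** Roughly what resolving {"inputSource": {"type": "s3", ...}} boils down to. */
  public InputSource create(String typeName, String arg) {
    Function<String, InputSource> factory = registry.get(typeName);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown inputSource type: " + typeName);
    }
    return factory.apply(arg);
  }
}
```

Usage mirrors the extension module: the extension registers its subtype once at startup, and deserialization later looks the factory up by the `"type"` discriminator.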
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367700438

## File path: docs/development/modules.md

## @@ -148,29 +150,43 @@ To start a segment killing task, you need to access the old Coordinator console
 
 After the killing task ends, `index.zip` (`partitionNum_index.zip` for HDFS data storage) file should be deleted from the data storage.
 
-### Adding a new Firehose
+### Adding support for a new input source
 
-There is an example of this in the `s3-extensions` module with the StaticS3FirehoseFactory.
+Adding support for a new input source requires to implement three interfaces, i.e., `InputSource`, `InputEntity`, and `InputSourceReader`.
+`InputSource` is to define where the input data is stored. `InputEntity` is to define how data can be read in parallel
+in [native parallel indexing](../ingestion/native-batch.md).
+`InputSourceReader` defines how to read your new input source and you can simply use the provided `InputEntityIteratingReader` in most cases.
 
-Adding a Firehose is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation
+There is an example of this in the `druid-s3-extensions` module with the `S3InputSource` and `S3Entity`.
+
+Adding an InputSource is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation
 
 ``` java
 @Override
 public List<? extends Module> getJacksonModules()
 {
   return ImmutableList.of(
-      new SimpleModule().registerSubtypes(new NamedType(StaticS3FirehoseFactory.class, "static-s3"))
+      new SimpleModule().registerSubtypes(new NamedType(S3InputSource.class, "s3"))
   );
 }
 ```
 
-This is registering the FirehoseFactory with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"firehose": { "type": "static-s3", ... }` in your realtime config, then the system will load this FirehoseFactory for your firehose.
+This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify a `"inputSource": { "type": "s3", ... }` in your IO config, then the system will load this InputSource for your `InputSource` implementation.
+
+Note that inside of Druid, we have made the @JacksonInject annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a @JacksonInject annotation on a setter and it will get set on instantiation.
+
+### Adding support for a new data format
+
+Adding support for a new data format requires to implement two interfaces, i.e., `InputFormat` and `InputEntityReader`.

Review comment: Suggest the following "requires to implement two interfaces, i.e.," -> "requires implementing two interfaces: "
[druid] branch master updated (42359c9 -> bfcb30e)
This is an automated email from the ASF dual-hosted git repository.

gian pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git.

 from 42359c9  Implement ANY aggregator (#9187)
  add bfcb30e  Add javadocs and small improvements to join code. (#9196)

No new revisions were added by this update.

Summary of changes:
 .../druid/segment/ColumnProcessorFactory.java      |  3 ++
 .../apache/druid/segment/join/HashJoinEngine.java  |  7 ++--
 .../druid/segment/join/JoinConditionAnalysis.java  |  8 +
 .../apache/druid/segment/join/JoinableClause.java  | 10 +-
 .../join/PossiblyNullColumnValueSelector.java      |  4 +++
 .../druid/segment/join/table/IndexedTable.java     | 40 ++
 .../join/table/IndexedTableJoinMatcher.java        |  2 +-
 7 files changed, 70 insertions(+), 4 deletions(-)
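One file touched by the commit above is `PossiblyNullColumnValueSelector`: in a hash join, selectors on the right-hand side must report null for rows that found no match (the null-extended rows of an outer join). A toy version of that wrapping, with assumed semantics rather than the real class's interface:

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class PossiblyNullSelectorSketch {
  /**
   * Wraps a value supplier so it yields null whenever the current join row
   * has no match on this side, and delegates to the underlying selector otherwise.
   */
  public static <T> Supplier<T> possiblyNull(BooleanSupplier rowMatched, Supplier<T> delegate) {
    return () -> rowMatched.getAsBoolean() ? delegate.get() : null;
  }
}
```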
[GitHub] [druid] gianm merged pull request #9196: Add javadocs and small improvements to join code.
gianm merged pull request #9196: Add javadocs and small improvements to join code. URL: https://github.com/apache/druid/pull/9196
[GitHub] [druid] jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
jon-wei commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367692944

## File path: docs/development/extensions-core/hdfs.md

## @@ -36,49 +36,110 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
 
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 ||---|---|---|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in /lib/ and /extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-
+```bash
+java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage

Review comment: Is there authentication configuration needed for accessing GCS? Could add that in a follow-on PR if so.
[GitHub] [druid] jon-wei commented on issue #9169: Docker-compose.yml broken after de-incubation cleanup
jon-wei commented on issue #9169: Docker-compose.yml broken after de-incubation cleanup URL: https://github.com/apache/druid/issues/9169#issuecomment-575382444 @nh43de Thanks for the report, we're in the process of migrating to the new repo: https://issues.apache.org/jira/browse/INFRA-19648
[druid] branch master updated (a87db7f -> 42359c9)
This is an automated email from the ASF dual-hosted git repository.

jonwei pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git.

 from a87db7f  Add HashJoinSegment, a virtual segment for joins. (#9111)
  add 42359c9  Implement ANY aggregator (#9187)

No new revisions were added by this update.

Summary of changes:
 .../apache/druid/java/util/common/StringUtils.java |  22 +++
 docs/querying/aggregations.md                      |  55 ++
 docs/querying/sql.md                               |   4 +
 .../apache/druid/jackson/AggregatorsModule.java    |  10 +-
 .../druid/query/aggregation/AggregatorUtil.java    |   6 +
 .../DoubleAnyAggregator.java}                      |  50 ++---
 .../DoubleAnyAggregatorFactory.java}               |  59 +++---
 .../DoubleAnyBufferAggregator.java}                |  47 ++---
 .../FloatAnyAggregator.java}                       |  45 ++---
 .../FloatAnyAggregatorFactory.java}                |  61 +++
 .../FloatAnyBufferAggregator.java}                 |  47 ++---
 .../LongAnyAggregator.java}                        |  48 ++---
 .../LongAnyAggregatorFactory.java}                 |  59 +++---
 .../LongAnyBufferAggregator.java}                  |  47 ++---
 .../query/aggregation/any/StringAnyAggregator.java |  82 +
 .../StringAnyAggregatorFactory.java}               |  65 +++
 .../aggregation/any/StringAnyBufferAggregator.java | 102 +++
 .../aggregation/first/StringFirstAggregator.java   |   3 +-
 .../aggregation/first/StringFirstLastUtils.java    |  14 --
 .../aggregation/last/StringLastAggregator.java     |   3 +-
 ...or.java => EarliestLatestAnySqlAggregator.java} |  59 --
 .../aggregation/builtin/SimpleSqlAggregator.java   |   7 +-
 .../sql/calcite/planner/DruidOperatorTable.java    |   7 +-
 .../apache/druid/sql/calcite/CalciteQueryTest.java | 202 +
 24 files changed, 791 insertions(+), 313 deletions(-)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{first/DoubleFirstAggregator.java => any/DoubleAnyAggregator.java} (55%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{DoubleMaxAggregatorFactory.java => any/DoubleAnyAggregatorFactory.java} (65%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{last/DoubleLastBufferAggregator.java => any/DoubleAnyBufferAggregator.java} (56%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{last/FloatLastAggregator.java => any/FloatAnyAggregator.java} (56%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{FloatMinAggregatorFactory.java => any/FloatAnyAggregatorFactory.java} (65%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{first/FloatFirstBufferAggregator.java => any/FloatAnyBufferAggregator.java} (56%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{first/LongFirstAggregator.java => any/LongAnyAggregator.java} (56%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{LongMaxAggregatorFactory.java => any/LongAnyAggregatorFactory.java} (65%)
 copy processing/src/main/java/org/apache/druid/query/aggregation/{first/LongFirstBufferAggregator.java => any/LongAnyBufferAggregator.java} (56%)
 create mode 100644 processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyAggregator.java
 copy processing/src/main/java/org/apache/druid/query/aggregation/{last/StringLastAggregatorFactory.java => any/StringAnyAggregatorFactory.java} (68%)
 create mode 100644 processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyBufferAggregator.java
 rename sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/{EarliestLatestSqlAggregator.java => EarliestLatestAnySqlAggregator.java} (77%)
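The ANY aggregator added in the commit above may return any value of the column, which frees the implementation from comparing or ordering values. A minimal sketch of that fold — keeping the first value seen and ignoring the rest — under the assumption that "first seen" is a valid choice of "any" (this is illustrative, not Druid's buffer-based implementation):

```java
public class AnyAggregatorSketch {
  private boolean hasValue = false;
  private String value = null;

  /** Keep the first value encountered; later values are ignored, which is what makes ANY cheap. */
  public void aggregate(String v) {
    if (!hasValue) {
      value = v;
      hasValue = true;
    }
  }

  /** The aggregated result: some value of the column, or null if no rows were aggregated. */
  public String get() {
    return value;
  }
}
```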
[GitHub] [druid] jon-wei merged pull request #9187: Implement ANY aggregator
jon-wei merged pull request #9187: Implement ANY aggregator URL: https://github.com/apache/druid/pull/9187
[GitHub] [druid] lgtm-com[bot] commented on issue #9181: Speed up String first/last aggregators when folding isn't needed.
lgtm-com[bot] commented on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575378497 This pull request **fixes 1 alert** when merging 92f2218cf771c70b1173264e96621850bead8ea8 into a87db7f353cdee4dfa9b541063f59d67706d1b07 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-cd9a4712d6ee17be67a5574f81c981254ad0052b) **fixed alerts:** * 1 for Useless null check
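For readers unfamiliar with the LGTM alert class named above: a "useless null check" is a null test that can never fire, most often because the value was already dereferenced earlier. The sketch below is illustrative of the alert pattern only — the class and method names are hypothetical, not the actual Druid code that the alert pointed at.

```java
public class UselessNullCheckDemo
{
  // A "useless null check": the argument is dereferenced first, so if it
  // were null we would already have thrown a NullPointerException, and the
  // later null test is unreachable dead code. Static analyzers flag this
  // because either the check is pointless or the early dereference is the bug.
  static int lengthBuggy(String s)
  {
    int len = s.length();   // throws NPE when s is null...
    if (s == null) {        // ...so this branch can never be taken
      return 0;
    }
    return len;
  }

  // The cleaned-up version tests for null before any dereference.
  static int lengthFixed(String s)
  {
    if (s == null) {
      return 0;
    }
    return s.length();
  }

  public static void main(String[] args)
  {
    System.out.println(lengthFixed("druid")); // 5
    System.out.println(lengthFixed(null));    // 0
  }
}
```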
[GitHub] [druid] vogievetsky opened a new pull request #9198: Web console: fix bug where arrays can not be emptied out in the coordinator dialog
vogievetsky opened a new pull request #9198: Web console: fix bug where arrays can not be emptied out in the coordinator dialog URL: https://github.com/apache/druid/pull/9198 Allow the defining of specific empty values in the AutoForm
[druid] branch 0.17.0 updated: Fix deserialization of maxBytesInMemory (#9092) (#9170)
This is an automated email from the ASF dual-hosted git repository.

cwylie pushed a commit to branch 0.17.0 in repository https://gitbox.apache.org/repos/asf/druid.git

The following commit(s) were added to refs/heads/0.17.0 by this push:
     new e6246c9  Fix deserialization of maxBytesInMemory (#9092) (#9170)

e6246c9 is described below

commit e6246c96f7cce9f7d3b5d17ca2cf27a7963eddc3
Author: Clint Wylie
AuthorDate: Thu Jan 16 13:47:11 2020 -0800

    Fix deserialization of maxBytesInMemory (#9092) (#9170)

    * Fix deserialization of maxBytesInMemory
    * Add maxBytes check

    Co-authored-by: Atul Mohan
---
 .../indexing/common/index/RealtimeAppenderatorTuningConfig.java   | 1 +
 .../java/org/apache/druid/indexing/common/task/TaskSerdeTest.java | 6 +-
 .../org/apache/druid/segment/indexing/RealtimeTuningConfig.java   | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/indexing-service/src/main/java/org/apache/druid/indexing/common/index/RealtimeAppenderatorTuningConfig.java b/indexing-service/src/main/java/org/apache/druid/indexing/common/index/RealtimeAppenderatorTuningConfig.java
index b66ccc8..eec9b98 100644
--- a/indexing-service/src/main/java/org/apache/druid/indexing/common/index/RealtimeAppenderatorTuningConfig.java
+++ b/indexing-service/src/main/java/org/apache/druid/indexing/common/index/RealtimeAppenderatorTuningConfig.java
@@ -143,6 +143,7 @@ public class RealtimeAppenderatorTuningConfig implements TuningConfig, Appendera
   }

   @Override
+  @JsonProperty
   public long getMaxBytesInMemory()
   {
     return maxBytesInMemory;

diff --git a/indexing-service/src/test/java/org/apache/druid/indexing/common/task/TaskSerdeTest.java b/indexing-service/src/test/java/org/apache/druid/indexing/common/task/TaskSerdeTest.java
index 2ba37ff..c5841ea 100644
--- a/indexing-service/src/test/java/org/apache/druid/indexing/common/task/TaskSerdeTest.java
+++ b/indexing-service/src/test/java/org/apache/druid/indexing/common/task/TaskSerdeTest.java
@@ -394,7 +394,7 @@ public class TaskSerdeTest
         new RealtimeTuningConfig(
             1,
-            null,
+            10L,
             new Period("PT10M"),
             null,
             null,
@@ -446,6 +446,10 @@ public class TaskSerdeTest
         task2.getRealtimeIngestionSchema().getTuningConfig().getWindowPeriod()
     );
     Assert.assertEquals(
+        task.getRealtimeIngestionSchema().getTuningConfig().getMaxBytesInMemory(),
+        task2.getRealtimeIngestionSchema().getTuningConfig().getMaxBytesInMemory()
+    );
+    Assert.assertEquals(
         task.getRealtimeIngestionSchema().getDataSchema().getGranularitySpec().getSegmentGranularity(),
         task2.getRealtimeIngestionSchema().getDataSchema().getGranularitySpec().getSegmentGranularity()
     );

diff --git a/server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java b/server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java
index a467944..728e2ff 100644
--- a/server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java
+++ b/server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java
@@ -174,6 +174,7 @@ public class RealtimeTuningConfig implements TuningConfig, AppenderatorConfig
   }

   @Override
+  @JsonProperty
   public long getMaxBytesInMemory()
   {
     return maxBytesInMemory;
[GitHub] [druid] clintropolis merged pull request #9170: [Backport] Fix deserialization of maxBytesInMemory
clintropolis merged pull request #9170: [Backport] Fix deserialization of maxBytesInMemory URL: https://github.com/apache/druid/pull/9170
[GitHub] [druid] maytasm3 opened a new pull request #9197: Fix LATEST / EARLIEST Buffer Aggregator does not work on String column
maytasm3 opened a new pull request #9197: Fix LATEST / EARLIEST Buffer Aggregator does not work on String column URL: https://github.com/apache/druid/pull/9197

### Description

The LATEST / EARLIEST buffer aggregators were not working on String columns because of an incorrectly set limit on the buffer when storing the string: the limit did not account for the offset at which the string write begins.

This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths.
- [ ] added integration tests.
- [x] been tested in a test Druid cluster.
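To make the described bug class concrete: when an aggregator writes into a shared ByteBuffer at a non-zero position, its write window is [position, position + maxBytes), so a limit computed from maxBytes alone silently shrinks (or invalidates) the window whenever position > 0. The sketch below shows the correct limit arithmetic; `BufferOffsetDemo` and its methods are hypothetical illustrations, not the Druid aggregator code changed by this PR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferOffsetDemo
{
  // Writes up to maxBytes of UTF-8 into a shared buffer starting at
  // "position". The limit must be position + maxBytes; using just
  // maxBytes (the bug class described in the PR) would leave only
  // (maxBytes - position) usable bytes, truncating longer strings.
  static void writeStringAt(ByteBuffer buf, int position, String s, int maxBytes)
  {
    ByteBuffer dup = buf.duplicate();   // don't disturb the shared buffer's cursor
    dup.position(position);
    dup.limit(position + maxBytes);     // correct: offset + slot size, not slot size alone
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    dup.put(utf8, 0, Math.min(utf8.length, maxBytes));
  }

  // Reads numBytes back from "position" for verification.
  static String readStringAt(ByteBuffer buf, int position, int numBytes)
  {
    byte[] out = new byte[numBytes];
    ByteBuffer dup = buf.duplicate();
    dup.position(position);
    dup.get(out);
    return new String(out, StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    ByteBuffer buf = ByteBuffer.allocate(64);
    writeStringAt(buf, 10, "druid", 16);          // aggregation slot starts at offset 10
    System.out.println(readStringAt(buf, 10, 5)); // prints "druid"
  }
}
```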
[GitHub] [druid] gianm commented on a change in pull request #9130: Move result-to-array logic from SQL layer into QueryToolChests.
gianm commented on a change in pull request #9130: Move result-to-array logic from SQL layer into QueryToolChests. URL: https://github.com/apache/druid/pull/9130#discussion_r367661337 ## File path: processing/src/main/java/org/apache/druid/query/QueryToolChest.java ## @@ -269,4 +270,50 @@ public ObjectMapper decorateObjectMapper(final ObjectMapper objectMapper, final { return segments; } + + /** + * Returns a list of field names in the order than {@link #resultsAsArrays} would return them. The returned list will Review comment: Yes, it should. I updated it. Thanks.
[GitHub] [druid] clintropolis commented on a change in pull request #9130: Move result-to-array logic from SQL layer into QueryToolChests.
clintropolis commented on a change in pull request #9130: Move result-to-array logic from SQL layer into QueryToolChests. URL: https://github.com/apache/druid/pull/9130#discussion_r367259286 ## File path: processing/src/main/java/org/apache/druid/query/QueryToolChest.java ## @@ -269,4 +270,50 @@ public ObjectMapper decorateObjectMapper(final ObjectMapper objectMapper, final { return segments; } + + /** + * Returns a list of field names in the order than {@link #resultsAsArrays} would return them. The returned list will Review comment: nit: should this be 'Returns a list of field names in the order _that_ ...'
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367659013 ## File path: website/package-lock.json ## @@ -3913,8 +3913,7 @@ "ansi-regex": { "version": "2.1.1", "bundled": true, - "dev": true, Review comment: Oops, this is not supposed to be added. Reverted all changed in this file.
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367659050 ## File path: docs/ingestion/data-formats.md ## @@ -63,155 +65,968 @@ _TSV (Delimited)_ Note that the CSV and TSV data do not contain column heads. This becomes important when you specify the data for ingesting. +Besides text formats, Druid also supports binary formats such as [Orc](#orc) and [Parquet](#parquet) formats. + ## Custom Formats Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers. -## Configuration +## Input Format + +> The Input Format is a new way to specify the data format of your input data which was introduced in 0.17.0. +Unfortunately, the Input Format doesn't support all data formats or ingestion methods supported by Druid yet. +Especially if you want to use the Hadoop ingestion, you still need to use the [Parser](#parser-deprecated). +If your data is formatted in some format not listed in this section, please consider using the Parser instead. -All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the`parseSpec` entry in your `dataSchema`. +All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `inputFormat` entry in your [`ioConfig`](index.md#ioconfig). ### JSON +The `inputFormat` to load data of JSON format. An example is: + +```json +"ioConfig": { + "inputFormat": { +"type": "json" + }, + ... +} +``` + +The JSON `inputFormat` has the following components: + +| Field | Type | Description | Required | +|---|--|-|--| +| type | String | This should say `json`. 
| yes | +| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no | +| featureSpec | JSON Object | [JSON parser features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) supported by Jackson library. Those features will be applied when parsing the input JSON data. | no | + +### CSV + +The `inputFormat` to load data of the CSV format. An example is: + +```json +"ioConfig": { + "inputFormat": { +"type": "csv", +"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"] + }, + ... +} +``` + +The CSV `inputFormat` has the following components: + +| Field | Type | Description | Required | +|---|--|-|--| +| type | String | This should say `csv`. | yes | +| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default == ctrl+A) | +| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing | +| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) | +| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. 
| no (default = 0) | + +### TSV (Delimited) + +```json +"ioConfig": { + "inputFormat": { +"type": "tsv", +"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], +"delimiter":"|" + }, + ... +} +``` + +The `inputFormat` to load data of a delimited format. An example is: + +| Field | Type | Description | Required | +|---|--|-|--| +| type | String | This should say `tsv`. | yes | +| delimiter | String | A custom delimiter for data values. | no (default == `\t`) | +| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default == ctrl+A) | +| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing | +| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367659031 ## File path: docs/ingestion/index.md ## @@ -287,44 +289,31 @@ definition is an _ingestion spec_. Ingestion specs consists of three main components: -- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource), [input row parser](#parser), - [primary timestamp](#timestampspec), [flattening of nested data](#flattenspec) (if needed), - [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed). -- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and . For more information, see the +- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource), + [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed). +- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the documentation for each [ingestion method](#ingestion-methods). - [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each [ingestion method](#ingestion-methods). -Example ingestion spec for task type "index" (native batch): +Example ingestion spec for task type `parallel_index` (native batch): ``` { - "type": "index", + "type": "parallel_index", Review comment: Oops, thanks. Fixed.
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367658914 ## File path: docs/development/extensions-core/hdfs.md ## @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to [include](../../development/ext |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty| |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty| -If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work. +Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`) +in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`. + +If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work. If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically. -### Configuration for Google Cloud Storage +### Configuration for Cloud Storage + +You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS. + + Configuration for AWS S3 -The HDFS extension can also be used for GCS as deep storage. +To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly. 
|Property|Possible Values|Description|Default| ||---|---|---| -|`druid.storage.type`|hdfs||Must be set.| -|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.| +|`druid.storage.type`|hdfs| |Must be set.| +|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.| -All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in /lib/ and /extensions/druid-hdfs-storage/ +You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath. +Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes. -Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2. - - +```bash +java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"; +cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/ +``` -## Native batch ingestion +Finally, you need to add the below properties in the `core-site.xml`. +For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). + +```xml + + fs.s3a.impl + org.apache.hadoop.fs.s3a.S3AFileSystem + The implementation class of the S3A Filesystem + + + + fs.AbstractFileSystem.s3a.impl + org.apache.hadoop.fs.s3a.S3A + The implementation class of the S3A AbstractFileSystem. + + + + fs.s3a.access.key + AWS access key ID. Omit for IAM role-based or provider-based authentication. + your access key + + + + fs.s3a.secret.key + AWS secret key. 
Omit for IAM role-based or provider-based authentication. + your secret key + +``` -This firehose ingests events from a predefined list of files from a Hadoop filesystem. -This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task). -Since each split represents an HDFS file, each worker task of `index_parallel` will read an object. + Configuration for Google Cloud Storage -Sample spec: +To use the Google cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly. -```json -"firehose" : { -"type" : "hdfs", -"paths": "/foo/bar,/foo/baz" -} +|Property|Possible Values|Description|Default| +||---|---|---| +|`druid.storage.type`|hdfs||Must be set.| +|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the deep storage|Must be set.| + +All services that need to access GCS need to have the [GCS connector jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md) in their class path. +One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`. + +Finally, you need to add the below properties in the `core-site.xml`. +For
[GitHub] [druid] jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format
jihoonson commented on a change in pull request #9171: Doc update for the new input source and the new input format URL: https://github.com/apache/druid/pull/9171#discussion_r367658976 ## File path: docs/development/extensions-core/kafka-ingestion.md ## @@ -60,22 +60,16 @@ A sample supervisor spec is shown below: "type": "kafka", "dataSchema": { "dataSource": "metrics-kafka", -"parser": { Review comment: Thanks.
[GitHub] [druid] gianm commented on issue #9181: Speed up String first/last aggregators when folding isn't needed.
gianm commented on issue #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#issuecomment-575353132 @clintropolis Thanks for reviewing. I updated the patch to reflect your comments.
[GitHub] [druid] gianm commented on a change in pull request #9181: Speed up String first/last aggregators when folding isn't needed.
gianm commented on a change in pull request #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#discussion_r367656532 ## File path: processing/src/test/java/org/apache/druid/query/aggregation/last/StringLastBufferAggregatorTest.java ## @@ -81,6 +82,43 @@ public void testBufferAggregate() } + @Test + public void testBufferAggregateWithFoldCheck() + { +final long[] timestamps = {1526724600L, 1526724700L, 1526724800L, 1526725900L, 1526725000L}; +final String[] strings = {"", "", "", "", ""}; +Integer maxStringBytes = 1024; + +TestLongColumnSelector longColumnSelector = new TestLongColumnSelector(timestamps); +TestObjectColumnSelector objectColumnSelector = new TestObjectColumnSelector<>(strings); + +StringLastAggregatorFactory factory = new StringLastAggregatorFactory( +"billy", "billy", maxStringBytes +); + +StringLastBufferAggregator agg = new StringLastBufferAggregator( +longColumnSelector, +objectColumnSelector, +maxStringBytes, +true +); + +ByteBuffer buf = ByteBuffer.allocate(factory.getMaxIntermediateSize()); +int position = 0; + +agg.init(buf, position); +//noinspection ForLoopReplaceableByForEach +for (int i = 0; i < timestamps.length; i++) { + aggregateBuffer(longColumnSelector, objectColumnSelector, agg, buf, position); +} + +SerializablePairLongString sp = ((SerializablePairLongString) agg.get(buf, position)); + + +Assert.assertEquals("expectec last string value", "", sp.rhs); Review comment: Updated.
[GitHub] [druid] gianm commented on a change in pull request #9181: Speed up String first/last aggregators when folding isn't needed.
gianm commented on a change in pull request #9181: Speed up String first/last aggregators when folding isn't needed. URL: https://github.com/apache/druid/pull/9181#discussion_r367656434 ## File path: processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java ## @@ -33,23 +36,63 @@ { private static final int NULL_VALUE = -1; + /** + * Shorten "s" to "maxBytes" chars. Fast and loose because these are *chars* not *bytes*. Use + * {@link #chop(String, int)} for slower, but accurate chopping. + */ + @Nullable + public static String fastLooseChop(@Nullable final String s, final int maxBytes) + { +if (s == null || s.length() <= maxBytes) { + return s; +} else { + return s.substring(0, maxBytes); +} + } + + /** + * Shorten "s" to what could fit in "maxBytes" bytes as UTF-8. + */ @Nullable public static String chop(@Nullable final String s, final int maxBytes) { if (s == null) { return null; } else { - // Shorten firstValue to what could fit in maxBytes as UTF-8. final byte[] bytes = new byte[maxBytes]; final int len = StringUtils.toUtf8WithLimit(s, ByteBuffer.wrap(bytes)); return new String(bytes, 0, len, StandardCharsets.UTF_8); } } + /** + * Returns whether a given value selector *might* contain SerializablePairLongString objects. + */ + public static boolean selectorNeedsFoldCheck( + final BaseObjectColumnValueSelector valueSelector, + @Nullable final ColumnCapabilities valueSelectorCapabilities + ) + { +if (valueSelectorCapabilities != null && valueSelectorCapabilities.getType() != ValueType.COMPLEX) { + // Known, non-complex type. + return false; +} + +if (valueSelector instanceof NilColumnValueSelector) { + // Nil column, definitely no SerializablePairLongStrings. + return false; +} + +// Check if the reported class could possibly be SerializablePairLongString. 
+final Class clazz = valueSelector.classOfObject(); +return clazz.isAssignableFrom(SerializablePairLongString.class) Review comment: I changed it to: ```java // Check if the selector class could possibly be a SerializablePairLongString (either a superclass or subclass). ```
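The comment under discussion hinges on `Class#isAssignableFrom`, which tests only one direction of the subtype relation; deciding whether a selector "might" produce a given type requires testing both directions (declared superclass or subclass). A minimal, self-contained sketch of that check — with hypothetical classes, simplified from the quoted Druid code:

```java
public class FoldCheckDemo
{
  static class Pair {}
  static class SerializablePairLongString extends Pair {}

  // A selector whose declared object class is a superclass of
  // SerializablePairLongString (it *might* hand back one at runtime),
  // or a subclass of it, could produce such objects; an unrelated
  // class cannot, so no fold check is needed for it.
  static boolean mightBePair(Class<?> clazz)
  {
    return clazz.isAssignableFrom(SerializablePairLongString.class)
           || SerializablePairLongString.class.isAssignableFrom(clazz);
  }

  public static void main(String[] args)
  {
    System.out.println(mightBePair(Object.class));                     // superclass -> true
    System.out.println(mightBePair(Pair.class));                       // superclass -> true
    System.out.println(mightBePair(SerializablePairLongString.class)); // itself -> true
    System.out.println(mightBePair(String.class));                     // unrelated -> false
  }
}
```

Either direction alone misses a case: `String.class.isAssignableFrom(Object.class)` is false even though an `Object`-typed selector could yield a `String`.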
[GitHub] [druid] gianm opened a new pull request #9196: Add javadocs and small improvements to join code.
gianm opened a new pull request #9196: Add javadocs and small improvements to join code. URL: https://github.com/apache/druid/pull/9196 A follow-up to #9111.
[GitHub] [druid] gianm merged pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm merged pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111
[druid] branch master updated (09efd20 -> a87db7f)
This is an automated email from the ASF dual-hosted git repository.

gian pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/druid.git.

    from 09efd20  fix refresh button (#9195)
     add a87db7f  Add HashJoinSegment, a virtual segment for joins. (#9111)

No new revisions were added by this update.

Summary of changes:
 .../apache/druid/common/config/NullHandling.java   | 23 +
 .../java/org/apache/druid/math/expr/Exprs.java     | 71 +
 .../druid/common/config/NullHandlingTest.java      | 90 ++
 .../java/org/apache/druid/math/expr/ExprsTest.java | 99 ++
 .../apache/druid/server/lookup/LoadingLookup.java  | 38 +-
 .../apache/druid/server/lookup/PollingLookup.java  | 13 +
 processing/pom.xml                                 | 5 +
 .../query/dimension/DefaultDimensionSpec.java      | 7 +
 .../druid/query/dimension/DimensionSpec.java       | 6 +
 .../query/dimension/ExtractionDimensionSpec.java   | 6 +
 .../query/dimension/ListFilteredDimensionSpec.java | 7 +
 .../druid/query/dimension/LookupDimensionSpec.java | 15 +
 .../dimension/PrefixFilteredDimensionSpec.java     | 7 +
 .../dimension/RegexFilteredDimensionSpec.java      | 7 +
 .../druid/query/extraction/MapLookupExtractor.java | 12 +
 ...VectorValueMatcherColumnProcessorFactory.java}  | 20 +-
 .../druid/query/groupby/GroupByQueryHelper.java    | 1 +
 .../epinephelinae/RowBasedGrouperHelper.java       | 9 +-
 ...va => GroupByVectorColumnProcessorFactory.java} | 20 +-
 .../epinephelinae/vector/VectorGroupByEngine.java  | 2 +-
 .../apache/druid/query/lookup/LookupExtractor.java | 13 +-
 .../timeseries/TimeseriesQueryQueryToolChest.java  | 2 +-
 .../druid/segment/ColumnProcessorFactory.java      | 56 +
 .../org/apache/druid/segment/ColumnProcessors.java | 144 ++
 .../druid/segment/DimensionHandlerUtils.java       | 24 +-
 .../segment/QueryableIndexStorageAdapter.java      | 16 +-
 .../ColumnCapabilities.java => RowAdapter.java}    | 28 +-
 .../RowBasedColumnSelectorFactory.java             | 85 +-
 .../org/apache/druid/segment/StorageAdapter.java   | 2 -
 .../VectorColumnProcessorFactory.java}             | 23 +-
 .../org/apache/druid/segment/VirtualColumns.java   | 25 +-
 .../druid/segment/column/ColumnCapabilities.java   | 2 +-
 .../apache/druid/segment/filter/BoundFilter.java   | 4 +-
 .../segment/filter/DimensionPredicateFilter.java   | 4 +-
 .../org/apache/druid/segment/filter/InFilter.java  | 4 +-
 .../apache/druid/segment/filter/LikeFilter.java    | 4 +-
 .../druid/segment/filter/SelectorFilter.java       | 4 +-
 .../segment/incremental/IncrementalIndex.java      | 2 +-
 .../IncrementalIndexStorageAdapter.java            | 6 -
 .../org/apache/druid/segment/join/Equality.java    | 60 +
 .../apache/druid/segment/join/HashJoinEngine.java  | 211 +++
 .../apache/druid/segment/join/HashJoinSegment.java | 98 ++
 .../join/HashJoinSegmentStorageAdapter.java        | 279
 .../druid/segment/join/JoinConditionAnalysis.java  | 182 +++
 .../org/apache/druid/segment/join/JoinMatcher.java | 83 ++
 .../org/apache/druid/segment/join/JoinType.java    | 89 ++
 .../org/apache/druid/segment/join/Joinable.java    | 74 ++
 .../apache/druid/segment/join/JoinableClause.java  | 145 ++
 .../join/PossiblyNullColumnValueSelector.java      | 86 ++
 .../join/PossiblyNullDimensionSelector.java        | 191 +++
 .../apache/druid/segment/join/PostJoinCursor.java  | 121 ++
 .../join/lookup/LookupColumnSelectorFactory.java   | 113 ++
 .../segment/join/lookup/LookupJoinMatcher.java     | 312 +
 .../druid/segment/join/lookup/LookupJoinable.java  | 86 ++
 .../table/IndexedTable.java}                       | 51 +-
 .../table/IndexedTableColumnSelectorFactory.java   | 104 ++
 .../table/IndexedTableColumnValueSelector.java     | 132 ++
 .../join/table/IndexedTableDimensionSelector.java  | 144 ++
 .../join/table/IndexedTableJoinMatcher.java        | 310 +
 .../segment/join/table/IndexedTableJoinable.java   | 78 ++
 .../segment/join/table/RowBasedIndexedTable.java   | 166 +++
 .../join/table/SortedIntIntersectionIterator.java  | 98 ++
 .../druid/segment/transform/Transformer.java       | 2 +-
 .../druid/segment/virtual/ExpressionSelectors.java | 3 +-
 .../query/extraction/MapLookupExtractorTest.java   | 26 +-
 .../topn/TopNMetricSpecOptimizationsTest.java      | 6 -
 .../druid/segment/filter/BaseFilterTest.java       | 2 +-
 .../join/HashJoinSegmentStorageAdapterTest.java    | 1390
 .../druid/segment/join/HashJoinSegmentTest.java    | 138 ++
 .../segment/join/JoinConditionAnalysisTest.java    | 293 +
 .../apache/druid/segment/join/JoinTestHelper.java  | 351 +
 .../druid/segment/join/JoinableClauseTest.java     | 113 ++
 .../join/PossiblyNullDimensionSelectorTest.java    | 143 ++
 .../join/table/RowBasedIndexedTableTest.java       | 183 +++
[GitHub] [druid] jon-wei commented on a change in pull request #9187: Implement ANY aggregator
jon-wei commented on a change in pull request #9187: Implement ANY aggregator URL: https://github.com/apache/druid/pull/9187#discussion_r367652349 ## File path: docs/querying/sql.md ## @@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT. |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| |`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| +|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`| Review comment: Hm, looks like the docs are out of date for those, we can fix those later This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
jihoonson commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367647052 ## File path: processing/src/main/java/org/apache/druid/segment/join/table/SortedIntIntersectionIterator.java ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment.join.table; + +import com.google.common.base.Preconditions; +import it.unimi.dsi.fastutil.ints.IntIterator; + +import java.util.Arrays; +import java.util.NoSuchElementException; + +/** + * Iterates over the intersection of an array of sorted int lists. Intended for situations where the number Review comment: Oh, I missed that part. Sounds good.
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367641594 ## File path: processing/src/main/java/org/apache/druid/segment/join/table/SortedIntIntersectionIterator.java ## @@ -0,0 +1,98 @@ +/** + * Iterates over the intersection of an array of sorted int lists. Intended for situations where the number Review comment: Even though the next sentence says "The iterators must be composed of ascending, nonnegative ints."?
[GitHub] [druid] jihoonson commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
jihoonson commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367591385 ## File path: processing/src/main/java/org/apache/druid/segment/join/table/SortedIntIntersectionIterator.java ## @@ -0,0 +1,98 @@ +/** + * Iterates over the intersection of an array of sorted int lists. Intended for situations where the number Review comment: nit: probably better to be `sorted positive int lists`.
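The k-way sorted-list intersection being reviewed above can be sketched in isolation. This is a simplified stand-alone version, not Druid's SortedIntIntersectionIterator (which works over fastutil `IntIterator`s); it assumes, as the quoted javadoc discussion requires, that every list is composed of ascending, nonnegative ints.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: intersect k ascending int lists by taking a candidate from the first
// list and advancing every other list until it reaches or passes the candidate.
public class SortedIntIntersection
{
  public static List<Integer> intersect(List<int[]> sortedLists)
  {
    List<Integer> result = new ArrayList<>();
    if (sortedLists.isEmpty()) {
      return result;
    }
    int[] positions = new int[sortedLists.size()];
    outer:
    while (positions[0] < sortedLists.get(0).length) {
      int candidate = sortedLists.get(0)[positions[0]];
      for (int i = 1; i < sortedLists.size(); i++) {
        int[] list = sortedLists.get(i);
        // Advance list i until it reaches or passes the candidate.
        while (positions[i] < list.length && list[positions[i]] < candidate) {
          positions[i]++;
        }
        if (positions[i] >= list.length) {
          break outer; // One list is exhausted; no further matches are possible.
        }
        if (list[positions[i]] > candidate) {
          positions[0]++; // Candidate missing from list i; try the next candidate.
          continue outer;
        }
      }
      result.add(candidate); // Present in every list.
      positions[0]++;
    }
    return result;
  }

  public static void main(String[] args)
  {
    List<int[]> lists = List.of(
        new int[]{1, 2, 3, 6, 9},
        new int[]{2, 3, 4, 6, 10},
        new int[]{0, 2, 6, 9}
    );
    System.out.println(intersect(lists)); // [2, 6]
  }
}
```

Because each position only moves forward, the whole intersection is linear in the total number of elements, which is why the ascending-order precondition matters.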
[GitHub] [druid] jihoonson commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
jihoonson commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367608679 ## File path: processing/src/main/java/org/apache/druid/segment/join/HashJoinEngine.java ## @@ -0,0 +1,211 @@ +package org.apache.druid.segment.join; + +import org.apache.druid.query.BaseQuery; +import org.apache.druid.query.dimension.DimensionSpec; +import org.apache.druid.segment.ColumnSelectorFactory; +import org.apache.druid.segment.ColumnValueSelector; +import org.apache.druid.segment.Cursor; +import org.apache.druid.segment.DimensionSelector; +import org.apache.druid.segment.column.ColumnCapabilities; +import org.joda.time.DateTime; + +import javax.annotation.Nonnull; +import javax.annotation.Nullable; + +public class HashJoinEngine +{ + private HashJoinEngine() + { +// No instantiation. + } + + /** + * Creates a cursor that represents the join of {@param leftCursor} with {@param joinableClause}.
The resulting + * cursor may generate nulls on the left-hand side (for righty joins; see {@link JoinType#isRighty()}) or on + * the right-hand side (for lefty joins; see {@link JoinType#isLefty()}). Columns that start with the + * joinable clause's prefix (see {@link JoinableClause#getPrefix()}) will come from the Joinable's column selector + * factory, and all other columns will come from the leftCursor's column selector factory. + * + * Ensuing that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the + * responsibility of the caller. + */ + public static Cursor makeJoinCursor(final Cursor leftCursor, final JoinableClause joinableClause) + { +final ColumnSelectorFactory leftColumnSelectorFactory = leftCursor.getColumnSelectorFactory(); +final JoinMatcher joinMatcher = joinableClause.getJoinable() + .makeJoinMatcher( + leftColumnSelectorFactory, + joinableClause.getCondition(), + joinableClause.getJoinType().isRighty() + ); + +class JoinColumnSelectorFactory implements ColumnSelectorFactory +{ + @Override + public DimensionSelector makeDimensionSelector(DimensionSpec dimensionSpec) + { +if (joinableClause.includesColumn(dimensionSpec.getDimension())) { + return joinMatcher.getColumnSelectorFactory() +.makeDimensionSelector( + dimensionSpec.withDimension(joinableClause.unprefix(dimensionSpec.getDimension())) +); +} else { + final DimensionSelector leftSelector = leftColumnSelectorFactory.makeDimensionSelector(dimensionSpec); + + if (!joinableClause.getJoinType().isRighty()) { +return leftSelector; + } else { +return new PossiblyNullDimensionSelector(leftSelector, joinMatcher::matchingRemainder); + } +} + } + + @Override + public ColumnValueSelector makeColumnValueSelector(String column) + { +if (joinableClause.includesColumn(column)) { + return joinMatcher.getColumnSelectorFactory().makeColumnValueSelector(joinableClause.unprefix(column)); +} else { + final ColumnValueSelector leftSelector = 
leftColumnSelectorFactory.makeColumnValueSelector(column); + + if (!joinableClause.getJoinType().isRighty()) { +return leftSelector; + } else { +return new PossiblyNullColumnValueSelector<>(leftSelector, joinMatcher::matchingRemainder); + } +} + } + + @Nullable + @Override + public ColumnCapabilities getColumnCapabilities(String column) + { +if (joinableClause.includesColumn(column)) { + return joinMatcher.getColumnSelectorFactory().getColumnCapabilities(joinableClause.unprefix(column)); +} else { +
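The column routing in the quoted makeJoinCursor (prefixed columns go to the joined side after stripping the prefix, everything else goes to the left cursor) can be illustrated outside of Druid's selector-factory machinery. The class and parameter names below are hypothetical stand-ins, not Druid's API:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for HashJoinEngine's column routing: columns whose names
// start with the join clause's prefix are read from the right-hand (joined) side
// after stripping the prefix; all other columns come from the left cursor.
public class PrefixColumnRouter
{
  private final String prefix;
  private final Function<String, Object> leftSide;
  private final Function<String, Object> rightSide;

  public PrefixColumnRouter(String prefix, Function<String, Object> leftSide, Function<String, Object> rightSide)
  {
    this.prefix = prefix;
    this.leftSide = leftSide;
    this.rightSide = rightSide;
  }

  public Object read(String column)
  {
    if (column.startsWith(prefix)) {
      // Strip the prefix before asking the right-hand side, mirroring unprefix().
      return rightSide.apply(column.substring(prefix.length()));
    }
    return leftSide.apply(column);
  }

  public static void main(String[] args)
  {
    Map<String, Object> left = Map.of("countryIsoCode", "US");
    Map<String, Object> right = Map.of("countryName", "United States");
    PrefixColumnRouter router = new PrefixColumnRouter("c.", left::get, right::get);
    System.out.println(router.read("countryIsoCode")); // US
    System.out.println(router.read("c.countryName"));  // United States
  }
}
```

Note how this routing also explains the shadowing caveat discussed later in the thread: a left-side column whose name happens to start with the prefix would be routed to the right side and become unreachable.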
[GitHub] [druid] maytasm3 commented on a change in pull request #9187: Implement ANY aggregator
maytasm3 commented on a change in pull request #9187: Implement ANY aggregator URL: https://github.com/apache/druid/pull/9187#discussion_r367628285 ## File path: sql/src/test/java/org/apache/druid/sql/calcite/util/CalciteTests.java ## @@ -377,6 +377,15 @@ public AuthenticationResult createEscalatedAuthenticationResult() ); public static final List ROWS1_WITH_NUMERIC_DIMS = ImmutableList.of( + createRow( Review comment: Actually, I think it's fine to just test with the same numfoo datasource (with first row being non-null)
[GitHub] [druid] maytasm3 commented on a change in pull request #9187: Implement ANY aggregator
maytasm3 commented on a change in pull request #9187: Implement ANY aggregator URL: https://github.com/apache/druid/pull/9187#discussion_r367624495 ## File path: docs/querying/sql.md ## @@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT. |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| |`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| +|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`| +|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| Review comment: Let's discuss. 
We can change this behaviour for LATEST, EARLIEST (and ANY).
[GitHub] [druid] maytasm3 commented on a change in pull request #9187: Implement ANY aggregator
maytasm3 commented on a change in pull request #9187: Implement ANY aggregator URL: https://github.com/apache/druid/pull/9187#discussion_r367624288 ## File path: docs/querying/sql.md ## @@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT. |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| |`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| +|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`| +|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. 
This parameter should be set as low as possible, since high values will lead to wasted memory.| Review comment: Currently, the implementation for LATEST, EARLIEST (and ANY since I based it off LATEST, EARLIEST) is that if you use the json stuff, then maxStringBytes is optional and if not present will default to 1024 (as per the docs in docs/querying/aggregations.md). However, this does not work the same if you issue the query through SQL. To use LATEST, EARLIEST (and ANY) in SQL, you must give the maxStringBytes as the second argument. If you do not, then the column actually gets cast into double (super weird).
[GitHub] [druid] jon-wei commented on a change in pull request #9187: Implement ANY aggregator
jon-wei commented on a change in pull request #9187: Implement ANY aggregator URL: https://github.com/apache/druid/pull/9187#discussion_r367616811 ## File path: docs/querying/sql.md ## @@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT. |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| |`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| +|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`| +|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. 
This parameter should be set as low as possible, since high values will lead to wasted memory.| Review comment: you have this block in StringAnyAggregatorFactory:

```
this.maxStringBytes = maxStringBytes == null
                      ? StringFirstAggregatorFactory.DEFAULT_MAX_STRING_SIZE
                      : maxStringBytes;
```

I would give the SQL function consistent behavior.
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367616393 ## File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTable.java ## @@ -0,0 +1,53 @@ +package org.apache.druid.segment.join.table; + +import it.unimi.dsi.fastutil.ints.IntList; +import org.apache.druid.segment.column.ValueType; + +import javax.annotation.Nullable; +import java.util.List; +import java.util.Map; + +public interface IndexedTable Review comment: I'm thinking of adding this:

```java
/**
 * An interface to a table where some columns (the 'key columns') have indexes that enable fast lookups.
 *
 * The main user of this class is {@link IndexedTableJoinable}, and its main purpose is to participate in joins.
 */
public interface IndexedTable
{
  /**
   * Returns the columns of this table that have indexes.
   */
  List<String> keyColumns();

  /**
   * Returns all columns of this table, including the key and non-key columns.
   */
  List<String> allColumns();

  /**
   * Returns the signature of this table: a map where each key is a column from {@link #allColumns()} and each value
   * is a type code.
   */
  Map<String, ValueType> rowSignature();

  /**
   * Returns the number of rows in this table. It must not change over time, since it is used for things like algorithm
   * selection and reporting of cardinality metadata.
   */
  int numRows();

  /**
   * Returns the index for a particular column. The provided column number must be that column's position in
   * {@link #allColumns()}.
   */
  Index columnIndex(int column);

  /**
   * Returns a reader for a particular column. The provided column number must be that column's position in
   * {@link #allColumns()}.
   */
  Reader columnReader(int column);

  /**
   * Indexes support fast lookups on key columns.
   */
  interface Index
  {
    /**
     * Returns the list of row numbers where the column this Reader is based on contains 'key'.
     */
    IntList find(Object key);
  }

  /**
   * Readers support reading values out of any column.
   */
  interface Reader
  {
    /**
     * Read the value at a particular row number. Throws an exception if the row is out of bounds (must be between zero
     * and {@link #numRows()}).
     */
    @Nullable
    Object read(int row);
  }
}
```
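The `Index.find(key) -> row numbers` contract proposed above can be demonstrated with a toy in-memory index. This is only an illustration of the idea, not Druid's RowBasedIndexedTable; the class name is invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy key-column index: maps each key value to the ascending list of row numbers
// containing it, supporting a find(key) -> row-numbers lookup as described above.
public class ToyKeyIndex
{
  private final Map<Object, List<Integer>> index = new HashMap<>();

  public ToyKeyIndex(List<Object> keyColumnValues)
  {
    // Row numbers are appended in order, so each bucket is naturally sorted.
    for (int row = 0; row < keyColumnValues.size(); row++) {
      index.computeIfAbsent(keyColumnValues.get(row), k -> new ArrayList<>()).add(row);
    }
  }

  /** Returns the row numbers where the key column contains 'key' (empty if none). */
  public List<Integer> find(Object key)
  {
    return index.getOrDefault(key, List.of());
  }

  public static void main(String[] args)
  {
    ToyKeyIndex idx = new ToyKeyIndex(List.of("US", "FR", "US", "DE"));
    System.out.println(idx.find("US")); // [0, 2]
    System.out.println(idx.find("JP")); // []
  }
}
```

A join matcher can then probe such an index once per left-hand row, which is the "fast lookups on key columns" property the interface javadoc calls out.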
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367615823 ## File path: processing/src/main/java/org/apache/druid/segment/join/HashJoinEngine.java ## @@ -0,0 +1,211 @@ +public class HashJoinEngine +{ + private HashJoinEngine() + { +// No instantiation. + } + + /** + * Creates a cursor that represents the join of {@param leftCursor} with {@param joinableClause}.
The resulting + * cursor may generate nulls on the left-hand side (for righty joins; see {@link JoinType#isRighty()}) or on + * the right-hand side (for lefty joins; see {@link JoinType#isLefty()}). Columns that start with the + * joinable clause's prefix (see {@link JoinableClause#getPrefix()}) will come from the Joinable's column selector + * factory, and all other columns will come from the leftCursor's column selector factory. + * + * Ensuing that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the Review comment: Oops, yeah, that's a typo. It should be "ensuring". Is this clearer?

```java
/**
 * Ensuring that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the
 * responsibility of the caller. If there is such a conflict (for example, if the joinable clause's prefix is "j.",
 * and the leftCursor has a field named "j.j.abrams"), then the field from the leftCursor will be shadowed and will
 * not be queryable through the returned Cursor. This happens even if the right-hand joinable doesn't actually have a
 * column with this name.
 */
```
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367615823 ## File path: processing/src/main/java/org/apache/druid/segment/join/HashJoinEngine.java ## @@ -0,0 +1,211 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment.join; + +import org.apache.druid.query.BaseQuery; +import org.apache.druid.query.dimension.DimensionSpec; +import org.apache.druid.segment.ColumnSelectorFactory; +import org.apache.druid.segment.ColumnValueSelector; +import org.apache.druid.segment.Cursor; +import org.apache.druid.segment.DimensionSelector; +import org.apache.druid.segment.column.ColumnCapabilities; +import org.joda.time.DateTime; + +import javax.annotation.Nonnull; +import javax.annotation.Nullable; + +public class HashJoinEngine +{ + private HashJoinEngine() + { +// No instantiation. + } + + /** + * Creates a cursor that represents the join of {@param leftCursor} with {@param joinableClause}. 
The resulting + * cursor may generate nulls on the left-hand side (for righty joins; see {@link JoinType#isRighty()}) or on + * the right-hand side (for lefty joins; see {@link JoinType#isLefty()}). Columns that start with the + * joinable clause's prefix (see {@link JoinableClause#getPrefix()}) will come from the Joinable's column selector + * factory, and all other columns will come from the leftCursor's column selector factory. + * + * Ensuing that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the Review comment: Oops, yeah, that's a typo. It should be "ensuring". Is this clearer? ```java * Ensuring that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the * responsibility of the caller. If there is such a conflict (for example, if the joinable clause's prefix is "j.", * and the leftCursor has a field named "j.j.abrams"), then the field from the leftCursor will be shadowed and will * not be queryable through the returned Cursor. This happens even if the right-hand joinable doesn't actually have a * column with this name. ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
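The shadowing rule described in this comment can be sketched as a tiny routing helper (the class and method names here are hypothetical, not from the PR): any column name that starts with the clause prefix resolves to the right-hand Joinable, so a left-hand column that happens to start with the prefix becomes unreachable.

```java
// Hypothetical sketch of the prefix-routing rule: columns starting with the
// join clause's prefix are served by the right-hand Joinable's selector
// factory (with the prefix stripped); all other columns come from the left
// cursor. A left-hand column literally named "j.j.abrams" is therefore
// shadowed when the prefix is "j.", even if the right side has no such column.
public class PrefixRoutingSketch
{
  public static String resolve(String column, String prefix)
  {
    return column.startsWith(prefix)
           ? "right:" + column.substring(prefix.length())
           : "left:" + column;
  }
}
```

For example, `resolve("j.j.abrams", "j.")` routes to the right side as `j.abrams`, which is exactly the shadowing case the javadoc warns about.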
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367614421 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java ## @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment.join; + +import com.google.common.base.Preconditions; +import org.apache.druid.java.util.common.Pair; +import org.apache.druid.math.expr.Expr; +import org.apache.druid.math.expr.ExprMacroTable; +import org.apache.druid.math.expr.Exprs; +import org.apache.druid.math.expr.Parser; +import org.apache.druid.query.expression.ExprUtils; + +import java.util.ArrayList; +import java.util.List; +import java.util.Objects; +import java.util.Optional; + +/** + * Represents analysis of a join condition. + * + * Each condition is decomposed into "equiConditions" and "nonEquiConditions". + * + * 1) The equiConditions are of the form ExpressionOfLeft = ColumnFromRight. The right-hand part cannot be an expression + * because we use this analysis to determine if we can perform the join using hashtables built off right-hand-side + * columns. 
+ * + * 2) The nonEquiConditions are other conditions that should also be ANDed together + * + * All of these conditions are ANDed together to get the overall condition. + */ +public class JoinConditionAnalysis +{ + private final String originalExpression; + private final List equiConditions; + private final List nonEquiConditions; + + private JoinConditionAnalysis( + final String originalExpression, + final List equiConditions, + final List nonEquiConditions + ) + { +this.originalExpression = Preconditions.checkNotNull(originalExpression, "originalExpression"); +this.equiConditions = equiConditions; +this.nonEquiConditions = nonEquiConditions; + } + + public static JoinConditionAnalysis forExpression( + final String condition, Review comment: I'm thinking of adding this javadoc: ```java /** * Analyze a join condition. * * @param condition the condition expression * @param rightPrefix prefix for the right-hand side of the join; will be used to determine which identifiers in *the condition come from the right-hand side and which come from the left-hand side * @param macroTable macro table for parsing the condition expression */ ```
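The equi/non-equi decomposition described in the quoted class javadoc can be sketched over pre-split conjuncts (a simplified, hypothetical helper; the real analysis parses expressions rather than strings): a conjunct counts as an equi-condition only if its right-hand side is a bare column carrying the right prefix, since those are the ones usable for hash-table lookups.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the decomposition: equi-conditions are equalities
// whose right-hand side is a bare right-prefixed column (hash-joinable);
// everything else falls into the non-equi bucket. All conjuncts together
// are ANDed to form the overall condition.
public class ConditionSplitSketch
{
  public static List<String> equiConditions(List<String> conjuncts, String rightPrefix)
  {
    List<String> equi = new ArrayList<>();
    for (String c : conjuncts) {
      String[] sides = c.split("==");
      String rhs = sides.length == 2 ? sides[1].trim() : "";
      // Equi iff the right side is a plain column name with the right prefix,
      // not an arbitrary expression.
      if (rhs.startsWith(rightPrefix) && rhs.substring(rightPrefix.length()).matches("\\w+")) {
        equi.add(c);
      }
    }
    return equi;
  }

  public static List<String> nonEquiConditions(List<String> conjuncts, String rightPrefix)
  {
    List<String> result = new ArrayList<>(conjuncts);
    result.removeAll(equiConditions(conjuncts, rightPrefix));
    return result;
  }
}
```

So for conjuncts `x == j.y` and `z > 3` with right prefix `j.`, only the first is an equi-condition.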
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367613419 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java ## @@ -0,0 +1,182 @@ + public static JoinConditionAnalysis forExpression( + final String condition, Review comment: Yes, and that's because the way to think about the prefixes is that they aren't table names (which might be present or not), they are column name prefixes. They are mandatory.
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367612949 ## File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTable.java ## @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment.join.table; + +import it.unimi.dsi.fastutil.ints.IntList; +import org.apache.druid.segment.column.ValueType; + +import javax.annotation.Nullable; +import java.util.List; +import java.util.Map; + +public interface IndexedTable Review comment: Sure, good call.
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367608934 ## File path: processing/src/main/java/org/apache/druid/segment/ColumnProcessorFactory.java ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment; + +import org.apache.druid.query.dimension.ColumnSelectorStrategyFactory; +import org.apache.druid.segment.column.ValueType; + +/** + * Class that encapsulates knowledge about how to create "column processors", which are... objects that process columns + * and want to have type-specific logic. Used by {@link ColumnProcessors#makeProcessor}. + * + * Column processors can be any type "T". The idea is that a ColumnProcessorFactory embodies the logic for wrapping + * and processing selectors of various types, and so enables nice code design, where type-dependent code is not + * sprinkled throughout. 
+ * + * @see VectorColumnProcessorFactory the vectorized version + * @see ColumnProcessors#makeProcessor which uses these, and which is responsible for + * determining which type of selector to use for a given column + * @see ColumnSelectorStrategyFactory which serves a similar purpose and may be replaced by this in the future + * @see DimensionHandlerUtils#createColumnSelectorPluses which accepts {@link ColumnSelectorStrategyFactory} and is + * similar to {@link ColumnProcessors#makeProcessor} + */ +public interface ColumnProcessorFactory +{ + /** + * This default type will be used when the underlying column has an unknown type. + */ + ValueType defaultType(); Review comment: I'm thinking about adding this javadoc: ```java /** * This default type will be used when the underlying column has an unknown type. * * This allows a column processor factory to specify what type it prefers to deal with (the most 'natural' type for * whatever it is doing) when all else is equal. */ ```
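The role of `defaultType()` as a tie-breaker for unknown column types can be illustrated with a stripped-down dispatch sketch (all names here are hypothetical stand-ins, not the actual Druid interfaces):

```java
// Hypothetical mini version of the column-processor pattern: a factory
// supplies type-specific handlers plus a default type that is used when the
// underlying column's type is unknown (the factory's most "natural" type).
public class TypeDispatchSketch
{
  enum ValueType { STRING, LONG }

  interface Factory<T>
  {
    ValueType defaultType();

    T makeStringProcessor(String column);

    T makeLongProcessor(String column);
  }

  static <T> T makeProcessor(String column, ValueType knownTypeOrNull, Factory<T> factory)
  {
    // Unknown column type: fall back to the factory's preferred default.
    ValueType type = knownTypeOrNull != null ? knownTypeOrNull : factory.defaultType();
    switch (type) {
      case STRING:
        return factory.makeStringProcessor(column);
      case LONG:
        return factory.makeLongProcessor(column);
      default:
        throw new IllegalStateException("unsupported type: " + type);
    }
  }

  // Sample factory whose natural type is STRING (an assumption for the demo).
  static final Factory<String> SAMPLE = new Factory<String>()
  {
    @Override
    public ValueType defaultType()
    {
      return ValueType.STRING;
    }

    @Override
    public String makeStringProcessor(String column)
    {
      return "string:" + column;
    }

    @Override
    public String makeLongProcessor(String column)
    {
      return "long:" + column;
    }
  };
}
```

The type-dependent branching lives in one place, which is the design benefit the class javadoc describes.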
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367607423 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment.join; + +import com.google.common.base.Preconditions; +import org.apache.druid.java.util.common.IAE; + +import javax.annotation.Nullable; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; + +/** + * Represents everything about a join clause except for the left-hand datasource. In other words, if the full join + * clause is "t1 JOIN t2 ON t1.x = t2.x" then this class represents "JOIN t2 ON x = t2.x" -- it does not include + * references to the left-hand "t1". + */ +public class JoinableClause +{ + private final String prefix; Review comment: I'm planning to add this javadoc, which'll make it clearer: ```java /** * The prefix to apply to all columns from the Joinable. 
The idea is that during a join, any columns that start with * this prefix should be retrieved from our Joinable's {@link JoinMatcher#getColumnSelectorFactory()}. Any other * columns should be returned from the left-hand side of the join. * * The prefix can be any string, as long as it is nonempty and not itself a prefix of the reserved column name * {@code __time}. * * @see #getAvailableColumnsPrefixed() the list of columns from our {@link Joinable} with prefixes attached * @see #unprefix a method for removing prefixes */ ```
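The prefix constraint in the proposed javadoc (nonempty, and not itself a prefix of the reserved `__time` column) reduces to a one-line check, sketched here with a hypothetical helper name:

```java
// Hypothetical check for the constraint quoted above: a join prefix must be
// nonempty and must not itself be a prefix of the reserved column name
// "__time", since such a prefix would capture the time column.
public class PrefixValiditySketch
{
  public static boolean isValidPrefix(String prefix)
  {
    return !prefix.isEmpty() && !"__time".startsWith(prefix);
  }
}
```

Note that `"_"` is rejected (it is a prefix of `__time`) while a SQL-layer prefix like `_j0.` passes.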
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367606130 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java ## @@ -0,0 +1,145 @@ +public class JoinableClause +{ + private final String prefix; + private final Joinable joinable; + private final JoinType joinType; + private final JoinConditionAnalysis condition; + + public JoinableClause(@Nullable String prefix, Joinable joinable, JoinType joinType, JoinConditionAnalysis condition) + { +this.prefix = prefix != null ? prefix : ""; +this.joinable = Preconditions.checkNotNull(joinable, "joinable"); +this.joinType = Preconditions.checkNotNull(joinType, "joinType"); +this.condition = Preconditions.checkNotNull(condition, "condition"); + } + + /** + * The prefix to apply to all columns from the Joinable. + */ + public String getPrefix() + { +return prefix; + } + + /** + * The right-hand Joinable. + */ + public Joinable getJoinable() + { +return joinable; + } + + /** + * The type of join: LEFT, RIGHT, INNER, or FULL. + */ + public JoinType getJoinType() + { +return joinType; + } + + /** + * The join condition. When referring to right-hand columns, it should include the prefix. + */ + public JoinConditionAnalysis getCondition() + { +return condition; + } + + /** + * Returns a list of columns from the underlying {@link Joinable#getAvailableColumns()} method, with our + * prefix ({@link #getPrefix()}) prepended. + */ + public List getAvailableColumnsPrefixed() Review comment: I actually like that the word "prefix" is in here since it makes the connection with `getPrefix` and `unprefix` more clear.
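The naming connection this comment defends (`getPrefix` / `getAvailableColumnsPrefixed` / `unprefix`) amounts to a simple round trip, sketched here with hypothetical standalone helpers:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the prefix naming relationship: prefixing prepends
// the clause prefix to every column the Joinable exposes, and unprefixing
// strips it back off.
public class PrefixNamingSketch
{
  public static List<String> availableColumnsPrefixed(List<String> columns, String prefix)
  {
    return columns.stream().map(c -> prefix + c).collect(Collectors.toList());
  }

  public static String unprefix(String prefixedColumn, String prefix)
  {
    return prefixedColumn.substring(prefix.length());
  }
}
```

With prefix `j.`, columns `x` and `y` become `j.x` and `j.y`, and `unprefix` inverts that.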
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367605830 ## File path: processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java ## @@ -0,0 +1,145 @@ + private final String prefix; Review comment: It's whatever the caller wants it to be, really. The SQL layer is gonna use strings like `_j0.`.
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins. URL: https://github.com/apache/druid/pull/9111#discussion_r367604965 ## File path: processing/src/main/java/org/apache/druid/segment/VectorColumnProcessorFactory.java ## @@ -17,25 +17,32 @@ * under the License. */ -package org.apache.druid.query.dimension; +package org.apache.druid.segment; import org.apache.druid.segment.vector.MultiValueDimensionVectorSelector; import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector; import org.apache.druid.segment.vector.VectorValueSelector; /** * Class that encapsulates knowledge about how to create vector column processors. Used by - * {@link org.apache.druid.segment.DimensionHandlerUtils#makeVectorProcessor}. + * {@link DimensionHandlerUtils#makeVectorProcessor}. + * + * Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method. The default type is + * always implicitly STRING. It also does not have a "makeComplexProcessor" method; instead, complex-typed columns Review comment: I imagined it's a temporary thing. I would eventually like the two column processor factory interfaces to match up better.
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
URL: https://github.com/apache/druid/pull/9111#discussion_r367605135

## File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
## @@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.join.table;
+
+import com.google.common.base.Preconditions;
+import it.unimi.dsi.fastutil.ints.IntIterator;
+import it.unimi.dsi.fastutil.ints.IntIterators;
+import it.unimi.dsi.fastutil.ints.IntRBTreeSet;
+import it.unimi.dsi.fastutil.ints.IntSet;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.segment.BaseDoubleColumnValueSelector;
+import org.apache.druid.segment.BaseFloatColumnValueSelector;
+import org.apache.druid.segment.BaseLongColumnValueSelector;
+import org.apache.druid.segment.BaseObjectColumnValueSelector;
+import org.apache.druid.segment.ColumnProcessorFactory;
+import org.apache.druid.segment.ColumnProcessors;
+import org.apache.druid.segment.ColumnSelectorFactory;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.column.ValueType;
+import org.apache.druid.segment.data.IndexedInts;
+import org.apache.druid.segment.join.Equality;
+import org.apache.druid.segment.join.JoinConditionAnalysis;
+import org.apache.druid.segment.join.JoinMatcher;
+
+import javax.annotation.Nullable;
+import java.util.Collections;
+import java.util.List;
+import java.util.NoSuchElementException;
+import java.util.function.Supplier;
+import java.util.stream.Collectors;
+
+public class IndexedTableJoinMatcher implements JoinMatcher
+{
+  private final IndexedTable table;
+  private final List<Supplier<IntIterator>> conditionMatchers;
+  private final IntIterator[] currentMatchedRows;
+  private final ColumnSelectorFactory selectorFactory;
+
+  // matchedRows and matchingRemainder are used to implement matchRemainder().
+  private final IntSet matchedRows;
+  private boolean matchingRemainder = false;
+
+  // currentIterator and currentRow are used to track iteration position through the currently-matched-rows.
+  @Nullable
+  private IntIterator currentIterator;
+  private int currentRow;
+
+  IndexedTableJoinMatcher(
+      final IndexedTable table,
+      final ColumnSelectorFactory leftSelectorFactory,
+      final JoinConditionAnalysis condition,
+      final boolean remainderNeeded
+  )
+  {
+    this.table = table;
+
+    if (condition.isAlwaysTrue()) {
+      this.conditionMatchers = Collections.singletonList(() -> IntIterators.fromTo(0, table.numRows()));
+    } else if (condition.isAlwaysFalse()) {
+      this.conditionMatchers = Collections.singletonList(() -> IntIterators.fromTo(0, 0));

Review comment: Yeah, good point.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
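The always-true and always-false branches above reduce to a single condition matcher: a supplier that yields either the full row range `[0, numRows)` or an empty range. A minimal JDK-only sketch of the same pattern, using `PrimitiveIterator.OfInt` in place of fastutil's `IntIterator` (the `matcher` and `count` helper names are hypothetical, not part of the PR):

```java
import java.util.PrimitiveIterator;
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class MatchAllOrNone {
    // Analogue of the conditionMatchers built in the constructor above:
    // an always-true condition matches every row, an always-false one matches none.
    static Supplier<PrimitiveIterator.OfInt> matcher(boolean alwaysTrue, int numRows) {
        return alwaysTrue
            ? () -> IntStream.range(0, numRows).iterator()  // all rows in [0, numRows)
            : () -> IntStream.empty().iterator();           // no rows
    }

    // Drain an iterator and count how many row ids it produced.
    static int count(PrimitiveIterator.OfInt it) {
        int n = 0;
        while (it.hasNext()) {
            it.nextInt();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        if (count(matcher(true, 5).get()) != 5) throw new AssertionError();
        if (count(matcher(false, 5).get()) != 0) throw new AssertionError();
    }
}
```

Using a `Supplier` means each call to the matcher produces a fresh iterator, so the same condition can be re-evaluated for every left-hand row without sharing iterator state.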
[GitHub] [druid] gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
gianm commented on a change in pull request #9111: Add HashJoinSegment, a virtual segment for joins.
URL: https://github.com/apache/druid/pull/9111#discussion_r367604763

## File path: processing/src/main/java/org/apache/druid/segment/ColumnProcessorFactory.java
## @@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import org.apache.druid.query.dimension.ColumnSelectorStrategyFactory;
+import org.apache.druid.segment.column.ValueType;
+
+/**
+ * Class that encapsulates knowledge about how to create "column processors", which are... objects that process columns
+ * and want to have type-specific logic. Used by {@link ColumnProcessors#makeProcessor}.
+ *
+ * Column processors can be any type "T". The idea is that a ColumnProcessorFactory embodies the logic for wrapping
+ * and processing selectors of various types, and so enables nice code design, where type-dependent code is not
+ * sprinkled throughout.
+ *
+ * @see VectorColumnProcessorFactory the vectorized version
+ * @see ColumnProcessors#makeProcessor which uses these, and which is responsible for
+ * determining which type of selector to use for a given column
+ * @see ColumnSelectorStrategyFactory which serves a similar purpose and may be replaced by this in the future
+ * @see DimensionHandlerUtils#createColumnSelectorPluses which accepts {@link ColumnSelectorStrategyFactory} and is
+ * similar to {@link ColumnProcessors#makeProcessor}
+ */
+public interface ColumnProcessorFactory<T>
+{
+  /**
+   * This default type will be used when the underlying column has an unknown type.
+   */
+  ValueType defaultType();

Review comment: It's meant to be the preferred type that the processor wants to deal with in situations where there is no type information for the underlying column. It should usually be related to whatever the processor wants to _do_ with the data. The idea is that you would return `STRING` if you prefer to deal with strings, `DOUBLE` (or `LONG`) if you prefer to deal with numbers, etc. Does that make sense / sound reasonable?
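As a rough illustration of the contract described in that comment, here is a toy factory that prefers to treat untyped columns as strings. The `ValueType` enum and the factory interface below are simplified stand-ins for Druid's real classes, and `UpperCaseFactory` is a hypothetical example, not code from the PR:

```java
import java.util.function.Supplier;

public class DefaultTypeSketch {
    // Simplified stand-in for org.apache.druid.segment.column.ValueType.
    enum ValueType { STRING, LONG, DOUBLE }

    // Simplified stand-in for Druid's ColumnProcessorFactory<T>: one make-method
    // per column type, plus the preferred type for untyped columns.
    interface ColumnProcessorFactory<T> {
        ValueType defaultType();
        T makeStringProcessor(Supplier<String> selector);
        T makeLongProcessor(Supplier<Long> selector);
    }

    // A factory whose processors do string work, so it asks for STRING
    // when the underlying column's type is unknown.
    static class UpperCaseFactory implements ColumnProcessorFactory<Supplier<String>> {
        @Override
        public ValueType defaultType() {
            return ValueType.STRING;
        }

        @Override
        public Supplier<String> makeStringProcessor(Supplier<String> selector) {
            return () -> selector.get().toUpperCase();
        }

        @Override
        public Supplier<String> makeLongProcessor(Supplier<Long> selector) {
            return () -> Long.toString(selector.get());
        }
    }

    public static void main(String[] args) {
        ColumnProcessorFactory<Supplier<String>> factory = new UpperCaseFactory();
        if (factory.defaultType() != ValueType.STRING) throw new AssertionError();
        Supplier<String> processor = factory.makeStringProcessor(() -> "druid");
        if (!"DRUID".equals(processor.get())) throw new AssertionError();
    }
}
```

The point of `defaultType()` in this sketch: the caller (in Druid, `ColumnProcessors#makeProcessor`) picks which `make*Processor` method to invoke based on the column's type, and falls back to the factory's preferred type when the column has none.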