[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-05-24 Thread Joseph Witt (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023456#comment-16023456
 ] 

Joseph Witt commented on NIFI-3644:
---

[~bjorn.ols...@gmail.com] and [~bbende] this is a really cool addition.  Nice 
work and thanks!

> Add DetectDuplicateUsingHBase processor
> ---
>
> Key: NIFI-3644
> URL: https://issues.apache.org/jira/browse/NIFI-3644
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Bjorn Olsen
> Assignee: Bryan Bende
> Priority: Minor
> Fix For: 1.3.0
>
>
> The DetectDuplicate processor makes use of a distributed map cache for 
> maintaining a list of unique file identifiers (such as hashes).
> The distributed map cache functionality could be provided by an HBase table, 
> which then allows for reliably storing a huge volume of file identifiers and 
> auditing information. The downside of this approach is of course that HBase 
> is required.
> Storing the unique file identifiers in a reliable, query-able manner along 
> with some audit information is of benefit to several use cases.
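
The duplicate check described in the issue can be sketched roughly as follows. This is only an illustration of the "record the identifier if absent, otherwise report a duplicate" idea; the class and method names are hypothetical, and an in-memory ConcurrentHashMap stands in for the distributed map cache or HBase table, so it shows neither NiFi's DetectDuplicate code nor any HBase call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch: duplicate detection on top of a put-if-absent cache. */
public class DuplicateCheckSketch {

    // Stand-in for the cache; in the issue this would be a distributed
    // map cache or an HBase table keyed by the file identifier.
    private final ConcurrentMap<String, String> seen = new ConcurrentHashMap<>();

    /**
     * Records the identifier with its audit information if it is new.
     * Returns true when the identifier was already present (a duplicate).
     */
    public boolean isDuplicate(String identifier, String auditInfo) {
        return seen.putIfAbsent(identifier, auditInfo) != null;
    }

    /** Returns the audit information stored by the first sighting, or null. */
    public String auditInfoFor(String identifier) {
        return seen.get(identifier);
    }

    public static void main(String[] args) {
        DuplicateCheckSketch cache = new DuplicateCheckSketch();
        System.out.println(cache.isDuplicate("hash-1", "first seen by node A"));  // false
        System.out.println(cache.isDuplicate("hash-1", "second seen by node B")); // true
        System.out.println(cache.auditInfoFor("hash-1")); // first seen by node A
    }
}
```

Because the first write wins, the stored value can double as the audit record for when and where an identifier was first seen.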



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023451#comment-16023451
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/1645




[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-05-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023450#comment-16023450
 ] 

ASF subversion and git services commented on NIFI-3644:
---

Commit ae3db823037ef01f8dc123e494f1d9e6522f29fe in nifi's branch 
refs/heads/master from [~bbende]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=ae3db82 ]

NIFI-3644 Fixing the result handler in HBase_1_1_2_ClientMapCacheService to use 
the offsets for the value bytes

This closes #1645.

Signed-off-by: Bryan Bende 




[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023328#comment-16023328
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user bbende commented on the issue:

https://github.com/apache/nifi/pull/1645
  
Sorry for taking so long to get back to this...

I tested this using PutDistributedMapCache and FetchDistributedMapCache, 
and noticed the value coming back from fetch wasn't exactly what I had stored. 

In HBaseRowHandler we had:
`lastResultBytes = resultCell.getValueArray()`

And we need:
`lastResultBytes = Arrays.copyOfRange(resultCell.getValueArray(), resultCell.getValueOffset(), resultCell.getValueLength() + resultCell.getValueOffset());`
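
To illustrate why the offsets matter, here is a minimal, self-contained sketch. It assumes the cell exposes its value as a slice (offset plus length) of a larger backing array, which is what the fix and its commit message suggest. FakeCell and copyValue are hypothetical stand-ins written for this example; they are not NiFi's ResultCell or HBaseRowHandler.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch: reading a value that is a slice of a shared array. */
public class ValueOffsetDemo {

    /** Stand-in for a result cell whose value lives inside a larger array. */
    public static final class FakeCell {
        private final byte[] backing;
        private final int valueOffset;
        private final int valueLength;

        public FakeCell(byte[] backing, int valueOffset, int valueLength) {
            this.backing = backing;
            this.valueOffset = valueOffset;
            this.valueLength = valueLength;
        }

        public byte[] getValueArray() { return backing; }
        public int getValueOffset() { return valueOffset; }
        public int getValueLength() { return valueLength; }
    }

    /** Copies only the value bytes, from offset to offset + length. */
    public static byte[] copyValue(FakeCell cell) {
        int start = cell.getValueOffset();
        return Arrays.copyOfRange(cell.getValueArray(), start, start + cell.getValueLength());
    }

    public static void main(String[] args) {
        // The value "stored-value" sits after 11 bytes of other cell data.
        byte[] backing = "rowkey|f|q|stored-value".getBytes(StandardCharsets.UTF_8);
        FakeCell cell = new FakeCell(backing, 11, 12);

        // Using the whole array returns more than was stored.
        System.out.println(new String(cell.getValueArray(), StandardCharsets.UTF_8)); // rowkey|f|q|stored-value
        // Respecting offset and length returns exactly the stored value.
        System.out.println(new String(copyValue(cell), StandardCharsets.UTF_8));      // stored-value
    }
}
```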

I made a commit here that includes the change:

https://github.com/bbende/nifi/commit/dc8f14d95d6cdbab2aa6e815269fe0d98faa2fe6

I also moved MockHBaseClientService into its own class so it can be used 
by both tests, so that we don't have to duplicate that code.

Everything else looks good so I will go ahead and merge these changes 
together (your commit then mine). 

Thanks again for contributing, and sorry for the delay.





[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022295#comment-16022295
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user joewitt commented on the issue:

https://github.com/apache/nifi/pull/1645
  
This looks like it could be pretty helpful!  I wonder if in light of the 
recent LookupService work we should consider exposing/using this via that 
interface instead of or in addition to this distributed cache one.  Thoughts?




[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962767#comment-15962767
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user baolsen commented on the issue:

https://github.com/apache/nifi/pull/1645
  
@bbende 
Ready for another review! I've updated per your comments and added some 
unit tests.




[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-06 Thread Joseph Witt (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959732#comment-15959732
 ] 

Joseph Witt commented on NIFI-3644:
---

[~bjorn.ols...@gmail.com] Very cool that you've taken the feedback and done so 
much with it!  Thanks also to [~bbende] for reviewing and helping Bjorn make 
this happen.  I suspect this will be a very popular feature!



[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959729#comment-15959729
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user bbende commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r110264255
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.hbase;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.nifi.annotation.documentation.CapabilityDescription;
+import org.apache.nifi.annotation.documentation.SeeAlso;
+import org.apache.nifi.annotation.documentation.Tags;
+import org.apache.nifi.annotation.lifecycle.OnEnabled;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.controller.AbstractControllerService;
+import org.apache.nifi.controller.ConfigurationContext;
+
+import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
+import org.apache.nifi.distributed.cache.client.Serializer;
+import org.apache.nifi.distributed.cache.client.Deserializer;
+import java.io.ByteArrayOutputStream;
+import org.apache.nifi.reporting.InitializationException;
+
+import java.nio.charset.StandardCharsets;
+import org.apache.nifi.hbase.scan.ResultCell;
+import org.apache.nifi.hbase.scan.ResultHandler;
+import org.apache.nifi.hbase.scan.Column;
+import org.apache.nifi.hbase.put.PutColumn;
+
+
+import org.apache.nifi.processor.util.StandardValidators;
+
+@Tags({"distributed", "cache", "state", "map", "cluster","hbase"})
+@SeeAlso(classNames = {"org.apache.nifi.distributed.cache.server.map.DistributedMapCacheClient", "org.apache.nifi.hbase.HBase_1_1_2_ClientService"})
+@CapabilityDescription("Provides the ability to use an HBase table as a cache, in place of a DistributedMapCache."
+    + " Uses a HBase_1_1_2_ClientService controller to communicate with HBase.")
+
+public class HBase_1_1_2_ClientMapCacheService extends AbstractControllerService implements DistributedMapCacheClient {
+
+    static final PropertyDescriptor HBASE_CLIENT_SERVICE = new PropertyDescriptor.Builder()
+            .name("HBase Client Service")
+            .description("Specifies the HBase Client Controller Service to use for accessing HBase.")
+            .required(true)
+            .identifiesControllerService(HBaseClientService.class)
+            .build();
+
+    public static final PropertyDescriptor HBASE_CACHE_TABLE_NAME = new PropertyDescriptor.Builder()
+            .name("HBase Cache Table Name")
+            .description("Name of the table on HBase to use for the cache.")
+            .required(true)
+            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+            .build();
+
+    public static final PropertyDescriptor HBASE_COLUMN_FAMILY = new PropertyDescriptor.Builder()
+            .name("HBase Column Family")
+            .description("Name of the column family on HBase to use for the cache.")
+            .required(true)
+            .defaultValue("f")
+            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+            .build();
+
+    public static final PropertyDescriptor HBASE_COLUMN_QUALIFIER = new PropertyDescriptor.Builder()
+            .name("HBase Column Qualifier")
+            .description("Name of the column qualifier on HBase to use for the cache")
+            .defaultValue("q")
+            .required(true)
+            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+            .build();
+
+@Override
+protected List getSupportedPropertyDescriptors() {
+final List descriptors = new ArrayList

[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959367#comment-15959367
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user baolsen commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r110219758
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---

[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959086#comment-15959086
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user baolsen commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r110191840
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---

[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955646#comment-15955646
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user bbende commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r109728311
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---
@@ -0,0 +1,224 @@
+public class HBase_1_1_2_ClientMapCacheService extends AbstractControllerService implements DistributedMapCacheClient {
+
+    static final PropertyDescriptor HBASE_CLIENT_SERVICE = new PropertyDescriptor.Builder()
+            .name("HBase Client Service")
+            .description("Specifies the HBase Client Controller Service to use for accessing HBase.")
+            .required(true)
+            .identifiesControllerService(HBaseClientService.class)
+            .build();
+
+    public static final PropertyDescriptor HBASE_CACHE_TABLE_NAME = new PropertyDescriptor.Builder()
--- End diff --

For the table, column family, and column qualifier, you may want to support 
expression language. There are obviously no flow files in this case, but if 
you set expressionLanguageSupported(true) on the property descriptors and then 
call .evaluateAttributeExpressions() when you get the values, this would let 
someone reference an environment variable if they want to specify a different 
table across environments.
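
As a rough illustration of the effect described above, the sketch below resolves `${name}` references in a configured property value against a map of variables, so the same configuration yields a different table name per environment. It is a toy stand-in and not NiFi's Expression Language: the class name, the `cache.table` variable, and the choice to resolve unknown names to an empty string are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: resolving ${name} references in a property value. */
public class TableNameResolver {

    // Matches references of the form ${name}.
    private static final Pattern REFERENCE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its variable value, or "" if undefined. */
    public static String resolve(String rawValue, Map<String, String> variables) {
        Matcher matcher = REFERENCE.matcher(rawValue);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String replacement = variables.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in values from being treated as group references.
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        String configured = "${cache.table}"; // same property value in every environment
        Map<String, String> dev = Collections.singletonMap("cache.table", "dedupe_dev");
        Map<String, String> prod = Collections.singletonMap("cache.table", "dedupe_prod");

        System.out.println(resolve(configured, dev));  // dedupe_dev
        System.out.println(resolve(configured, prod)); // dedupe_prod
    }
}
```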



[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955645#comment-15955645
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user bbende commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r109752585
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.hbase;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.nifi.annotation.documentation.CapabilityDescription;
+import org.apache.nifi.annotation.documentation.SeeAlso;
+import org.apache.nifi.annotation.documentation.Tags;
+import org.apache.nifi.annotation.lifecycle.OnEnabled;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.controller.AbstractControllerService;
+import org.apache.nifi.controller.ConfigurationContext;
+
+import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
+import org.apache.nifi.distributed.cache.client.Serializer;
+import org.apache.nifi.distributed.cache.client.Deserializer;
+import java.io.ByteArrayOutputStream;
+import org.apache.nifi.reporting.InitializationException;
+
+import java.nio.charset.StandardCharsets;
+import org.apache.nifi.hbase.scan.ResultCell;
+import org.apache.nifi.hbase.scan.ResultHandler;
+import org.apache.nifi.hbase.scan.Column;
+import org.apache.nifi.hbase.put.PutColumn;
+
+
+import org.apache.nifi.processor.util.StandardValidators;
+
+@Tags({"distributed", "cache", "state", "map", "cluster","hbase"})
+@SeeAlso(classNames = {"org.apache.nifi.distributed.cache.server.map.DistributedMapCacheClient", "org.apache.nifi.hbase.HBase_1_1_2_ClientService"})
+@CapabilityDescription("Provides the ability to use an HBase table as a cache, in place of a DistributedMapCache."
++ " Uses a HBase_1_1_2_ClientService controller to communicate with HBase.")
+
+public class HBase_1_1_2_ClientMapCacheService extends AbstractControllerService implements DistributedMapCacheClient {
+
+static final PropertyDescriptor HBASE_CLIENT_SERVICE = new PropertyDescriptor.Builder()
+.name("HBase Client Service")
+.description("Specifies the HBase Client Controller Service to use for accessing HBase.")
+.required(true)
+.identifiesControllerService(HBaseClientService.class)
+.build();
+
+public static final PropertyDescriptor HBASE_CACHE_TABLE_NAME = new PropertyDescriptor.Builder()
+.name("HBase Cache Table Name")
+.description("Name of the table on HBase to use for the cache.")
+.required(true)
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+.build();
+
+public static final PropertyDescriptor HBASE_COLUMN_FAMILY = new PropertyDescriptor.Builder()
+.name("HBase Column Family")
+.description("Name of the column family on HBase to use for the cache.")
+.required(true)
+.defaultValue("f")
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+.build();
+
+public static final PropertyDescriptor HBASE_COLUMN_QUALIFIER = new PropertyDescriptor.Builder()
+.name("HBase Column Qualifier")
+.description("Name of the column qualifier on HBase to use for the cache")
+.defaultValue("q")
+.required(true)
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+.build();
+
+@Override
+protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
+final List<PropertyDescriptor> descriptors = new ArrayList

[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955647#comment-15955647
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user bbende commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r109728428
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---
@@ -0,0 +1,224 @@

[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955644#comment-15955644
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user bbende commented on a diff in the pull request:

https://github.com/apache/nifi/pull/1645#discussion_r109727737
  
--- Diff: 
nifi-nar-bundles/nifi-standard-services/nifi-hbase_1_1_2-client-service-bundle/nifi-hbase_1_1_2-client-service/src/main/java/org/apache/nifi/hbase/HBase_1_1_2_ClientMapCacheService.java
 ---
@@ -0,0 +1,224 @@

[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954708#comment-15954708
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

Github user baolsen commented on the issue:

https://github.com/apache/nifi/pull/1645
  
Hi @bbende, please take a look at this PR.

I've added HBase_1_1_2_ClientMapCacheService, a controller service that uses 
the HBase_1_1_2_ClientService to store a cache of values on HBase. 

It can be used in the DetectDuplicate processor (and in other processors as 
well) in place of a DistributedMapCache.

The Travis build is passing 4 of 5 jobs; I'm not sure why one of the languages 
would fail.
The AppVeyor build is failing on a specific test, 
"TestListFile.testAttributesSet", which I don't think is mine.

Let me know what you think.
Thanks!
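
For readers following along, here is a minimal sketch of how a key/value cache 
can be laid over a table addressed by row key, column family and column 
qualifier. It is not the code from the PR. The names ValueSerializer, 
FakeTable and TableBackedCache are hypothetical, the table is an in-memory 
stand-in rather than an HBase client, and using the serialized cache key as 
the row key is an assumption. The "f" and "q" values are the property 
defaults shown in the diff.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical serializer contract: write a value's bytes to a stream.
interface ValueSerializer<T> {
    void serialize(T value, OutputStream out) throws IOException;
}

// In-memory stand-in for a single table: row key -> ("family:qualifier" -> cell).
// It only mimics row/column addressing; it is not an HBase client.
class FakeTable {
    private final Map<ByteBuffer, Map<String, byte[]>> rows = new HashMap<>();

    void put(byte[] rowKey, String family, String qualifier, byte[] value) {
        rows.computeIfAbsent(ByteBuffer.wrap(rowKey.clone()), k -> new HashMap<>())
            .put(family + ":" + qualifier, value.clone());
    }

    byte[] get(byte[] rowKey, String family, String qualifier) {
        Map<String, byte[]> row = rows.get(ByteBuffer.wrap(rowKey));
        return row == null ? null : row.get(family + ":" + qualifier);
    }
}

// Cache facade (assumption): serialized key = row key, serialized value = cell.
class TableBackedCache {
    private final FakeTable table;
    private final String family;
    private final String qualifier;

    TableBackedCache(FakeTable table, String family, String qualifier) {
        this.table = table;
        this.family = family;
        this.qualifier = qualifier;
    }

    <K, V> void put(K key, V value, ValueSerializer<K> keySer, ValueSerializer<V> valueSer) {
        table.put(toBytes(key, keySer), family, qualifier, toBytes(value, valueSer));
    }

    <K> boolean containsKey(K key, ValueSerializer<K> keySer) {
        return getRaw(key, keySer) != null;
    }

    // Returns the stored cell bytes, or null if the row or column is absent.
    <K> byte[] getRaw(K key, ValueSerializer<K> keySer) {
        return table.get(toBytes(key, keySer), family, qualifier);
    }

    // Serializes through a ByteArrayOutputStream and returns the collected bytes.
    private static <T> byte[] toBytes(T value, ValueSerializer<T> serializer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            serializer.serialize(value, buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}

class TableBackedCacheDemo {
    public static void main(String[] args) {
        ValueSerializer<String> utf8 = (s, out) -> out.write(s.getBytes(StandardCharsets.UTF_8));
        TableBackedCache cache = new TableBackedCache(new FakeTable(), "f", "q");
        cache.put("hash-1", "first.txt", utf8, utf8);
        System.out.println(cache.containsKey("hash-1", utf8)); // true
        System.out.println(cache.containsKey("hash-2", utf8)); // false
    }
}
```

Keeping the table behind a small facade like this means the cache logic never 
needs to know whether the rows live in memory or in HBase.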


> Add DetectDuplicateUsingHBase processor
> ---
>
> Key: NIFI-3644
> URL: https://issues.apache.org/jira/browse/NIFI-3644
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Bjorn Olsen
> Priority: Minor
>
> The DetectDuplicate processor makes use of a distributed map cache for 
> maintaining a list of unique file identifiers (such as hashes).
> The distributed map cache functionality could be provided by an HBase table, 
> which then allows for reliably storing a huge volume of file identifiers and 
> auditing information. The downside of this approach is of course that HBase 
> is required.
> Storing the unique file identifiers in a reliable, query-able manner along 
> with some audit information is of benefit to several use cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-04-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952671#comment-15952671
 ] 

ASF GitHub Bot commented on NIFI-3644:
--

GitHub user baolsen opened a pull request:

https://github.com/apache/nifi/pull/1645

NIFI-3644 - Added HBase_1_1_2_ClientMapCacheService

Added HBase_1_1_2_ClientMapCacheService which implements 
DistributedMapCacheClient.
The DetectDuplicate processor can now make use of 
HBase_1_1_2_ClientMapCacheService for storing the duplicate cache on HBase.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/baolsen/nifi DistributedMapCacheHBaseClientService

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/1645.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1645


commit 8c0285b5efb6afd1607bb050650b758fed7d06e3
Author: baolsen 
Date:   2017-03-23T12:35:43Z

Update HBaseClientService.java

Added "get" function call for doing single row lookup on HBase (HBase get)

commit 03d1b36376c6954d8bdcf4056314fced0cf0d1fc
Author: baolsen 
Date:   2017-03-23T13:20:41Z

Update HBase_1_1_2_ClientService.java

Implemented "get" function for retrieval of single HBase rows.

commit 6dbca10e82b3b6b8ac94f8f0152b8fff85008082
Author: baolsen 
Date:   2017-03-23T13:33:15Z

Update HBase_1_1_2_ClientService.java

commit df30a22a3ba71fedfe1dffedefcc0eb64c3670b0
Author: baolsen 
Date:   2017-03-23T13:40:08Z

Update HBase_1_1_2_ClientService.java

commit 6d8036cc03ef49e41b92dbb5fa7e0de41cc15c3d
Author: baolsen 
Date:   2017-03-23T13:44:12Z

Update MockHBaseClientService.java

Implemented "get" function with UnsupportedException

commit 4bcb26fd6a99a23852097f4f3db02cbeb6b8a3b5
Author: baolsen 
Date:   2017-03-23T13:46:23Z

Update HBase_1_1_2_ClientService.java

commit 4b266d9d1d112e2bf8aa198f87253d17c055dbbc
Author: baolsen 
Date:   2017-03-23T13:50:09Z

Update MockHBaseClientService.java

commit 2ef850bc7c2bce5f9dd35fc9ce5cf08c7ecf07c4
Author: baolsen 
Date:   2017-03-29T08:51:11Z

Test

commit e802f147bcd19664b9053e240ec1476ff7a61e7b
Author: baolsen 
Date:   2017-03-29T08:52:35Z

Test

commit 4cabff26658090c08d813e74d27894a9fd684c57
Author: baolsen 
Date:   2017-03-31T07:59:50Z

Completed initial development of HBase_1_1_2_ClientMapCacheService.java 
which is compatible with DetectDuplicate (and other processors)
Still need to implement value deletion

commit 7790d3f5a8d56f0801d40ad2c836a8db7c123e1b
Author: baolsen 
Date:   2017-03-31T08:31:06Z

Undid changes to files for an earlier attempt at this

commit 594dc059cdbe708f10849c794b826d24e83e787d
Author: baolsen 
Date:   2017-03-31T08:33:47Z

Undid changes to files for an earlier attempt at this

commit fbd3034e736ecdd1d721cc788e5c984eee6560c7
Author: baolsen 
Date:   2017-04-02T13:01:21Z

Added remove() for cache and Documentation






[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-03-24 Thread Bjorn Olsen (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940467#comment-15940467
 ] 

Bjorn Olsen commented on NIFI-3644:
---

Hi Joe

Thanks for the suggestion. I hadn't considered writing an HBase version of 
DistributedMapCache. 

I've already written my own DetectDuplicateUsingHBase processor today, as I 
needed something that was quick to develop.

The code is here, with much copy-pasta from DetectDuplicate:
https://github.com/baolsen/nifi/blob/DetectDuplicateUsingHBase/nifi-nar-bundles/nifi-hbase-bundle/nifi-hbase-processors/src/main/java/org/apache/nifi/hbase/DetectDuplicateUsingHBase.java

It seems that implementing an HBase-based DistributedMapCache is more complex, 
but more reusable. 
Do you have any suggestions for documentation for this sort of thing?

Lastly, do you think it is worth including DetectDuplicateUsingHBase, or would 
it be better to wait for a more reusable option?

I'm a bit tight on time, and Java and NiFi are both new to me.
Meanwhile I can keep DetectDuplicateUsingHBase for my own use, so no worries 
there.



[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-03-24 Thread Joseph Witt (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940247#comment-15940247
 ] 

Joseph Witt commented on NIFI-3644:
---

Bjorn,

We can add you to the contributors list in JIRA so that you can assign items to 
yourself.  In the meantime, you can definitely contribute and work on tasks 
without this.  For this concept, please note that you should only need to 
create an HBase-backed implementation of the DistributedMapCache rather than 
a new processor.  By design, DetectDuplicate can use any implementation of 
that interface.

Thanks
Joe
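
The design point above can be sketched in a few lines: the duplicate check 
depends only on a cache interface, so the backing store can be swapped 
without touching the caller. This is an illustration under stated 
assumptions, not NiFi code. SimpleMapCache, InMemoryMapCache and 
DuplicateChecker are hypothetical names, and the real interface takes 
serializers and has more methods than shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for a map-cache interface (not NiFi's API).
interface SimpleMapCache {
    // Stores value under key only if the key is absent; returns true if it was stored.
    boolean putIfAbsent(String key, String value);
}

// One possible backing store: a local in-memory map.
// An HBase-backed class would implement the same interface.
class InMemoryMapCache implements SimpleMapCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public boolean putIfAbsent(String key, String value) {
        return entries.putIfAbsent(key, value) == null;
    }
}

// Depends only on the interface, so any implementation can be plugged in.
class DuplicateChecker {
    private final SimpleMapCache cache;

    DuplicateChecker(SimpleMapCache cache) {
        this.cache = cache;
    }

    // Returns true if the identifier has been seen before.
    boolean isDuplicate(String identifier, String description) {
        return !cache.putIfAbsent(identifier, description);
    }
}

class DuplicateCheckerDemo {
    public static void main(String[] args) {
        DuplicateChecker checker = new DuplicateChecker(new InMemoryMapCache());
        System.out.println(checker.isDuplicate("hash-1", "first.txt"));  // false
        System.out.println(checker.isDuplicate("hash-1", "second.txt")); // true
    }
}
```

Swapping InMemoryMapCache for another implementation changes only the 
constructor argument passed to DuplicateChecker.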



[jira] [Commented] (NIFI-3644) Add DetectDuplicateUsingHBase processor

2017-03-23 Thread Bjorn Olsen (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939859#comment-15939859
 ] 

Bjorn Olsen commented on NIFI-3644:
---

Please assign to me
