[GitHub] storm pull request: STORM-1219: Fix HDFS and Hive bolt flush/ackin...

2015-12-16 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/893#issuecomment-165168520
  
@harshach I don't believe flushing would force early file rotations.  On 
second thought, I do agree that 15 seconds would be better than 1 second.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-1219: Fix HDFS and Hive bolt flush/ackin...

2015-12-16 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/893#issuecomment-165169501
  
@harshach Ah, thanks, I was only thinking of the hdfs bolt side.




[GitHub] storm pull request: [STORM-1408] clean up the build directory crea...

2015-12-22 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/964#issuecomment-166681368
  
+1




[GitHub] storm pull request: [STORM-1419] Solr bolt should handle tick tupl...

2016-01-05 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/977#discussion_r48859713
  
--- Diff: 
external/storm-solr/src/main/java/org/apache/storm/solr/bolt/SolrUpdateBolt.java
 ---
@@ -92,11 +94,19 @@ private void ack(Tuple tuple) throws SolrServerException, IOException {
 if (commitStgy == null) {
 collector.ack(tuple);
 } else {
-toCommitTuples.add(tuple);
-commitStgy.update();
-if (commitStgy.commit()) {
+if (TupleUtils.isTick(tuple)) {
+LOG.debug("TICK! forcing solr client commit");
+collector.ack(tuple);
--- End diff --

Should tick tuples be ack'd?  I believe that is an open question in another 
PR.




[GitHub] storm pull request: [STORM-1419] Solr bolt should handle tick tupl...

2016-01-05 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/977#discussion_r48859729
  
--- Diff: 
external/storm-solr/src/main/java/org/apache/storm/solr/bolt/SolrUpdateBolt.java
 ---
@@ -92,11 +94,19 @@ private void ack(Tuple tuple) throws SolrServerException, IOException {
 if (commitStgy == null) {
 collector.ack(tuple);
 } else {
-toCommitTuples.add(tuple);
-commitStgy.update();
-if (commitStgy.commit()) {
+if (TupleUtils.isTick(tuple)) {
+LOG.debug("TICK! forcing solr client commit");
+collector.ack(tuple);
+commitStgy.commit();
 solrClient.commit(solrMapper.getCollection());
 ackCommittedTuples();
+} else {
+toCommitTuples.add(tuple);
+commitStgy.update();
+if (commitStgy.commit()) {
--- End diff --

The strategy in AbstractHdfsBolt is to set a boolean when a tick tuple arrives 
and then sync if that flag is true or if other conditions dictate a sync.
(https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java#L154)

The benefit of that approach is that it eliminates duplicate code (i.e. the 
duplicated calls to ackCommittedTuples() and solrClient.commit()), which I 
think is substantial.

Here that would look something like:

```java
if (forceCommit || commitStgy.commit()) {
  solrClient.commit(solrMapper.getCollection());
  ackCommittedTuples();
}
```
With the duplicate code removed I would be +1.

A unit test would also be helpful; an HdfsBolt example is here: 
https://github.com/apache/storm/blob/master/external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/TestHdfsBolt.java#L175
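The boolean-flag strategy above can be shown as a self-contained toy sketch. All names here are hypothetical: a counting strategy stands in for the real CommitStrategy, and simple counters stand in for solrClient.commit() and ackCommittedTuples().

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the single-commit-path idea: a tick tuple sets
// forceCommit, regular tuples update the strategy, and both cases share one
// commit/ack block instead of duplicating it.
public class CommitPathSketch {
    // Stub strategy: signal a commit after every 3 queued tuples.
    static class CountingStrategy {
        private int count = 0;
        void update() { count++; }
        boolean commit() { return count >= 3; }
        void reset() { count = 0; }
    }

    private final CountingStrategy commitStgy = new CountingStrategy();
    final List<String> pending = new ArrayList<>(); // tuples awaiting ack
    int commits = 0;                                // committed batch count

    void process(String tuple, boolean isTick) {
        boolean forceCommit = isTick;   // tick tuples force an early commit
        if (!isTick) {
            pending.add(tuple);
            commitStgy.update();
        }
        // One shared commit path -- no duplicated commit/ack code.
        if (forceCommit || commitStgy.commit()) {
            commits++;        // stands in for solrClient.commit(collection)
            pending.clear();  // stands in for ackCommittedTuples()
            commitStgy.reset();
        }
    }

    public static void main(String[] args) {
        CommitPathSketch bolt = new CommitPathSketch();
        bolt.process("a", false);
        bolt.process("b", false);
        bolt.process("tick", true); // forces commit of the 2 pending tuples
        System.out.println(bolt.commits + " committed batches, "
                + bolt.pending.size() + " pending");
    }
}
```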




[GitHub] storm pull request: [STORM-468] java.io.NotSerializableException s...

2016-01-05 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/477#issuecomment-169046382
  
Would it be cleaner to just catch NotSerializableException directly and 
rethrow an IllegalStateException?
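As a hedged illustration of that suggestion (the helper name and message are hypothetical, not Storm's actual serializability check), the checked exception could be caught and translated like this:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

// Sketch: serialize with java.io and translate NotSerializableException
// into an unchecked IllegalStateException with a clearer message.
public class SerializationCheck {
    public static byte[] trySerialize(Object o) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
            oos.flush();
            return bos.toByteArray();
        } catch (NotSerializableException e) {
            // Catch the specific failure first and rethrow it unchecked.
            throw new IllegalStateException(
                    "Object is not serializable: " + e.getMessage(), e);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trySerialize("hello").length + " bytes");
        try {
            trySerialize(new Object()); // java.lang.Object is not Serializable
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note that the NotSerializableException catch must come before the IOException catch, since it is a subclass.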




[GitHub] storm pull request: AvroGenericRecordBolt instead of SequenceFileB...

2016-01-05 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/989#issuecomment-169050930
  
I would be +1 on this change if the PR were reopened.  Thank you for 
catching my sloppy error.




[GitHub] storm pull request: [STORM-1419] Solr bolt should handle tick tupl...

2016-01-05 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/977#issuecomment-169147171
  
I am not familiar with all the ways that bolts can receive tick tuples, but 
does SolrBolt also need a way to specify that it should receive tick tuples at 
all?  The hdfs-bolt way of doing that is here:


https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java#L221
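A minimal sketch of that per-bolt configuration hook follows. The class is a standalone stand-in, not a real Storm bolt; in a real bolt the getComponentConfiguration() override comes from BaseRichBolt, and the map key is Storm's Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS. The builder method name is modeled on the HDFS bolt's withTickTupleIntervalSeconds.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a bolt sketching how a component can request tick tuples for
// itself only, rather than forcing them on the whole topology.
public class TickConfigSketch {
    private Integer tickTupleInterval = null; // seconds; null = no ticks

    public TickConfigSketch withTickTupleIntervalSeconds(int interval) {
        this.tickTupleInterval = interval;
        return this;
    }

    // Storm consults this when building the topology; returning the tick
    // frequency here enables ticks for this component only.
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        if (tickTupleInterval != null) {
            conf.put("topology.tick.tuple.freq.secs", tickTupleInterval);
        }
        return conf;
    }

    public static void main(String[] args) {
        TickConfigSketch bolt =
                new TickConfigSketch().withTickTupleIntervalSeconds(15);
        System.out.println(bolt.getComponentConfiguration());
    }
}
```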




[GitHub] storm pull request: [STORM-1419] Solr bolt should handle tick tupl...

2016-01-06 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/977#issuecomment-169330085
  
@vesense Thanks for the info about the global configuration. Aside from the 
issues @HeartSaVioR brings up, I think adding a bolt-level option for tick 
tuples to the solr bolt is necessary. Without that, the only way for a user to 
enable ticks for the solr bolt is to enable them for the entire topology.  
That's very different from the other output bolts (hdfs, hive) that support 
ticks.




[GitHub] storm pull request: STORM-1494: Support multiple file outputs

2016-01-25 Thread dossett
GitHub user dossett opened a pull request:

https://github.com/apache/storm/pull/1044

STORM-1494: Support multiple file outputs

This PR enables storm-hdfs to write to multiple files in a couple of 
different circumstances:

- If a bolt requires multiple output files.  For example, the avro bolt 
needs to write Avro records with different schemas to different files.  
Schema evolution is a common use of Avro, so I expect this to be a common use 
case.

- Partitioning the output of the HDFS bolt.  Based on mailing list 
discussion, there is demand for this feature.

It does introduce a couple of incompatible changes.  The most obvious is 
adding a method to the FileRotationPolicy interface.  I tried to minimize 
other API changes, at the expense of some less-than-elegant code in a couple 
of places.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dossett/storm STORM-1494

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1044.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1044


commit 4e16d7a60771155a9c294e43c7c0db7bcebed646
Author: Aaron Dossett 
Date:   2016-01-25T23:12:57Z

STORM-1494: Support multiple file outputs






[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-01-25 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-174728333
  
Title updated to reflect correct JIRA.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-27 Thread dossett
GitHub user dossett opened a pull request:

https://github.com/apache/storm/pull/1052

STORM-1504: Add Serializer and instruction for AvroGenericRecordBolt

This was new to me, so I have some questions:

- Would it be better to automatically register this serializer instead of 
providing developer instructions?
- What's the best practice for exception handling in a serializer? Throwing 
a RuntimeException seemed like the best option.

Provided this PR is accepted, I would also vote for backporting it to 
1.0.0, since the AvroGenericRecordBolt is unusable without it in a 
multi-worker topology.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dossett/storm STORM-1504

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1052


commit eda0bfa07abe07d2b77a42e3f9496ac55959c909
Author: Aaron Dossett 
Date:   2016-01-27T16:49:19Z

STORM-1504: Add Serializer and instruction for AvroGenericRecordBolt






[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-27 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51038995
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/AvroGenericSerializer.java
 ---
@@ -0,0 +1,70 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.common;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.Serializer;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericContainer;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.generic.GenericDatumWriter;
+import org.apache.avro.io.BinaryEncoder;
+import org.apache.avro.io.Decoder;
+import org.apache.avro.io.DecoderFactory;
+import org.apache.avro.io.EncoderFactory;
+
+import java.io.IOException;
+
+//Generously adapted from:
+//https://github.com/kijiproject/kiji-express/blob/master/kiji-express/src/main/scala/org/kiji/express/flow/framework/serialization/AvroSerializer.scala
+//Which has an ASL 2.0 license
+public class AvroGenericSerializer extends Serializer {
+@Override
+public void write(Kryo kryo, Output output, GenericContainer record) {
+output.writeString(record.getSchema().toString());
+GenericDatumWriter writer = new GenericDatumWriter<>(record.getSchema());
+
+BinaryEncoder encoder = EncoderFactory
+.get()
+.directBinaryEncoder(output, null);
+try {
+writer.write(record, encoder);
+} catch (IOException e) {
+throw new RuntimeException(e);
+}
+}
+
+@Override
+public GenericContainer read(Kryo kryo, Input input, Class aClass) {
+Schema theSchema = new Schema.Parser().parse(input.readString());
--- End diff --

Thanks, do you have suggestions for a different approach?




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-27 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51063276
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/AvroGenericSerializer.java
 ---
@@ -0,0 +1,70 @@
(same AvroGenericSerializer diff as quoted in full in the earlier comment above)
--- End diff --

Thanks for the thoughtful comments @revans2. I think I understand what 
you're describing but want to make sure.

- A generic registry would just treat the schema as the key and vice versa, 
so it's always passed around.

```java
class GenericAvroSchemaRegistry implements AvroSchemaRegistry {
    public String getKey(Schema schema) {
        return schema.toString();
    }

    public Schema getSchema(String key) {
        return new Schema.Parser().parse(key);
    }
}
```

- A jar-based registry approach could be used for schemas that are known in 
advance and worth persisting across the entire topology, falling back on the 
generic approach above if an unknown schema is used.
- You might choose to implement your own registry which relies on a 
proprietary registry that is truly external to Storm.
- At Target we could implement our own, since we also use an external schema 
registry, albeit one that is already open sourced.

Do I have the gist of it?
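The jar-based registry with generic fallback described in the bullets above could be sketched as follows. A plain String stands in for org.apache.avro.Schema so the example is self-contained; all names are illustrative, not Storm's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: schemas known in advance map to short keys, while unknown
// schemas fall back to the generic approach of using the full schema text
// as its own key.
public class FallbackRegistrySketch {
    interface AvroSchemaRegistry {
        String getKey(String schema);
        String getSchema(String key);
    }

    static class JarBackedRegistry implements AvroSchemaRegistry {
        private final Map<String, String> keyToSchema = new HashMap<>();
        private final Map<String, String> schemaToKey = new HashMap<>();

        // Schemas known in advance (e.g. packaged in the topology jar).
        void preload(String key, String schema) {
            keyToSchema.put(key, schema);
            schemaToKey.put(schema, key);
        }

        public String getKey(String schema) {
            return schemaToKey.getOrDefault(schema, schema);
        }

        public String getSchema(String key) {
            return keyToSchema.getOrDefault(key, key);
        }
    }

    public static void main(String[] args) {
        JarBackedRegistry reg = new JarBackedRegistry();
        reg.preload("user-v1", "{\"type\":\"record\",\"name\":\"User\"}");
        System.out.println(reg.getKey("{\"type\":\"record\",\"name\":\"User\"}"));
        System.out.println(reg.getKey("{\"type\":\"string\"}")); // unknown: fallback
    }
}
```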




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-28 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51127135
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/AvroGenericSerializer.java
 ---
@@ -0,0 +1,70 @@
(same AvroGenericSerializer diff as quoted in full in the earlier comment above)
--- End diff --

Thanks @abhishekagarwal87. I like that implementation when all the schemas 
are known in advance.  The use case I've been writing this for needs to support 
arbitrary schema evolution over the life of the topology, so I have not thought 
about it from that perspective.

As to the default, I think the safest option is the one that doesn't 
require the user to do anything, but I would be interested in hearing other 
opinions on that as well.

Thanks for the feedback everyone, I hope to have some code added to this PR 
soon.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-28 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51171090
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/AvroGenericSerializer.java
 ---
@@ -0,0 +1,70 @@
(same AvroGenericSerializer diff as quoted in full in the earlier comment above)
--- End diff --

I pushed a commit that includes a lot of the above, but is not complete.

@revans2 I included support for Confluent's registry.  Some basic testing 
in our (Target's) dev environments seemed good.
@abhishekagarwal87 I haven't used BiMaps before, so I'm sure that part 
could be greatly improved.

Still to be done:
- make the unit tests more robust and parameterized.
- add capability to inject different AvroSchemaRegistry implementations 
into the generic serializer, which might be challenging




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-29 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-176801519
  
Thanks everyone for the feedback on the code.  I am beginning to appreciate 
some of the challenges of serialization; in particular, whatever serializer is 
registered with kryo can only be invoked through a default constructor, so my 
naive implementation, which relies on initialization via constructors in some 
spots, will not work. The meaning of @revans2's comments about reading data 
from a special file in the jar is also becoming clearer to me -- that's a way 
to get the information needed for initialization without relying on 
constructor parameters.  I'll take another pass at cleaning up at least some 
of these issues and addressing comments.  There's some excitement here about 
using our Confluent registry to speed up serialization, so I really want to 
make this work.

These are some new corners of Java for me, so in the worst case scenario 
I'll simply have learned a lot.  Thanks everyone!
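The "special file in the jar" idea can be sketched like this: because Kryo instantiates serializers through their default constructor, settings can be read from a classpath resource instead. The resource name and class here are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch: a default-constructed serializer picks up its configuration from
// a resource packaged in the topology jar, since no constructor parameters
// are available.
public class ResourceConfiguredSerializer {
    private final String registryUrl;

    public ResourceConfiguredSerializer() {
        this.registryUrl = readFirstLineOfResource("serializer.config");
    }

    public String getRegistryUrl() {
        return registryUrl;
    }

    static String readFirstLineOfResource(String name) {
        InputStream in = ResourceConfiguredSerializer.class
                .getClassLoader().getResourceAsStream(name);
        if (in == null) {
            return null; // resource not packaged; caller must handle this
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new ResourceConfiguredSerializer().getRegistryUrl());
    }
}
```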




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-30 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-177263999
  
@revans2 @abhishekagarwal87 Addressed your comments and made the serializer 
scheme usable by kryo.  Removed the ConfluentAvroSerializer unit test because 
I cannot yet get a usable, test-local schema registry up and running, but it 
does work fine in my dev environment.

@abhishekagarwal87 I could not compile with that shaded package name, but I 
am also not an advanced user of shade, so I could be missing something.  Also, 
I left the GenericSerializer as the default for now.  I would rather have a 
slow default that works and then guide users to better options in the 
documentation, but I'm open to counter-arguments.

Still to do, based on additional feedback:
- refine unit tests one more time
- add documentation

Thanks!




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-30 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51349470
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/avro/ConfluentAvroSerializer.java
 ---
@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.avro;
+
+import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
+import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
+import 
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException;
+import org.apache.avro.Schema;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+
+/**
+ *
+ */
+public class ConfluentAvroSerializer extends AbstractAvroSerializer {
+
+private SchemaRegistryClient theClient;
+final private String url;
+
+public ConfluentAvroSerializer() throws IOException {
+//Empty
+InputStream in = 
this.getClass().getClassLoader().getResourceAsStream("ConfluentAvroSerializer.config");
+BufferedReader reader = new BufferedReader(new 
InputStreamReader(in));
+url = reader.readLine();
+}
+
+@Override
+public String getFingerprint(Schema schema) {
+if (theClient == null)
+{
+theClient = initializeClient();
+}
+String subject = schema.getName();
+final int guid;
+try {
+guid = theClient.register(subject, schema);
+System.out.println("GUID: [" + guid + "]");
--- End diff --

Oops, thanks.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-30 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51349531
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/avro/ConfluentAvroSerializer.java
 ---
@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.avro;
+
+import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
+import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
+import io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException;
+import org.apache.avro.Schema;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+
+/**
+ *
+ */
+public class ConfluentAvroSerializer extends AbstractAvroSerializer {
+
+private SchemaRegistryClient theClient;
+final private String url;
+
+public ConfluentAvroSerializer() throws IOException {
+//Empty
--- End diff --

I'd like to, but it wasn't obvious how I could expose the global config to that class, since Kryo will only use the default constructor.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-01-30 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-177363506
  
Code upmerged and commits squashed.  @abhishekagarwal87 Your pointer to the 
serialization examples was terrific, thank you!  It worked perfectly and really 
cleaned up that class.

Documentation still to come, assuming the interface has stabilized.  Thanks!




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-02-01 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51421641
  
--- Diff: external/storm-hdfs/src/main/java/org/apache/storm/hdfs/avro/DefinedAvroSerializer.java ---
@@ -0,0 +1,62 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.avro;
+
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaNormalization;
+import org.apache.commons.codec.binary.Base64;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.security.NoSuchAlgorithmException;
+import java.util.HashMap;
+import java.util.Map;
+
+public class DefinedAvroSerializer extends AbstractAvroSerializer {
+
+private final static String FP_ALGO = "CRC-64-AVRO";
+final Map fingerprint2schemaMap = new HashMap<>();
+final Map schema2fingerprintMap = new HashMap<>();
+
+public DefinedAvroSerializer() throws IOException, NoSuchAlgorithmException {
--- End diff --

@abhishekagarwal87 In this case, getting the pre-defined schemas from a 
file felt like the right thing to do.  A user could specify an arbitrary number 
of schemas, each of arbitrary size.  That didn't feel right to put in the 
global storm config.  Happy to hear other opinions on that though.  Thanks!
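The two maps in the diff above amount to a bidirectional fingerprint/schema lookup built from a schema file. As a rough stdlib-only sketch of that bookkeeping (using `java.util.zip.CRC32` as a stand-in for Avro's `SchemaNormalization` fingerprint with the `CRC-64-AVRO` algorithm, and raw JSON strings as stand-ins for parsed `Schema` objects — both are assumptions for illustration, not the PR's code):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

public class FingerprintMapSketch {
    // Stand-in fingerprint; the real serializer would call the Avro library's
    // SchemaNormalization with the "CRC-64-AVRO" algorithm instead.
    static long fingerprint(String schemaJson) {
        CRC32 crc = new CRC32();
        crc.update(schemaJson.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        Map<Long, String> fingerprint2schemaMap = new HashMap<>();
        Map<String, Long> schema2fingerprintMap = new HashMap<>();

        // Each schema read from the config file would be registered like this.
        String schema = "{\"type\":\"string\"}";
        long fp = fingerprint(schema);
        fingerprint2schemaMap.put(fp, schema);   // reader side: fingerprint -> schema
        schema2fingerprintMap.put(schema, fp);   // writer side: schema -> fingerprint

        // Round trip: the fingerprint written with a record recovers the schema.
        System.out.println(schema.equals(
                fingerprint2schemaMap.get(schema2fingerprintMap.get(schema))));
    }
}
```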




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-02-01 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-178093239
  
Documentation added and DefinedAvroSerializer renamed to 
FixedAvroSerializer.

The travis failures seem unrelated to this change?




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-02-01 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-178200940
  
rat exclusions updated, imports fixed, commits squashed.

Thank you again @revans2 and @abhishekagarwal87 for the feedback. I ended 
up learning way more from this exercise than I expected.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-02-02 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-178648710
  
Thanks @revans2, I will check it out.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-02-02 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1052#issuecomment-178672701
  
@revans2 Spot on, I just pushed that change. Thanks for catching that. If 
the travis builds pass, I will commit today.




[GitHub] storm pull request: STORM-1504: Add Serializer and instruction for...

2016-02-02 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1052#discussion_r51609059
  
--- Diff: external/storm-hdfs/README.md ---
@@ -315,6 +314,18 @@ An `org.apache.avro.Schema` object cannot be directly provided since it does not
 The AvroGenericRecordBolt expects to receive tuples containing an Avro 
GenericRecord that conforms to the provided
 schema.
 
+To use this bolt you **must** register the appropriate Kryo serializers 
with your topology configuration.  A convenience
+method is provided for this:
+
+```AvroGenericRecordBolt.addAvroKryoSerializations(conf);```
+
+By default Storm will use the ```GenericAvroSerializer``` to handle 
serialization.  This will work, but there are much 
+faster options available if you can pre-define the schemas you will be 
using or utilize an external schema registry. An
+implementation using the Confluent Schema Registry is provided, but others 
can be implemented and provided to Storm.
+Please see the javadoc for classes in org.apache.storm.hdfs.avro for 
information about using the built-in options or
+creating your own.
+
--- End diff --

It looked good in IntelliJ, thanks for checking it here. Should be fixed 
now.




[GitHub] storm pull request: STORM-1518: Backport of STORM-1504

2016-02-02 Thread dossett
GitHub user dossett opened a pull request:

https://github.com/apache/storm/pull/1066

STORM-1518: Backport of STORM-1504



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dossett/storm STORM-1518

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1066


commit ea9abeb2d67402cf82e80adf6498a30a38b3a0a4
Author: Aaron Dossett 
Date:   2016-02-02T21:35:39Z

STORM-1518: Backport of STORM-1504






[GitHub] storm pull request: Storm 1526 - Improve Storm core performance

2016-02-04 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1080#issuecomment-180173001
  
@roshannaik Can you update the title of this PR to begin with "STORM-1526" 
so that it gets associated with the JIRA?  Thanks!




[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-02-05 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-180420947
  
I up-merged this PR.  Does anyone have feedback on this change?  Are API-breaking changes off the table for 2.0?  If so, I can close this PR and come back to it at a later time.




[GitHub] storm pull request: STORM-828 HdfsBolt takes a lot of configuratio...

2016-02-05 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/668#issuecomment-180552919
  
@redsanket Are you still working on this?




[GitHub] storm pull request: Added scope for junit and mockito dependencies

2016-02-08 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1088#issuecomment-181587349
  
+1

@revans2 The travis failures seem unrelated to this change?  I pulled @shoebamer's PR and could not reproduce the test failures locally (with `mvn clean integration-test -Pall-tests -Prat`).  If you are still +1, I'll merge.




[GitHub] storm pull request: STORM-1536: Remove Java use of TimeCacheMap

2016-02-09 Thread dossett
GitHub user dossett opened a pull request:

https://github.com/apache/storm/pull/1092

STORM-1536: Remove Java use of TimeCacheMap



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dossett/storm TimeCache

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1092.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1092


commit 6caab2e83846bac63e01239e968c1595201c4aa9
Author: Aaron Dossett 
Date:   2016-02-09T17:34:44Z

STORM-1536: Remove Java use of TimeCacheMap






[GitHub] storm pull request: STORM-1536: Remove Java use of TimeCacheMap

2016-02-10 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1092#issuecomment-182624697
  
Ah, thanks. TimeCacheMap was marked as deprecated, so this was a learning exercise as much as anything that would hopefully have some code hygiene benefit.  Apparently much more needs to be done before it can actually be removed, so I'll close the PR.  Thanks again!




[GitHub] storm pull request: STORM-1536: Remove Java use of TimeCacheMap

2016-02-10 Thread dossett
Github user dossett closed the pull request at:

https://github.com/apache/storm/pull/1092




[GitHub] storm pull request: STORM-1536: Remove Java use of TimeCacheMap

2016-02-11 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1092#issuecomment-182878387
  
@knusbaum Just to be clear, you're suggesting that the `@deprecated` 
annotation and explanation be removed from TimeCacheMap?




[GitHub] storm pull request: STORM-1541 Change scope of 'hadoop-minicluster...

2016-02-11 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1102#issuecomment-183147381
  
+1 for both branches




[GitHub] storm pull request: storm-hdfs : change visibility of create and c...

2016-02-18 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1122#issuecomment-185828905
  
Would there be a drawback to making all of the abstract methods protected 
if we want to support end user extensions?




[GitHub] storm pull request: Add a version to snakeyaml dependency

2016-02-18 Thread dossett
GitHub user dossett opened a pull request:

https://github.com/apache/storm/pull/1126

Add a version to snakeyaml dependency



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dossett/storm snake

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/1126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1126








[GitHub] storm pull request: Force to be compatible with heron and add test...

2016-02-19 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1056#issuecomment-186397905
  
This appears to have been merged to master, was that intentional?




[GitHub] storm pull request: Force to be compatible with heron and add test...

2016-02-19 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1056#issuecomment-186407865
  
Thanks! Good learning.




[GitHub] storm pull request #1490: STORM-1902 (1.x): add a simple & flexible FileName...

2016-06-17 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1490#discussion_r67519979
  
--- Diff: external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/format/TestSimpleFileNameFormat.java ---
@@ -0,0 +1,77 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.bolt.format;
+
+import java.net.UnknownHostException;
+import java.text.SimpleDateFormat;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.storm.task.TopologyContext;
+import org.apache.storm.utils.Utils;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestSimpleFileNameFormat {
+
+@Test
+public void testDefaults() {
+SimpleFileNameFormat format = new SimpleFileNameFormat();
+   format.prepare(null, createTopologyContext());
+String path = format.getPath();
+String name = format.getName(1, System.currentTimeMillis());
+
+Assert.assertEquals("/storm", path);
+String now = new SimpleDateFormat("yyyyMMddHHmmss").format(System.currentTimeMillis());
+Assert.assertEquals(now+".1.txt", name);
--- End diff --

This is a nit, but this test might occasionally fail since the time is 
taken at two different points which might resolve to different seconds.
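A deterministic variant (a sketch, not the PR's code — the `getName` below is a hypothetical stand-in for `SimpleFileNameFormat.getName`) captures the timestamp once and feeds the same value to both sides of the comparison:

```java
import java.text.SimpleDateFormat;

public class DeterministicNameTest {
    // Stand-in for SimpleFileNameFormat.getName(taskIndex, timestamp);
    // the real method formats whatever timestamp it is handed.
    static String getName(int taskIndex, long timestamp) {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(timestamp)
                + "." + taskIndex + ".txt";
    }

    public static void main(String[] args) {
        long ts = System.currentTimeMillis();  // capture the instant once
        String name = getName(1, ts);          // format it here...
        String now = new SimpleDateFormat("yyyyMMddHHmmss").format(ts); // ...and here
        // Both strings derive from the same instant, so the comparison cannot
        // race across a second boundary.
        System.out.println(name.equals(now + ".1.txt"));
    }
}
```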




[GitHub] storm pull request #1490: STORM-1902 (1.x): add a simple & flexible FileName...

2016-06-17 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1490#discussion_r67520068
  
--- Diff: external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/format/TestSimpleFileNameFormat.java ---
@@ -0,0 +1,77 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.bolt.format;
+
+import java.net.UnknownHostException;
+import java.text.SimpleDateFormat;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.storm.task.TopologyContext;
+import org.apache.storm.utils.Utils;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestSimpleFileNameFormat {
+
+@Test
+public void testDefaults() {
+SimpleFileNameFormat format = new SimpleFileNameFormat();
+   format.prepare(null, createTopologyContext());
+String path = format.getPath();
+String name = format.getName(1, System.currentTimeMillis());
+
+Assert.assertEquals("/storm", path);
+String now = new 
SimpleDateFormat("MMddHHmmss").format(System.currentTimeMillis());
+Assert.assertEquals(now+".1.txt", name);
+}
+
+@Test
+public void testParameters() {
+SimpleFileNameFormat format = new SimpleFileNameFormat()
+.withName("$TIME.$HOST.$COMPONENT.$TASK.$NUM.txt")
+.withPath("/mypath")
+.withTimeFormat("yyyy-MM-dd HH:mm:ss");
+format.prepare(null, createTopologyContext());
+String path = format.getPath();
+String name = format.getName(1, System.currentTimeMillis());
+
+Assert.assertEquals("/mypath", path);
+String now = new SimpleDateFormat("-MM-dd 
HH:mm:ss").format(System.currentTimeMillis());
--- End diff --

Same comment as above.




[GitHub] storm issue #1490: STORM-1902 (1.x): add a simple & flexible FileNameFormat ...

2016-06-17 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1490
  
Two minor nits noted inline, but I am otherwise +1.  Test failures look 
unrelated.  

This is a nice addition.




[GitHub] storm issue #1494: Update README.md

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1494
  
Applied to master, 1.x-branch, and 1.0.x-branch




[GitHub] storm issue #1490: STORM-1902 (1.x): add a simple & flexible FileNameFormat ...

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1490
  
+1




[GitHub] storm issue #1489: STORM-1902: add a simple & flexible FileNameFormat for st...

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1489
  
+1




[GitHub] storm issue #1478: [STORM-1896] HdfsSpout remove duplicated code

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1478
  
+1




[GitHub] storm issue #1439: Add mvn versions note to DEVELOPER

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1439
  
Applied this separately from the PR, since my fork was deleted after I opened it.




[GitHub] storm pull request #1439: Add mvn versions note to DEVELOPER

2016-06-21 Thread dossett
Github user dossett closed the pull request at:

https://github.com/apache/storm/pull/1439




[GitHub] storm issue #1497: STORM-1909 Updating HDFS spout documentation

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1497
  
+1




[GitHub] storm issue #1496: Storm 1909 - Updating HDFS spout documentation

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1496
  
+1




[GitHub] storm issue #1504: 1.0.x branch

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1504
  
@qubibo Could you close this PR?




[GitHub] storm issue #1462: [STORM-1883] FileReader extends Closeable Interface

2016-06-21 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1462
  
Are there any functional changes here?  I'm all in favor of using standard 
Java interfaces, but I don't see any benefit to `Closeable` being used here.  
Possible changes could include:

- Using the try-with-resources idiom now that FileReader is `Closeable` [1]
- Concrete classes no longer having to catch their own `IOException`s, since 
`close` is declared to throw that [2]

[1] - 
https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
[2] - 
https://github.com/darionyaphet/storm/blob/ad74ad10012df6999c94652993bcd66e90db53ff/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/spout/TextFileReader.java#L104
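For the first point, a minimal sketch of what try-with-resources buys once the interface extends `Closeable` (using a hypothetical stand-in class, not the actual storm-hdfs `FileReader`):

```java
import java.io.Closeable;
import java.io.IOException;

public class TryWithResourcesSketch {
    // Hypothetical stand-in for the spout's FileReader; any Closeable works here.
    static class DemoReader implements Closeable {
        boolean closed = false;
        String next() { return "record"; }
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) throws IOException {
        DemoReader reader = new DemoReader();
        // close() is invoked automatically when the block exits,
        // even if next() were to throw.
        try (DemoReader r = reader) {
            System.out.println(r.next());
        }
        System.out.println(reader.closed); // close() ran with no explicit finally
    }
}
```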




[GitHub] storm issue #1521: STORM-1901: Avro Integration for Storm-Kafka

2016-06-26 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1521
  
Thanks @vesense, I will review this week.




[GitHub] storm pull request #1521: STORM-1901: Avro Integration for Storm-Kafka

2016-06-27 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1521#discussion_r68657788
  
--- Diff: external/storm-kafka/README.md ---
@@ -360,6 +361,44 @@ For Trident:
 StormSubmitter.submitTopology("kafkaTridentTest", conf, 
topology.build());
 ```
 
+## Avro Integration for Storm-Kafka
+
+To integrate with Avro you **must** register the appropriate Kryo 
serializers with your topology configuration.  A convenience method is provided 
for this:
+
+`AvroUtils.addAvroKryoSerializations(conf);`
+
+###AvroSchema
+
+`AvroSchema` is an implementation of `Schema`. You can use AvroSchema to 
read Avro GenericRecord from Kafka:
--- End diff --

Should be `AvroScheme` and `Scheme`


---


[GitHub] storm issue #1521: STORM-1901: Avro Integration for Storm-Kafka

2016-06-27 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1521
  
@harshach My interpretation of the PR is that it moves some Avro-specific 
code I added to storm-hdfs into a more shareable module.

@vesense I like the idea of making it easier to read/write avro data in a 
Storm topology, thanks for looking at this.  I don't know that it needs to be 
its own module though, maybe it could be somewhere in `org.apache.storm.utils` ?


---


[GitHub] storm pull request #1521: STORM-1901: Avro Integration for Storm-Kafka

2016-06-27 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1521#discussion_r68658648
  
--- Diff: 
external/storm-kafka/src/jvm/org/apache/storm/kafka/AvroScheme.java ---
@@ -0,0 +1,59 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.kafka;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericContainer;
+import org.apache.storm.avro.DefaultDirectAvroSerializer;
+import org.apache.storm.avro.DirectAvroSerializer;
+import org.apache.storm.spout.Scheme;
+import org.apache.storm.tuple.Fields;
+import org.apache.storm.tuple.Values;
+import org.apache.storm.utils.Utils;
+
+public class AvroScheme implements Scheme {
+public static final String AVRO_SCHEME_KEY = "avro";
+DirectAvroSerializer serializer = new DefaultDirectAvroSerializer();
--- End diff --

Could this be more flexible, the way AbstractAvroSerializer is above?  I 
could see creating an AvroScheme and supplying a serializer for it to use based 
on how the Avro data is being put onto the Kafka topic.

My original use case was to read Avro data that had been serialized onto a 
Kafka topic with a Confluent serializer.  My solution was to create a bolt 
that read the raw data with `RawScheme`, deserialized the raw bytes into a 
GenericRecord, and then emitted the generic record.

Something like this would have been very nice if I could give it a 
ConfluentDeserializer via dependency injection or something similar.
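A minimal sketch of what that injection could look like -- the `RecordDeserializer` interface and the class names here are hypothetical, not the actual storm-kafka API:

```java
import java.nio.ByteBuffer;

// Hypothetical deserializer contract; a real implementation would wrap
// a DirectAvroSerializer, a Confluent deserializer, or similar.
interface RecordDeserializer {
    String deserialize(ByteBuffer buf);
}

// Sketch of a scheme that receives its deserializer via the constructor
// instead of hard-coding a single default implementation.
class PluggableAvroScheme {
    private final RecordDeserializer deserializer;

    PluggableAvroScheme(RecordDeserializer deserializer) {
        this.deserializer = deserializer;
    }

    String deserializeRecord(ByteBuffer buf) {
        return deserializer.deserialize(buf);
    }
}

public class PluggableSchemeDemo {
    public static void main(String[] args) {
        // Swap in any deserializer without touching the scheme itself.
        PluggableAvroScheme scheme = new PluggableAvroScheme(
                buf -> "decoded " + buf.remaining() + " bytes");
        System.out.println(scheme.deserializeRecord(ByteBuffer.allocate(4)));
    }
}
```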


---


[GitHub] storm issue #1549: Fix Comment Error

2016-07-18 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1549
  
+1


---


[GitHub] storm pull request #1566: [STORM-1971] HDFS Timed Synchronous Policy

2016-07-18 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1566#discussion_r71151878
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/sync/CountSyncPolicy.java
 ---
@@ -26,7 +26,7 @@
  * have been processed.
  */
 public class CountSyncPolicy implements SyncPolicy {
-private int count;
+private final int count;
--- End diff --

Great catch to make that final. I would be +1 on that change in a separate 
PR.


---


[GitHub] storm issue #1639: Appying for Elasticsearch 2.3.5

2016-09-06 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1639
  
@kojiisd Thanks for this contribution. We have a few different pull 
requests related to support for ElasticSearch 2.X.  I'll start a discussion on 
the DEV list for the best way to handle them.


---


[GitHub] storm issue #1744: STORM-1276: line for line translation of nimbus to java

2016-10-26 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1744
  
I don't know enough about nimbus or clojure to vote on this PR, but this is 
an exciting change!


---


[GitHub] storm issue #1751: [STORM-2172][SQL] Support Avro as input / output format

2016-11-08 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1751
  
storm-hdfs also has Avro (de)serializer functionality, including the 
ability to plug into different Avro schema registries 
(https://github.com/apache/storm/tree/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/avro).
  Can that functionality be refactored and shared in `storm-sql`?  Otherwise 
there will be two independent avro serialization schemes in Storm.

Apologies for the late comments on this change.


---


[GitHub] storm issue #1773: STORM-2198: perform RotationAction when stopping HdfsBolt

2016-11-15 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1773
  
+1


---


[GitHub] storm issue #1791: STORM-2212: Remove Redundant Declarations in Maven POM Fi...

2016-11-22 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1791
  
+1


---


[GitHub] storm issue #1859: STORM-2276 Remove twitter4j usages due to license issue (...

2017-01-05 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1859
  
+1


---


[GitHub] storm issue #1858: STORM-2276 Remove twitter4j usages due to license issue (...

2017-01-05 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1858
  
+1


---


[GitHub] storm issue #1860: STORM-2276 Remove twitter4j usages due to license issue (...

2017-01-05 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1860
  
+1


---


[GitHub] storm issue #1044: STORM-1464: Support multiple file outputs

2017-02-09 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1044
  
@ptgoetz This PR was never, to my knowledge, applied to 1.x.  It contained 
a very small incompatible API change to FileRotation, so users with their own 
FileRotation implementation would break.  I'd certainly be open to 
applying it to 1.x, but that wouldn't be consistent with semantic versioning, 
FWIW.  (cc @matthewdfleming)


---


[GitHub] storm pull request #1978: STORM-2387 Handle tick tuples properly for Bolts i...

2017-02-28 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1978#discussion_r103453575
  
--- Diff: 
external/storm-druid/src/main/java/org/apache/storm/druid/bolt/DruidBeamBolt.java
 ---
@@ -76,30 +77,33 @@ public void prepare(Map stormConf, TopologyContext 
context, OutputCollector coll
 
 @Override
 public void execute(final Tuple tuple) {
-  Future future = 
tranquilizer.send((druidEventMapper.getEvent(tuple)));
-LOG.debug("Sent tuple : [{}]" , tuple);
+if (TupleUtils.isTick(tuple)) {
--- End diff --

@satishd I agree in principle.  I once experimented with that approach and 
what I found difficult was declaring the `throws` for each of those methods 
since bolts can generate all sorts of exceptions.

+1


---


[GitHub] storm issue #1044: STORM-1464: Support multiple file outputs

2017-05-03 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1044
  
Hi @ryanpersaud -- that is a very good observation, it would seem that 
doesn't happen.  I admit to being biased against rotation actions and would 
never use them in production, which is probably why I never noticed it.  But 
that's not a good enough reason for a bug like that!

Since changing jobs last year I don't have the opportunity to actively work 
on Storm anymore, but if you have thoughts on a fix I would certainly give a 
code review my full attention. A failing unit test to confirm the behavior 
(using a Mockito spy to confirm that close isn't called) would be a very 
helpful start.
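A rough sketch of such a test, using a hand-rolled recording fake in place of a Mockito spy (all names here are hypothetical, not the actual storm-hdfs writer API):

```java
// Hypothetical writer fake that records close() calls, standing in for
// a Mockito spy on the real storm-hdfs writer.
class RecordingWriter {
    boolean closeCalled = false;

    void close() {
        closeCalled = true;
    }

    void rotateWithoutClose() {
        // Simulates the suspected buggy rotation path: the file is
        // "rotated" but close() is never invoked on the writer.
    }
}

public class RotationCloseCheck {
    public static void main(String[] args) {
        RecordingWriter writer = new RecordingWriter();
        writer.rotateWithoutClose();
        // A real failing test would assert closeCalled is true here,
        // confirming the bug; this just prints the observed state.
        System.out.println("closeCalled=" + writer.closeCalled);
    }
}
```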

Thank you for pointing this out!


---


[GitHub] storm issue #1044: STORM-1464: Support multiple file outputs

2017-05-10 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1044
  
@ryanpersaud Your basic approach looks right to me, thanks for taking the 
lead on this.

Regarding rotation actions, my concern is that there's no absolute 
guarantee they will get executed. If a worker crashes and restarts then the 
files it had open before will never have rotation actions performed against 
them.


---


[GitHub] storm issue #2157: STORM-2517 storm-hdfs writers can't be subclassed

2017-06-27 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/2157
  
I hadn't thought about subclassing the writers; instead I thought a new 
bolt would implement its own writer.  @Angus-Slalom what use case do you have 
in mind?  I'm not opposed to the change, just curious.


---


[GitHub] storm issue #2157: STORM-2517 storm-hdfs writers can't be subclassed

2017-06-27 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/2157
  
@Angus-Slalom Thanks, that makes sense! Feel free to contribute logging 
changes as well if you think they are generally applicable :-)

I'm at +1 pending the change Harsha suggested.


---


[GitHub] storm issue #1044: STORM-1464: Support multiple file outputs

2017-11-07 Thread dossett
Github user dossett commented on the issue:

https://github.com/apache/storm/pull/1044
  
@fescandell I don't believe there is. Because I was unfamiliar with the 
Trident API, I didn't make any similar changes to it.


---


[GitHub] storm pull request #1337: STORM-1475: Add storm-elasticsearch2 module

2018-09-18 Thread dossett
Github user dossett closed the pull request at:

https://github.com/apache/storm/pull/1337


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-14 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-196316194
  
Thanks @arunmahadevan and @harshach.  I'll rebase and review feedback this 
week.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56240076
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java
 ---
@@ -52,9 +54,10 @@
  * Half of the default Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
  */
 private static final int DEFAULT_TICK_TUPLE_INTERVAL_SECS = 15;
+private static final Integer DEFAULT_MAX_OPEN_FILES = 5;
--- End diff --

Fair point.  What seems reasonable?  20? 50?


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56240831
  
--- Diff: external/storm-hdfs/README.md ---
@@ -240,6 +240,23 @@ If you are using Trident and sequence files you can do 
something like this:
 .addRotationAction(new 
MoveFileAction().withDestination("/dest2/"));
 ```
 
+### Data Partitioning
+Data can be partitioned to different HDFS directories based on 
characteristics of the tuple being processed or purely
+external factors, such as system time.  To partition your data, write 
a class that implements the ```Partitioner```
+interface and pass it to the withPartitioner() method of your bolt. The 
getPartitionPath() method returns a partition 
+path for a given tuple.
+
+Here's an example of a Partitioner that operates on a specific field of 
data:
+
+```java
+
+Partitioner partitoner = new Partitioner() {
+@Override
+public String getPartitionPath(Tuple tuple) {
+return Path.SEPARATOR + "city=" + 
tuple.getStringByField("city");
--- End diff --

The "city=" was from a Hive-specific use case I was working on.  The 
partitioner would return "city=" to create a partitioning scheme that 
would be usable for a Hive external table.  I will remove that from the 
standard documentation and maybe add a Hive-specific note.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56241176
  
--- Diff: external/storm-hdfs/README.md ---
@@ -240,6 +240,23 @@ If you are using Trident and sequence files you can do 
something like this:
 .addRotationAction(new 
MoveFileAction().withDestination("/dest2/"));
 ```
 
+### Data Partitioning
+Data can be partitioned to different HDFS directories based on 
characteristics of the tuple being processed or purely
+external factors, such as system time.  To partition your data, write 
a class that implements the ```Partitioner```
+interface and pass it to the withPartitioner() method of your bolt. The 
getPartitionPath() method returns a partition 
+path for a given tuple.
+
+Here's an example of a Partitioner that operates on a specific field of 
data:
+
+```java
+
+Partitioner partitoner = new Partitioner() {
+@Override
+public String getPartitionPath(Tuple tuple) {
+return Path.SEPARATOR + "city=" + 
tuple.getStringByField("city");
--- End diff --

I thought about having Partitioner return an actual path but decided 
against it for two reasons:
- I liked the idea of the "partition" being solely a function of the tuple 
without reference to anything else
- Since end users implement a Partitioner, having it return a complete path 
would give the user access to details otherwise hidden from their code.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56241745
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java
 ---
@@ -198,22 +194,62 @@ public final void execute(Tuple tuple) {
 }
 }
 
-if(this.rotationPolicy.mark(tuple, this.offset)) {
-try {
-rotateOutputFile();
-this.rotationPolicy.reset();
-this.offset = 0;
-} catch (IOException e) {
-this.collector.reportError(e);
-LOG.warn("File could not be rotated");
-//At this point there is nothing to do.  In all 
likelihood any filesystem operations will fail.
-//The next tuple will almost certainly fail to write 
and/or sync, which force a rotation.  That
-//will give rotateAndReset() a chance to work which 
includes creating a fresh file handle.
-}
+if (writer != null && writer.needsRotation()) {
+doRotationAndRemoveWriter(writerKey, writer);
 }
 }
 }
 
+private AbstractHDFSWriter getOrCreateWriter(String writerKey, Tuple 
tuple) throws IOException {
+AbstractHDFSWriter writer;
+
+writer = writers.get(writerKey);
+if (writer == null) {
+if (writers.size() >= maxOpenFiles)
+{
+String keyToOldest = getKeyToOldestWriter();
+AbstractHDFSWriter oldest = writers.get(keyToOldest);
+rotateOutputFile(oldest);
+writers.remove(keyToOldest);
--- End diff --

Great suggestion, thank you.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56247649
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java
 ---
@@ -223,29 +259,57 @@ public final void execute(Tuple tuple) {
 public void declareOutputFields(OutputFieldsDeclarer 
outputFieldsDeclarer) {
 }
 
-/**
- * writes a tuple to the underlying filesystem but makes no guarantees 
about syncing data.
- *
- * this.offset is also updated to reflect additional data written
- *
- * @param tuple
- * @throws IOException
- */
-abstract void writeTuple(Tuple tuple) throws IOException;
+private void syncAllWriters() throws IOException {
+for (AbstractHDFSWriter writer : writers.values()) {
+writer.sync();
+}
+}
 
-/**
- * Make the best effort to sync written data to the underlying file 
system.  Concrete classes should very clearly
- * state the file state that sync guarantees.  For example, HdfsBolt 
can make a much stronger guarantee than
- * SequenceFileBolt.
- *
- * @throws IOException
- */
-abstract void syncTuples() throws IOException;
+private String getKeyToOldestWriter()
+{
+String oldestKey = null;
+long oldestTime = Long.MAX_VALUE;
+for (final Map.Entry<String, AbstractHDFSWriter> entry : 
writers.entrySet()) {
+if (entry.getValue().getLastUsedTime() < oldestTime) {
+oldestKey = entry.getKey();
+oldestTime = entry.getValue().getLastUsedTime();
+}
+}
 
-abstract void closeOutputFile() throws IOException;
+return oldestKey;
+}
 
-abstract Path createOutputFile() throws IOException;
+private void startTimedRotationPolicy() {
+long interval = 
((TimedRotationPolicy)this.rotationPolicy).getInterval();
+this.rotationTimer = new Timer(true);
+TimerTask task = new TimerTask() {
+@Override
+public void run() {
+for (final AbstractHDFSWriter writer : writers.values()) {
+try {
+rotateOutputFile(writer);
+} catch (IOException e) {
+LOG.warn("IOException during scheduled file 
rotation.", e);
+}
+}
+writers.clear();
+}
+};
+this.rotationTimer.scheduleAtFixedRate(task, interval, interval);
+}
+
+protected Path getBasePathForNextFile(Tuple tuple) {
+
+Path fullPathToFile = new Path(this.fsUrl + 
this.fileNameFormat.getPath() + this.partitioner.getPartitionPath(tuple),
+this.fileNameFormat.getName(this.rotation, 
System.currentTimeMillis()));
--- End diff --

Very good observation, thank you.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56249577
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java
 ---
@@ -223,29 +259,57 @@ public final void execute(Tuple tuple) {
 public void declareOutputFields(OutputFieldsDeclarer 
outputFieldsDeclarer) {
 }
 
-/**
- * writes a tuple to the underlying filesystem but makes no guarantees 
about syncing data.
- *
- * this.offset is also updated to reflect additional data written
- *
- * @param tuple
- * @throws IOException
- */
-abstract void writeTuple(Tuple tuple) throws IOException;
+private void syncAllWriters() throws IOException {
+for (AbstractHDFSWriter writer : writers.values()) {
+writer.sync();
+}
+}
 
-/**
- * Make the best effort to sync written data to the underlying file 
system.  Concrete classes should very clearly
- * state the file state that sync guarantees.  For example, HdfsBolt 
can make a much stronger guarantee than
- * SequenceFileBolt.
- *
- * @throws IOException
- */
-abstract void syncTuples() throws IOException;
+private String getKeyToOldestWriter()
+{
+String oldestKey = null;
+long oldestTime = Long.MAX_VALUE;
+for (final Map.Entry<String, AbstractHDFSWriter> entry : 
writers.entrySet()) {
+if (entry.getValue().getLastUsedTime() < oldestTime) {
+oldestKey = entry.getKey();
+oldestTime = entry.getValue().getLastUsedTime();
+}
+}
 
-abstract void closeOutputFile() throws IOException;
+return oldestKey;
+}
 
-abstract Path createOutputFile() throws IOException;
+private void startTimedRotationPolicy() {
+long interval = 
((TimedRotationPolicy)this.rotationPolicy).getInterval();
+this.rotationTimer = new Timer(true);
+TimerTask task = new TimerTask() {
+@Override
+public void run() {
+for (final AbstractHDFSWriter writer : writers.values()) {
+try {
+rotateOutputFile(writer);
+} catch (IOException e) {
+LOG.warn("IOException during scheduled file 
rotation.", e);
+}
+}
+writers.clear();
+}
+};
+this.rotationTimer.scheduleAtFixedRate(task, interval, interval);
+}
+
+protected Path getBasePathForNextFile(Tuple tuple) {
+
+Path fullPathToFile = new Path(this.fsUrl + 
this.fileNameFormat.getPath() + this.partitioner.getPartitionPath(tuple),
+this.fileNameFormat.getName(this.rotation, 
System.currentTimeMillis()));
+
+return fullPathToFile;
+}
 
 abstract void doPrepare(Map conf, TopologyContext topologyContext, 
OutputCollector collector) throws IOException;
 
+abstract String getWriterKey(Tuple tuple);
--- End diff --

I don't believe it can go away entirely.  One of the things it does is 
track the need for multiple simultaneously open files within the same partition 
directory.  For example, the Avro bolt must write objects with different 
schemas to different files regardless of partitioning.  

I think it's best to keep that separate notion and have a key definition 
that isn't just the partition path.  I will think about that some more though.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on a diff in the pull request:

https://github.com/apache/storm/pull/1044#discussion_r56264969
  
--- Diff: 
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java
 ---
@@ -52,9 +54,10 @@
  * Half of the default Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
  */
 private static final int DEFAULT_TICK_TUPLE_INTERVAL_SECS = 15;
+private static final Integer DEFAULT_MAX_OPEN_FILES = 5;
--- End diff --

@rajkonda I like that idea, but any sort of fan-out could happen 
between a spout and the HDFS bolt.  I'll just start with a much larger number 
like 20.


---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-197087032
  
@arunmahadevan I have addressed several of your comments -- thanks for the 
review, it was helpful. Per my other comments, the key can't be based only on 
the partition information.


---


[GitHub] storm pull request: Fixed incorrect storm-kafka documentation.

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1168#issuecomment-197092133
  
+1


[GitHub] storm pull request: fix storm-starter README.markdown links

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1154#issuecomment-197095427
  
+1


[GitHub] storm pull request: Add a version to snakeyaml dependency

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1126#issuecomment-197095654
  
You are correct.


[GitHub] storm pull request: Add a version to snakeyaml dependency

2016-03-15 Thread dossett
Github user dossett closed the pull request at:

https://github.com/apache/storm/pull/1126


[GitHub] storm pull request: Rename README.markdown to README.md.

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1213#issuecomment-197100612
  
Seems like a reasonable change.  Perhaps leave a README.markdown with a 
pointer to README.md though?


[GitHub] storm pull request: [STORM-1624] add maven central status in READM...

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1212#issuecomment-197102565
  
+1


[GitHub] storm pull request: STORM-1581: Repair github links in storm docum...

2016-03-15 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1157#issuecomment-197103542
  
+1


[GitHub] storm pull request: Rename README.markdown to README.md.

2016-03-18 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1213#issuecomment-197563985
  
My thought was something like a "The README is now found " since 
there are surely some links to README.markdown out there.


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-19 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-198393099
  
@arunmahadevan Writers should never be writing to the same file.  
`getBasePathForNextFile` will increment the rotation value before opening the 
file.  The file name also includes `System.currentTimeMillis()`, which should 
further prevent conflicts.

There's also a unit test that attempts to validate this by writing two 
different schemas and confirming that two files are created: 
https://github.com/dossett/storm/blob/STORM-1494/external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/AvroGenericRecordBoltTest.java#L156-L168
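As a hedged illustration of the scheme described above (a simplified sketch with assumed names, not the actual `AbstractHdfsBolt` implementation), a per-writer rotation counter combined with the creation timestamp yields distinct paths even for files created in the same millisecond:

```java
class RotationNaming {
    private int rotation = 0;

    // Sketch of the naming scheme described above: the rotation counter is
    // incremented before each new file is opened, and the creation timestamp
    // is part of the name, so successive files never share a path.
    String nextFilePath(String dir, long createdAtMillis) {
        rotation++;
        return dir + "/" + createdAtMillis + "-" + rotation + ".avro";
    }
}
```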



[GitHub] storm pull request: hotfix: parent version for pom.xml in storm-mo...

2016-03-19 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1228#issuecomment-197833762
  
+1


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-21 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-199406859
  
Thanks @arunmahadevan and @harshach. I made one final documentation tweak 
and squashed the commits.


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-25 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-201490715
  
@arunmahadevan @harshach Are you both still +1?  I can merge, but want to 
double check first.  Thanks.


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-30 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-203615216
  
Just to reiterate, I tried to minimize public API changes, but a couple 
were unavoidable.  e.g., the FileRotation interface has a new method that any 
implementation would have to support.  Just want to make sure those are ok for 
1.x (and 2.0 for that matter)


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-03-30 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-203623408
  
@ptgoetz Thanks for confirming... even though I would have loved to get 
this in 1.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-1464: Support multiple file outputs

2016-04-07 Thread dossett
Github user dossett commented on the pull request:

https://github.com/apache/storm/pull/1044#issuecomment-206992265
  
Merged.  Thank you to everyone who provided feedback!

