[ https://issues.apache.org/jira/browse/STORM-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942372#comment-14942372 ]
ASF GitHub Bot commented on STORM-828: -------------------------------------- Github user revans2 commented on a diff in the pull request: https://github.com/apache/storm/pull/668#discussion_r41088474 --- Diff: external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/HdfsBolt.java --- @@ -44,6 +49,25 @@ private transient FSDataOutputStream out; private RecordFormat format; private long offset = 0; + private static String defaultSourceDir = "/tmp/source"; + private static String defaultDestDir = "/tmp/dest"; + private static String defaultFileExtension = ".txt"; + + + public HdfsBolt() { --- End diff -- I'm not sure a no argument constructor makes a lot of since. I think you need to specify at a minimum the destination directory. > HdfsBolt takes a lot of configuration, need good defaults > --------------------------------------------------------- > > Key: STORM-828 > URL: https://issues.apache.org/jira/browse/STORM-828 > Project: Apache Storm > Issue Type: Improvement > Components: storm-hdfs > Reporter: Robert Joseph Evans > Assignee: Sanket Reddy > > The following is code from > https://github.com/apache/storm/blob/master/external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/HdfsFileTopology.java > representing the amount of configuration required to use the HdfsBolt. > {code} > // sync the filesystem after every 1k tuples > SyncPolicy syncPolicy = new CountSyncPolicy(1000); > // rotate files every 1 min > FileRotationPolicy rotationPolicy = new TimedRotationPolicy(1.0f, > TimedRotationPolicy.TimeUnit.MINUTES); > FileNameFormat fileNameFormat = new DefaultFileNameFormat() > .withPath("/tmp/foo/") > .withExtension(".txt"); > RecordFormat format = new DelimitedRecordFormat() > .withFieldDelimiter("|"); > Yaml yaml = new Yaml(); > InputStream in = new FileInputStream(args[1]); > Map<String, Object> yamlConf = (Map<String, Object>) yaml.load(in); > in.close(); > config.put("hdfs.config", yamlConf); > HdfsBolt bolt = new HdfsBolt() > .withConfigKey("hdfs.config") > .withFsUrl(args[0]) > .withFileNameFormat(fileNameFormat) > .withRecordFormat(format) > .withRotationPolicy(rotationPolicy) > .withSyncPolicy(syncPolicy) > .addRotationAction(new > MoveFileAction().toDestination("/tmp/dest2/")); > {code} > This is way too much. If it were just an example showing all of the > possibilities that would be OK but of the 8 lines used in the construction of > the bolt, 5 of them are required or the bolt will blow up at run time. We > should provide reasonable defaults for everything that can have a reasonable > default. And required parameters should be passed in through the > constructor, not as builder arguments. I realize we need to maintain > backwards compatibility so we may need some new Bolt definitions. > {code} > HdfsTSVBolt bolt = new HdfsTSVBolt(outputDir); > {code} > If someone wanted to sync every 100 records instead of every 1000 we could do > {code} > TSVFileBolt bolt = new TSVFileBolt(outputDir).withSyncPolicy(new > CountSyncPolicy(100)) > {code} > I would like to see a base HdfsFileBolt that requires a record format, and an > output directory. It would have defaults for everything else. Then we could > have a TSVFileBolt and CSVFileBolt subclass it and ideally SequenceFileBolt > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)