Robert Joseph Evans created STORM-828:
-----------------------------------------

             Summary: HdfsBolt takes a lot of configuration, need good defaults
                 Key: STORM-828
                 URL: https://issues.apache.org/jira/browse/STORM-828
             Project: Apache Storm
          Issue Type: Improvement
            Reporter: Robert Joseph Evans


The following is code from 
https://github.com/apache/storm/blob/master/external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/HdfsFileTopology.java
 representing the amount of configuration required to use the HdfsBolt. 
{code}
        // sync the filesystem after every 1k tuples
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);

        // rotate files every 1 min
        FileRotationPolicy rotationPolicy = new TimedRotationPolicy(1.0f, 
TimedRotationPolicy.TimeUnit.MINUTES);

        FileNameFormat fileNameFormat = new DefaultFileNameFormat()
                .withPath("/tmp/foo/")
                .withExtension(".txt");

        RecordFormat format = new DelimitedRecordFormat()
                .withFieldDelimiter("|");

        Yaml yaml = new Yaml();
        InputStream in = new FileInputStream(args[1]);
        Map<String, Object> yamlConf = (Map<String, Object>) yaml.load(in);
        in.close();
        config.put("hdfs.config", yamlConf);

        HdfsBolt bolt = new HdfsBolt()
                .withConfigKey("hdfs.config")
                .withFsUrl(args[0])
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy)
                .addRotationAction(new 
MoveFileAction().toDestination("/tmp/dest2/"));
{code}

This is way too much.  If it were just an example showing all of the 
possibilities that would be OK but of the 8 lines used in the construction of 
the bolt, 5 of them are required or the bolt will blow up at run time.  We 
should provide reasonable defaults for everything that can have a reasonable 
default.  And required parameters should be passed in through the constructor, 
not as builder arguments.  I realize we need to maintain backwards 
compatibility so we may need some new Bolt definitions.

{code}
HdfsTSVBolt bolt = new HdfsTSVBolt(outputDir);
{code}

If someone wanted to sync every 100 records instead of every 1000 we could do

{code}
TSVFileBolt bolt = new TSVFileBolt(outputDir).withSyncPolicy(new 
CountSyncPolicy(100))
{code}

I would like to see a base HdfsFileBolt that requires a record format, and an 
output directory.  It would have defaults for everything else.  Then we could 
have a TSVFileBolt and CSVFileBolt subclass it and ideally SequenceFileBolt as 
well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to