[jira] [Updated] (BEAM-1491) HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) environment variable
[ https://issues.apache.org/jira/browse/BEAM-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated BEAM-1491: -- Summary: HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) environment variable (was: HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) environmen variable) > HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) > environment variable > - > > Key: BEAM-1491 > URL: https://issues.apache.org/jira/browse/BEAM-1491 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 0.6.0 >Reporter: yangping wu >Assignee: Jean-Baptiste Onofré > > Currently, if we want to read file store on HDFS, we will do it as follow: > {code} > PRead.Bounded> from = > Read.from(HDFSFileSource.from("hdfs://hadoopserver:8020/tmp/data.txt", > TextInputFormat.class, LongWritable.class, Text.class)); > PCollection > data = p.apply(from); > {code} > or > {code} > Configuration conf = new Configuration(); > conf.set("fs.default.name", "hdfs://hadoopserver:8020"); > PRead.Bounded > from = > Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, > LongWritable.class, Text.class).withConfiguration(conf)); > PCollection > data = p.apply(from); > {code} > As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the > file path > if we can initialize {{conf}} by reading > {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable, then we can read > HDFS file like this: > {code} > PRead.Bounded > from = > Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, > LongWritable.class, Text.class)); > PCollection > data = p.apply(from); > {code} > note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the > program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen, and > the program will read file from HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1856) HDFSFileSink class do not use the same configuration in master and slave
yangping wu created BEAM-1856: - Summary: HDFSFileSink class do not use the same configuration in master and slave Key: BEAM-1856 URL: https://issues.apache.org/jira/browse/BEAM-1856 Project: Beam Issue Type: Bug Components: sdk-java-core Affects Versions: 0.6.0 Reporter: yangping wu Assignee: Davor Bonaci I have a code snippet as follow: {code} Read.Bounded> from = Read.from(HDFSFileSource.from(options.getInputFile(), TextInputFormat.class, LongWritable.class, Text.class)); PCollection > data = p.apply(from); data.apply(MapElements.via(new SimpleFunction , String>() { @Override public String apply(KV input) { return input.getValue() + "\t" + input.getValue(); } })).apply(Write.to(HDFSFileSink.toText(options.getOutputFile(; {code} and submit job like this: {code} spark-submit --class org.apache.beam.examples.WordCountHDFS --master yarn-client \ ./target/word-count-beam-bundled-0.1.jar \ --runner=SparkRunner \ --inputFile=hdfs://master/tmp/input/ \ --outputFile=/tmp/output/ {code} Then {{HDFSFileSink.validate}} function will check whether the local filesystem (not HDFS) exists {{/tmp/output/}} directory. But the final result will store in {{hdfs://master/tmp/output/}} directory in HDFS filesystem. The reason is {{HDFSFileSink}} class do not use the same configuration in master thread and slave thread. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1491) HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) environmen variable
[ https://issues.apache.org/jira/browse/BEAM-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated BEAM-1491: -- Affects Version/s: (was: 0.5.0) 0.6.0 > HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) > environmen variable > > > Key: BEAM-1491 > URL: https://issues.apache.org/jira/browse/BEAM-1491 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 0.6.0 >Reporter: yangping wu >Assignee: Jean-Baptiste Onofré > > Currently, if we want to read file store on HDFS, we will do it as follow: > {code} > PRead.Bounded> from = > Read.from(HDFSFileSource.from("hdfs://hadoopserver:8020/tmp/data.txt", > TextInputFormat.class, LongWritable.class, Text.class)); > PCollection > data = p.apply(from); > {code} > or > {code} > Configuration conf = new Configuration(); > conf.set("fs.default.name", "hdfs://hadoopserver:8020"); > PRead.Bounded > from = > Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, > LongWritable.class, Text.class).withConfiguration(conf)); > PCollection > data = p.apply(from); > {code} > As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the > file path > if we can initialize {{conf}} by reading > {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable, then we can read > HDFS file like this: > {code} > PRead.Bounded > from = > Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, > LongWritable.class, Text.class)); > PCollection > data = p.apply(from); > {code} > note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the > program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen, and > the program will read file from HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1491) HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) environmen variable
[ https://issues.apache.org/jira/browse/BEAM-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated BEAM-1491: -- Description: Currently, if we want to read file store on HDFS, we will do it as follow: {code} PRead.Bounded> from = Read.from(HDFSFileSource.from("hdfs://hadoopserver:8020/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); PCollection > data = p.apply(from); {code} or {code} Configuration conf = new Configuration(); conf.set("fs.default.name", "hdfs://hadoopserver:8020"); PRead.Bounded > from = Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class).withConfiguration(conf)); PCollection > data = p.apply(from); {code} As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the file path if we can initialize {{conf}} by reading {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable, then we can read HDFS file like this: {code} PRead.Bounded > from = Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); PCollection > data = p.apply(from); {code} note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen, and the program will read file from HDFS. was: Currently, if we want to read file store on HDFS, we will do it as follow: {code} PCollection > resultCollection = p.apply(HDFSFileSource.readFrom( "hdfs://hadoopserver:8020/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the file path, and we cann't set any variables when read file, because in [HDFSFileSource.java|https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java#L310] we initialize {{job}} instance as follow: {code} this.job = Job.getInstance(); {code} we should initialize {{job}} instance by configure: {code} this.job = Job.getInstance(conf); {code} where {{conf}} is instance of {{Configuration}}, and we initialize {{conf}} by reading {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable,then we can read HDFS file like this: {code} PCollection > resultCollection = p.apply(HDFSFileSource.readFrom( "/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen. > HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) > environmen variable > > > Key: BEAM-1491 > URL: https://issues.apache.org/jira/browse/BEAM-1491 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 0.5.0 >Reporter: yangping wu >Assignee: Jean-Baptiste Onofré > > Currently, if we want to read file store on HDFS, we will do it as follow: > {code} > PRead.Bounded > from = > Read.from(HDFSFileSource.from("hdfs://hadoopserver:8020/tmp/data.txt", > TextInputFormat.class, LongWritable.class, Text.class)); > PCollection > data = p.apply(from); > {code} > or > {code} > Configuration conf = new Configuration(); > conf.set("fs.default.name", "hdfs://hadoopserver:8020"); > PRead.Bounded > from = > Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, > LongWritable.class, Text.class).withConfiguration(conf)); > PCollection > data = p.apply(from); > {code} > As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the > file path > if we can initialize {{conf}} by reading > {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable, then we can read > HDFS file like this: > {code} > PRead.Bounded > from = > Read.from(HDFSFileSource.from("/tmp/data.txt", TextInputFormat.class, > LongWritable.class, Text.class)); > PCollection > data = p.apply(from); > {code} > note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the > program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen, and > the program will read file from HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1491) HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) environmen variable
[ https://issues.apache.org/jira/browse/BEAM-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated BEAM-1491: -- Description: Currently, if we want to read file store on HDFS, we will do it as follow: {code} PCollection> resultCollection = p.apply(HDFSFileSource.readFrom( "hdfs://hadoopserver:8020/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the file path, and we cann't set any variables when read file, because in [HDFSFileSource.java|https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java#L310] we initialize {{job}} instance as follow: {code} this.job = Job.getInstance(); {code} we should initialize {{job}} instance by configure: {code} this.job = Job.getInstance(conf); {code} where {{conf}} is instance of {{Configuration}}, and we initialize {{conf}} by reading {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable,then we can read HDFS file like this: {code} PCollection > resultCollection = p.apply(HDFSFileSource.readFrom( "/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen. was: Currently, if we want to read file store on HDFS, we will do it as follow: {code} PCollection > resultCollection = p.apply(HDFSFileSource.readFrom( "hdfs://hadoopserver:8020/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the file path, and we cann't set any variables when read file, because in [HDFSFileSource.java|https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java#L310] we initialize {{job}} instance as follow: {code} this.job = Job.getInstance(); {code} we should initialize {{job}} instance by configure: {code} this.job = Job.getInstance(conf); {code} where {{conf}} is instance of {{Configuration}}, and we initialize {{conf}} by reading {{HADOOP_CONF}}({{YARN_CONF}}) environmen variable,then we can read HDFS file like this: {code} PCollection > resultCollection = p.apply(HDFSFileSource.readFrom( "/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the program read it from {{HADOOP_CONF}}({{YARN_CONF}}) environmen. > HDFSFileSource should be able to read the HADOOP_CONF_DIR(YARN_CONF_DIR) > environmen variable > > > Key: BEAM-1491 > URL: https://issues.apache.org/jira/browse/BEAM-1491 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Affects Versions: 0.5.0 >Reporter: yangping wu >Assignee: Jean-Baptiste Onofré > > Currently, if we want to read file store on HDFS, we will do it as follow: > {code} PCollection > resultCollection = > p.apply(HDFSFileSource.readFrom( > "hdfs://hadoopserver:8020/tmp/data.txt", > TextInputFormat.class, LongWritable.class, Text.class)); > {code} > As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the > file path, and we cann't set any variables when read file, because in > [HDFSFileSource.java|https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java#L310] > we initialize {{job}} instance as follow: > {code} > this.job = Job.getInstance(); > {code} > we should initialize {{job}} instance by configure: > {code} > this.job = Job.getInstance(conf); > {code} > where {{conf}} is instance of {{Configuration}}, and we initialize {{conf}} > by reading {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen variable,then we > can read HDFS file like this: > {code} PCollection > resultCollection = > p.apply(HDFSFileSource.readFrom( > "/tmp/data.txt", > TextInputFormat.class, LongWritable.class, Text.class)); > {code} > note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the > program read it from {{HADOOP_CONF_DIR}}({{YARN_CONF_DIR}}) environmen. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1491) HDFSFileSource should be able to read the HADOOP_CONF(YARN_CONF) environmen variable
yangping wu created BEAM-1491: - Summary: HDFSFileSource should be able to read the HADOOP_CONF(YARN_CONF) environmen variable Key: BEAM-1491 URL: https://issues.apache.org/jira/browse/BEAM-1491 Project: Beam Issue Type: Improvement Components: sdk-java-core Affects Versions: 0.5.0 Reporter: yangping wu Assignee: Davor Bonaci Currently, if we want to read file store on HDFS, we will do it as follow: {code} PCollection> resultCollection = p.apply(HDFSFileSource.readFrom( "hdfs://hadoopserver:8020/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} As we have seen above, we must be set {{hdfs://hadoopserver:8020}} in the file path, and we cann't set any variables when read file, because in [HDFSFileSource.java|https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java#L310] we initialize {{job}} instance as follow: {code} this.job = Job.getInstance(); {code} we should initialize {{job}} instance by configure: {code} this.job = Job.getInstance(conf); {code} where {{conf}} is instance of {{Configuration}}, and we initialize {{conf}} by reading {{HADOOP_CONF}}({{YARN_CONF}}) environmen variable,then we can read HDFS file like this: {code} PCollection > resultCollection = p.apply(HDFSFileSource.readFrom( "/tmp/data.txt", TextInputFormat.class, LongWritable.class, Text.class)); {code} note we don't specify {{hdfs://hadoopserver:8020}} prefix, because the program read it from {{HADOOP_CONF}}({{YARN_CONF}}) environmen. -- This message was sent by Atlassian JIRA (v6.3.15#6346)