The input is a hdfs file. I am trying to create a scenario where the mappers (root tasks) are necessarily not executed at the data location. So for now, I chose the Location Hint for the tasks in a random fashion. I figured by populating VertexLocation hint, with address of random nodes, I could achieve it.
This requires setting parallelism to be the number of elements in VertexLocation hint; which led to the errors. Summarizing, for the work count example, 1. Can the number of tasks for the tokenizer be a value *NOT* equal to the number of HDFS blocks of the file ? 2. Can a mapper be scheduled at a location different than the location of its input block ? If yes, how ? Raajay On Thu, Sep 10, 2015 at 12:30 AM, Jianfeng (Jeff) Zhang < [email protected]> wrote: > >>> In the WordCount example, while creating the Tokenizer Vertex, neither > the parallelism or VertexLocation hints is specified. My guess is that at > runtime, based on InputInitializer, these values are populated. > Correct, the parallelism and VertexLocation is specified at runtime by > InputInitializer > > >>> What should I do such that location of the tasks for the Tokenizer > vertex are not based on HDFS splits but can be arbitrarily configured while > creation ? > Do you mean your input is not hdfs file ? In that case I think you need > to create your own DataSourceDescriptor. You can refer the > DataSourceDescriptor that is used by WordCount example as following. If > possible, let us know more about your context. What kind of data is your > input ? And how would you specify the VertexLocation for your input ? > > DataSourceDescriptor dataSource = MRInput.createConfigBuilder(new > Configuration(tezConf), > > TextInputFormat.class, > inputPath).groupSplits(!isDisableSplitGrouping()).build(); > > > > Best Regard, > Jeff Zhang > > > From: Raajay <[email protected]> > Reply-To: "[email protected]" <[email protected]> > Date: Thursday, September 10, 2015 at 1:10 PM > To: "[email protected]" <[email protected]> > Subject: Re: Error of setting vertex location hints > > I am just getting started with understanding tez code, so bear with me; I > might be wrong here. > > In the WordCount example, while creating the Tokenizer Vertex, neither the > parallelism or VertexLocation hints is specified. My guess is that at > runtime, based on InputInitializer, these values are populated. > > However, I do not want them to be populated at runtime, but rather want > them specified while creating the DAG itself. When I do that, I get the > exception mentioned in the previous mail. > > What should I do such that location of the tasks for the Tokenizer vertex > are not based on HDFS splits but can be arbitrarily configured while > creation ? > > Raajay > > > > On Thu, Sep 10, 2015 at 12:01 AM, Jianfeng (Jeff) Zhang < > [email protected]> wrote: > >> >> Actually Tokenizer vertex should already have the VertexLocationHints >> from the hdfs file split info at runtime. Did you see any unexpected >> behavior ? >> >> >> >> Best Regard, >> Jeff Zhang >> >> >> From: Raajay <[email protected]> >> Reply-To: "[email protected]" <[email protected]> >> Date: Thursday, September 10, 2015 at 12:35 PM >> To: "[email protected]" <[email protected]> >> Subject: Error of setting vertex location hints >> >> In the WordCount example, I am trying to fix the location of map tasks by >> providing "VertexLocationHints" to the "tokenizer" vertex. >> >> However, the application fails with an exception (stacktrace below). I >> guess it is because, the vertex manager expects the parallelism to be -1, >> so that it can compute it. >> >> >> What minimal modification to the example would avoid invoking the >> VertexManager and allow me use my own customized VertexLocationHint ? >> >> >> Thanks >> Raajay >> >> >> >> DAG diagnostics: [Vertex failed, vertexName=Tokenizer, >> vertexId=vertex_1441839249749_0017_1_00, diagnostics=[Vertex >> vertex_1441839249749_0017_1_00 [Tokenizer] killed/failed due >> to:AM_USERCODE_FAILURE, Exception in VertexManager, >> vertex:vertex_1441839249749_0017_1_00 [Tokenizer], >> java.lang.IllegalStateException: Parallelism for the vertex should be set >> to -1 if the InputInitializer is setting parallelism, VertexName: Tokenizer >> at >> com.google.common.base.Preconditions.checkState(Preconditions.java:145) >> at >> org.apache.tez.dag.app.dag.impl.RootInputVertexManager.onRootVertexInitialized(RootInputVertexManager.java:60) >> at >> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventRootInputInitialized.invoke(VertexManager.java:610) >> at >> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:631) >> at >> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:626) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:415) >> at >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) >> at >> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:626) >> at >> org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:615) >> at java.util.concurrent.FutureTask.run(FutureTask.java:262) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >> at java.lang.Thread.run(Thread.java:745) >> ], Vertex killed, vertexName=Summation, >> vertexId=vertex_1441839249749_0017_1_01, diagnostics=[Vertex received Kill >> in INITED state., Vertex vertex_1441839249749_0017_1_01 [Summation] >> killed/failed due to:null], DAG did not succeed due to VERTEX_FAILURE. >> failedVertices:1 killedVertices:1] >> DAG did not succeed >> >> >
