Hi Jonathan,
Thank you for the response. This is very useful.
Using your configuration I am able to execute the Tez examples without any
problems. The issue arises when I attempt to run Nutch. No matter what I've
tried, the dependencies for Nutch are never found.
I've tried building a binary .tar.gz distribution of Nutch and referencing its
URI on HDFS... this does not work and I get ClassNotFoundException errors. I've
also tried referencing the Nutch .job artifact, which contains all
dependencies... this does not work either.
Just to confirm, I can successfully execute all Nutch jobs when the
'mapreduce.framework.name' value is set to 'yarn'. We execute the jobs as
follows:
hadoop jar ${NUTCH.job} $CLASS $arguments
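For anyone trying to reproduce this, the invocation above can be sketched as a small dry-run script. It assumes that the Tez path is selected by setting 'mapreduce.framework.name' to 'yarn-tez' as described in the Tez install guide, that the tool's main class goes through ToolRunner so the generic -D option is honoured, and that the .job file sits in the current directory. The function name nutch_cmd and the NUTCH_JOB variable are placeholders of mine, not part of Nutch.

```shell
#!/usr/bin/env bash
# Sketch only. NUTCH_JOB and nutch_cmd are illustrative placeholders.
NUTCH_JOB=${NUTCH_JOB:-apache-nutch-1.18-SNAPSHOT.job}

# Print (do not run) the hadoop command for a given framework and tool class.
# Usage: nutch_cmd <framework> <class> [tool arguments...]
nutch_cmd() {
  local framework=$1 class=$2
  shift 2
  # Generic -D options must come before the tool's own arguments.
  echo hadoop jar "$NUTCH_JOB" "$class" -D "mapreduce.framework.name=$framework" "$@"
}

# Known-good baseline on plain YARN, then the same job on Tez (assumed value).
nutch_cmd yarn org.apache.nutch.crawl.Injector crawldb urls
nutch_cmd yarn-tez org.apache.nutch.crawl.Injector crawldb urls
```

Printing both command lines side by side makes it easy to confirm that the only difference between the working and failing runs is the framework name.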
I feel like I am very close to getting this running. Could someone on
this list try running a job to see whether they can reproduce the
problem? I've uploaded the compiled .job and the nutch bash script to
https://drive.google.com/drive/folders/1yjGi8UWVZithcYWLgUINm9v6IU2Scmy5?usp=sharing
You can execute the Injector tool by running
./nutch inject crawldb urls
assuming that urls is a directory on HDFS containing a simple text file with
one URL entry, e.g. http://tez.apache.org
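In case it helps with reproducing, here is a minimal sketch of preparing that urls directory. It assumes the hdfs client is on the PATH and that relative HDFS paths resolve to the submitting user's home directory. The seed file name seed.txt and the DRY_RUN switch are placeholders of mine. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only. With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Local seed file with a single URL entry (seed.txt is an illustrative name).
seed_dir=$(mktemp -d)
printf '%s\n' 'http://tez.apache.org' > "$seed_dir/seed.txt"

# Create the urls directory on HDFS and upload the seed file into it.
run hdfs dfs -mkdir -p urls
run hdfs dfs -put -f "$seed_dir/seed.txt" urls/

# Run the Injector against it, as described above.
run ./nutch inject crawldb urls
```

Setting DRY_RUN=0 executes the commands for real, from the directory that contains the nutch script.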
Again, thank you all for any further direction. I am really keen to get
Nutch running on Tez.
lewismc
On 2020/12/17 18:09:02, Jonathan Eagles <[email protected]> wrote:
> This is what I use in production, and it has several benefits. In this case
> mapreduce.application.framework.path is the runtime classpath tar.gz file,
> a custom-built MapReduce runtime environment, perhaps similar to Nutch.
> 1) localizing one tar.gz file instead of many individual jars
> 2) minimal jar has fewer class conflicts and a smaller footprint
> 3) localizing Tez to a tez folder (#tez) allows better control of the
> classpath and avoids Java's inconsistent classpath resolution of jars in the
> same directory
> 4) setting tez.use.cluster.hadoop-libs to false avoids using the jars from the
> individual NodeManagers and relies only on jars listed in tez.lib.uris
>
> <property>
> <name>mapreduce.application.framework.path</name>
>
> <value>/hdfs/path/hadoop-mapreduce-${mapreduce.application.framework.version}.tgz#hadoop-mapreduce</value>
> </property>
>
> <property>
> <name>tez.lib.uris</name>
>
> <value>/hdfs/path/tez-0.9.2-minimal.tar.gz#tez,${mapreduce.application.framework.path}</value>
> </property>
> <property>
> <name>tez.lib.uris.classpath</name>
> <value>${mapreduce.application.classpath},./tez/*,./tez/lib/*</value>
> </property>
> <property>
> <name>tez.use.cluster.hadoop-libs</name>
> <value>false</value>
> </property>
>
> On Thu, Dec 17, 2020 at 11:57 AM Lewis John McGibbney <[email protected]>
> wrote:
>
> > I tried the following configuration in tez-site.xml with no luck
> >
> > <configuration>
> > <property>
> > <name>tez.lib.uris</name>
> >
> > <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > </property>
> >
> > <property>
> > <name>tez.lib.uris.classpath</name>
> > <value>${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > </property>
> > </configuration>
> >
> > On 2020/12/17 17:35:28, Lewis John McGibbney <[email protected]> wrote:
> > > Hi Zhiyuan,
> > > Thanks for the guidance. I'm making progress but I am still battling
> > initial configuration management issues.
> > > I'm running HDFS and YARN v3.1.4 in pseudo-distributed mode.
> > > My tez-site.xml contains the following content
> > >
> > > <configuration>
> > > <property>
> > > <name>tez.lib.uris</name>
> > > <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch</value>
> > > </property>
> > > </configuration>
> > >
> > > N.B. When I attempted to use the compressed Tez tar.gz, I was running
> > into classpath issues which are largely documented in the installation
> > documentation you pointed me to. I overcame these issues by simply
> > uploading the minimal directory. All seems fine at this stage as I can run
> > all of the Tez examples.
> > >
> > > I run into trouble when I try to run any job from the Nutch application.
> > For example, when I run the Injector, one of the Nutch plugin extension
> > points (x point org.apache.nutch.net.URLNormalizer) cannot be found.
> > The relevant log can be seen at https://paste.apache.org/4whoe.
> > > I should note that the entire Nutch .job is available on HDFS at the URI
> > defined in the tez-site.xml above.
> > >
> > > The output of jar -tf on the nutch.job artifact can be seen at
> > https://paste.apache.org/hl8tk.
> > > Am I required to somehow describe the structural hierarchy of this
> > artifact in the tez.lib.uris.classpath configuration property?
> > >
> > > Thank you again for any guidance.
> > >
> > > lewismc
> > >
> > > On 2020/12/14 03:23:48, Zhiyuan Yang <[email protected]> wrote:
> > > > Hi Lewis,
> > > >
> > > > If there is no incompatibility, your existing job will run well on Tez
> > > > without code change. You can just follow this guide
> > > > <https://tez.apache.org/install.html> (especially step 4) to try it
> > out.
> > > >
> > > > Thanks,
> > > > Zhiyuan
> > > >
> > > > On Mon, Dec 14, 2020 at 9:04 AM Lewis John McGibbney <
> > [email protected]>
> > > > wrote:
> > > >