[ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954737#comment-13954737
 ] 

Sergey commented on MAHOUT-1498:
--------------------------------

So I've replaced all occurrences of
{code}
DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
{code}
with
{code}
DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf);
{code}
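
To make the difference between the two calls concrete, here is a minimal, self-contained sketch. It is not Hadoop code: a plain Map stands in for the job Configuration, the class name CacheFilesSketch is made up for illustration, and the jar and dictionary URIs are hypothetical. It only assumes that the cache list is stored as one comma-separated string under "mapred.cache.files", as in the setCacheFiles snippet quoted in the issue description, and that the add variant appends to that string.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Illustration only: models "mapred.cache.files" as a comma-separated string in a Map. */
final class CacheFilesSketch {
    static final String KEY = "mapred.cache.files";

    /** Overwrite semantics: whatever was registered before is lost. */
    static void setCacheFiles(URI[] files, Map<String, String> conf) {
        StringBuilder joined = new StringBuilder();
        for (URI file : files) {
            if (joined.length() > 0) {
                joined.append(',');
            }
            joined.append(file);
        }
        conf.put(KEY, joined.toString());
    }

    /** Append semantics: entries that are already registered are kept. */
    static void addCacheFile(URI file, Map<String, String> conf) {
        String existing = conf.get(KEY);
        if (existing == null || existing.isEmpty()) {
            conf.put(KEY, file.toString());
        } else {
            conf.put(KEY, existing + "," + file);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URIs: a jar registered by the launcher and a dictionary chunk.
        URI jar = URI.create("hdfs://nn/lib/mahout-math.jar");
        URI dictionary = URI.create("hdfs://nn/tmp/dictionary.file-0");

        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, jar.toString());

        addCacheFile(dictionary, conf);
        System.out.println("after add: " + conf.get(KEY));

        setCacheFiles(new URI[] {dictionary}, conf);
        System.out.println("after set: " + conf.get(KEY));
    }
}
```

With append semantics the jar entry survives next to the dictionary file; with overwrite semantics only the dictionary file is left, which matches the ClassNotFoundException described in the issue.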

Now my jars are no longer thrown out of the distributed cache. These jars are
used in subsequent MR job submissions.
I've also modified several reducers. The reducers expected to find a single file
in the distributed cache. Here is an example:
{code}
//TFPartialVectorReducer
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        URI[] localFiles = DistributedCache.getCacheFiles(conf);
        Preconditions.checkArgument(localFiles != null && localFiles.length >= 1,
                "missing paths from the DistributedCache");

        dimension = conf.getInt(PartialVectorMerger.DIMENSION, Integer.MAX_VALUE);
        sequentialAccess = conf.getBoolean(PartialVectorMerger.SEQUENTIAL_ACCESS, false);
        namedVector = conf.getBoolean(PartialVectorMerger.NAMED_VECTOR, false);
        maxNGramSize = conf.getInt(DictionaryVectorizer.MAX_NGRAMS, maxNGramSize);

        //Path dictionaryFile = new Path(localFiles[0].getPath());
        Path dictionaryFile = getPathToDictionaryFile(localFiles);
        // key is word value is id
        for (Pair<Writable, IntWritable> record
                : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
            dictionary.put(record.getFirst().toString(), record.getSecond().get());
        }
    }

    // Picks the cache entry whose URI contains "dictionary.file";
    // falls back to the last entry if none matches.
    private Path getPathToDictionaryFile(URI[] localFiles) {
        for (URI distCacheFile : localFiles) {
            System.out.println("getPathToDictionaryFile ::: "
                    + (distCacheFile == null ? null : distCacheFile.toString()));
            if (distCacheFile != null && distCacheFile.toString().contains("dictionary.file")) {
                System.out.println("getPathToDictionaryFile ::: looks like ["
                        + distCacheFile + "] is a dictionary we need");
                return new Path(distCacheFile.getPath());
            }
        }
        URI lastUri = localFiles[localFiles.length - 1];
        System.out.println("getPathToDictionaryFile ::: didn't find dict file. "
                + "Trying to return the last one [" + lastUri.toString() + "]");
        return new Path(lastUri.getPath());
    }
{code}
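
The lookup above silently falls back to the last cache entry, which could be a jar. As a hedged alternative, here is a small standalone sketch of the same selection step that reports "not found" instead of guessing. The class name DictionaryLookupSketch and the method findDictionary are made up for illustration, and it works on plain java.net.URI values so it runs without Hadoop on the classpath. The "dictionary.file" marker is taken from the code above; matching on the URI path rather than on the whole URI string is an assumption of this sketch.

```java
import java.net.URI;
import java.util.Optional;

/** Illustration only: selects the dictionary entry from a list of cache URIs. */
final class DictionaryLookupSketch {
    static final String DICTIONARY_MARKER = "dictionary.file";

    /** Returns the first URI whose path contains the marker, or empty if there is none. */
    static Optional<URI> findDictionary(URI[] cacheFiles) {
        if (cacheFiles == null) {
            return Optional.empty();
        }
        for (URI candidate : cacheFiles) {
            if (candidate == null) {
                continue;
            }
            String path = candidate.getPath(); // null for opaque URIs
            if (path != null && path.contains(DICTIONARY_MARKER)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical cache contents: two jars and one dictionary chunk.
        URI[] cacheFiles = {
            URI.create("hdfs://nn/lib/mahout-math.jar"),
            URI.create("hdfs://nn/tmp/dictionary.file-0"),
            URI.create("hdfs://nn/lib/mahout-core.jar"),
        };
        System.out.println(findDictionary(cacheFiles));
    }
}
```

A reducer could then fail fast in setup() when the result is empty, rather than trying to read a jar as a sequence file.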

I'm not sure whether this is a good or a bad approach, but my Oozie action now
runs without any problems. Here is the workflow action:
{code}
<action name="run-mahout-item_info_catalog_category_id">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/staging/working/mahout/run-mahout-item_info_catalog_category_id/out"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>org.apache.mahout.vectorizer.SparseVectorsFromSequenceFilesDirtyHack</main-class>

            <arg>-Ddfs.blocksize=1m</arg>

            <arg>--input</arg>
            <arg>${nameNode}/staging/working/mahout/prepare-item_info_catalog_category_id/out</arg>

            <arg>--output</arg>
            <arg>${nameNode}/staging/working/mahout/run-mahout-item_info_catalog_category_id/out</arg>

            <arg>-ow</arg>

            <arg>-x</arg>
            <arg>70</arg>

            <arg>-ng</arg>
            <arg>4</arg>

            <arg>-n</arg>
            <arg>2</arg>

            <arg>-seq</arg>

            <arg>-wt</arg>
            <arg>TFIDF</arg>
        </java>
        <ok to="mahout-join-node"/>
        <error to="kill"/>
    </action>
{code}



> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -----------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1498
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1498
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: mahout-core-0.7-cdh4.4.0.jar
>            Reporter: Sergey
>
> Hi, I get an exception:
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class 
> [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw 
> exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> {code}
> It looks like this happens because of the
> DictionaryVectorizer.makePartialVectors method.
> It contains this code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overwrites the jars pushed with the job by Oozie:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>          String sfiles = StringUtils.uriToString(files);
>          conf.set("mapred.cache.files", sfiles);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
