[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1498:
---

Resolution: Fixed
  Assignee: Sebastian Schelter
Status: Resolved  (was: Patch Available)

Committed with a few cosmetic changes. Thank you for the contribution!

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
>Assignee: Sebastian Schelter
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1498.patch
>
>
> Hi, I get an exception:
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> {code}
> It looks like this is caused by the DictionaryVectorizer.makePartialVectors
> method, which contains:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> This overwrites the jars that Oozie pushed with the job, because setCacheFiles replaces the property value:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>  String sfiles = StringUtils.uriToString(files);
>  conf.set("mapred.cache.files", sfiles);
> }
> {code}
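
The overwrite described above can be reproduced without a cluster. The following is a minimal sketch, not Hadoop code: a plain Map stands in for Configuration, the class name and the hdfs:// paths are made up for illustration, and the append helper only imitates what an "add" style call would need to do. Hadoop's DistributedCache does offer addCacheFile(URI, Configuration) for appending a single entry; whether the committed fix uses it is not visible in this thread.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a Map plays the role of Hadoop's Configuration.
final class CacheFilesSketch {
    static final String KEY = "mapred.cache.files";

    // Mirrors the quoted setCacheFiles: the previous value is discarded.
    public static void setCacheFiles(URI[] files, Map<String, String> conf) {
        StringBuilder joined = new StringBuilder();
        for (URI file : files) {
            if (joined.length() > 0) {
                joined.append(',');
            }
            joined.append(file);
        }
        conf.put(KEY, joined.toString());
    }

    // Illustrative append variant: entries that are already registered survive.
    public static void addCacheFile(URI file, Map<String, String> conf) {
        String existing = conf.get(KEY);
        boolean empty = existing == null || existing.isEmpty();
        conf.put(KEY, empty ? file.toString() : existing + "," + file);
    }

    public static void main(String[] args) {
        // Assumed paths, chosen only for the demonstration.
        String oozieJar = "hdfs://nn/user/oozie/lib/mahout-math.jar";
        URI dictionary = URI.create("hdfs://nn/tmp/dictionary.file-0");

        Map<String, String> replaced = new LinkedHashMap<>();
        replaced.put(KEY, oozieJar);
        setCacheFiles(new URI[] {dictionary}, replaced);
        System.out.println("set: " + replaced.get(KEY)); // the jar entry is gone

        Map<String, String> appended = new LinkedHashMap<>();
        appended.put(KEY, oozieJar);
        addCacheFile(dictionary, appended);
        System.out.println("add: " + appended.get(KEY)); // jar and dictionary both present
    }
}
```

With the jar entry dropped from the property, task JVMs no longer see mahout-math on their classpath, which would match the ClassNotFoundException for org.apache.mahout.math.Vector in the report.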



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1498:
---

Status: Patch Available  (was: Open)

>From 4c95084229830b88f0bdec63b0967e5d6fceb58d Mon Sep 17 00:00:00 2001
From: seregasheypak 
Date: Sun, 18 May 2014 20:20:16 +0400
Subject: [PATCH] MAHOUT-1498 Do not reset app jars stored in
 DistributedCache. Now you can run it as oozie Java action.
 Just bundle dependent jars in workflow/lib folder.

---
 .../mahout/util/DistributedCacheFileLocator.java   |   47 
 .../mahout/vectorizer/DictionaryVectorizer.java|   21 -
 .../vectorizer/term/TFPartialVectorReducer.java|   14 +++---
 .../mahout/vectorizer/tfidf/TFIDFConverter.java|   11 +++--
 .../tfidf/TFIDFPartialVectorReducer.java   |6 ++-
 .../util/DistributedCacheFileLocatorTest.java  |   38 
 6 files changed, 109 insertions(+), 28 deletions(-)
 create mode 100644 mrlegacy/src/main/java/org/apache/mahout/util/DistributedCacheFileLocator.java
 create mode 100644 mrlegacy/src/test/java/org/apache/mahout/util/DistributedCacheFileLocatorTest.java

diff --git a/mrlegacy/src/main/java/org/apache/mahout/util/DistributedCacheFileLocator.java b/mrlegacy/src/main/java/org/apache/mahout/util/DistributedCacheFileLocator.java
new file mode 100644
index 000..8a59908
--- /dev/null
+++ b/mrlegacy/src/main/java/org/apache/mahout/util/DistributedCacheFileLocator.java
@@ -0,0 +1,47 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.util;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.net.URI;
+
+public class DistributedCacheFileLocator {
+
+    private static final Logger log = LoggerFactory.getLogger(DistributedCacheFileLocator.class);
+
+    /**
+     * Finds a file in Distributed cache
+     * @param aPartOfName is a substring in file name
+     * @param localFiles holds references to files stored in distributed cache
+     * @return Path instance to first matched file or null
+     * */
+    public Path findByContainsInName(String aPartOfName, URI[] localFiles){
+        for(URI distCacheFile : localFiles){
+            log.debug("find a file in distributed cache by part of name {}", aPartOfName);
+            if(distCacheFile!=null && distCacheFile.toString().contains(aPartOfName)){
+                log.debug("found a file [{}] using a part of name[{}]", distCacheFile.toString(), aPartOfName);
+                return new Path(distCacheFile.getPath());
+            }
+        }
+        return null;
+    }
+
+}
diff --git a/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java b/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
index 99ef019..64e5a67 100644
--- a/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
+++ b/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
@@ -17,11 +17,6 @@
 
 package org.apache.mahout.vectorizer;
 
-import java.io.IOException;
-import java.net.URI;
-import java.util.Collection;
-import java.util.List;
-
 import com.google.common.base.Preconditions;
 import com.google.common.collect.Lists;
 import com.google.common.io.Closeables;
@@ -29,11 +24,7 @@ import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.io.IntWritable;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.io.SequenceFile;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
@@ -59,6 +50,10 @@ import org.apache.mahout.vectorizer.term.TermCountReducer;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.io.IOException;
+import java.util.Collection;
+import java.util.List;
+
 /**
  * This class converts a set of input documents in th
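
The lookup added by DistributedCacheFileLocator in the patch above can be tried without Hadoop on the classpath. This is a rough sketch under stated assumptions: it returns the URI's path as a String instead of an org.apache.hadoop.fs.Path, leaves out the logging, and uses a made-up class name and made-up cache entries. The matching rule (first non-null URI whose string form contains the fragment, else null) follows the patch.

```java
import java.net.URI;

// Hypothetical Hadoop-free rendition of the patch's findByContainsInName.
final class CacheLocatorSketch {

    // Returns the path of the first non-null URI containing partOfName, or null.
    public static String findByContainsInName(String partOfName, URI[] cacheFiles) {
        for (URI cacheFile : cacheFiles) {
            if (cacheFile != null && cacheFile.toString().contains(partOfName)) {
                return cacheFile.getPath();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed cache contents: one jar pushed by the workflow, one dictionary chunk.
        URI[] cacheFiles = {
            URI.create("hdfs://nn/user/oozie/lib/mahout-math.jar"),
            null,
            URI.create("hdfs://nn/tmp/vectors/dictionary.file-0")
        };
        System.out.println(findByContainsInName("dictionary.file-", cacheFiles));
        System.out.println(findByContainsInName("missing", cacheFiles));
    }
}
```

Because the match runs against the whole URI string, a fragment that also appears in a directory name or in another cached file's name would match as well, so callers presumably need a fragment specific enough to single out the dictionary file.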

[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1498:
---

Status: Open  (was: Patch Available)

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
>  Labels: patch
> Fix For: 1.0
>
>


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1498:
---

Status: Open  (was: Patch Available)

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Fix For: 1.0
>
>


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1498:
---

Status: Patch Available  (was: Open)

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Fix For: 1.0
>
>


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1498:
---

Attachment: MAHOUT-1498.patch

PATCH

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1498.patch
>
>


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1498:
---

Labels: patch  (was: )
Status: Patch Available  (was: Open)


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1498:
---

Fix Version/s: 1.0

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Fix For: 1.0
>
>