[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002807#comment-14002807
 ] 

Saikat Kanjilal commented on MAHOUT-1490:
-

Dmitriy,
I am not able to push my code changes to your repo; it says I don't have 
permissions. Can you add me to the list of contributors? I've introduced 4 new 
classes for the 4 data types (String/Integer/Double/Long) that extend the 
DataFrameLike trait and contain an Unsafe data type as an internal variable. I 
need to push these changes, so let me know when the permissions are set or 
what the issue may be.

Thanks in advance.

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-19 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002743#comment-14002743
 ] 

Anand Avati commented on MAHOUT-1529:
-

[~dlyubimov], I had a quick look at the commits, and the separation looks a lot 
cleaner now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The 
only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there 
some way we can restrict DrmLike to just Int and Double, or fixate on just 
Double? While RDD supports arbitrary T, H2O supports only numeric types, which 
is sufficient for Mahout's needs. (A sketch of one possible restriction 
follows this list.)

- I am toying around with the new separation, to build a pure, from-scratch 
local/in-memory "backend" which communicates through Java serialization over a 
ByteArrayStream. I am hoping this will not only serve as a reference for 
future backend implementors, but also help keep the algorithms' test cases 
inside math-scala. Thoughts?
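
To make the first point concrete, here is a minimal sketch of how such a 
restriction could look with a sealed type class (DrmKey and drmWrap are 
made-up names for illustration, not Mahout API):
{code}
// Sealed evidence type: only Int and Double keys get instances.
sealed trait DrmKey[K]
object DrmKey {
  implicit object IntKey extends DrmKey[Int]
  implicit object DoubleKey extends DrmKey[Double]
}

trait DrmLike[K] {
  def nrow: Long
  def ncol: Int
}

// Entry points demand the evidence, so unsupported key types fail to compile.
def drmWrap[K: DrmKey](nrow: Long, ncol: Int): DrmLike[K] = ???

// drmWrap[Int](10, 10)     // compiles
// drmWrap[String](10, 10)  // compile error: no implicit DrmKey[String]
{code}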

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002658#comment-14002658
 ] 

Dmitriy Lyubimov commented on MAHOUT-1542:
--

Done in staging, but for some reason it doesn't publish the site for me. CMS 
infra problems again, perhaps. Staging looks fine.

> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have created a tutorial for setting up the Spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website, could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002651#comment-14002651
 ] 

Dmitriy Lyubimov commented on MAHOUT-1529:
--

OK, I started nudging this a bit forward and did a couple of fairly drastic 
refactorings, moving API parts to math-scala. math-scala should compile. 
Decompositions are moved too.

Things left include moving package-level routines requiring implicit context; 
fixing the spark and spark-shell modules; and moving tests where appropriate. 

With tests, a little conundrum is that we don't have a "local" engine -- we 
would use "Spark local" for that, i.e. some concrete engine. So even though 
the decomposition code now lives completely in math-scala with no Spark 
dependencies, it looks like its tests will still have to live in the spark 
module, where unit testing in local Spark mode is defined. That kind of makes 
sense, since we will probably want to run MathSuite separately for each engine 
we add; but it is a bit weird since it keeps something like ssvd() and its 
engine-specific tests apart.

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1544) make Mahout DSL shell depend dynamically on Spark

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1544:
-

Resolution: Won't Fix
  Assignee: Dmitriy Lyubimov
Status: Resolved  (was: Patch Available)

> make Mahout DSL shell depend dynamically on Spark
> -
>
> Key: MAHOUT-1544
> URL: https://issues.apache.org/jira/browse/MAHOUT-1544
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Anand Avati
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Attachments: 0001-spark-shell-rename-to-shell.patch, 
> 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 
> 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 
> 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch
>
>
> Today Mahout's Scala shell depends on Spark.
> Create a cleaner separation between the shell and Spark. For example, the 
> in-core scalabindings and operators do not need Spark, so make Spark a runtime 
> "addon" to the shell. Similarly, in the future, new distributed backend engines 
> can transparently (dynamically) be made available through the DSL shell.
> The new shell works, looks and feels exactly like the shell before, but has a 
> cleaner modular architecture.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1544) make Mahout DSL shell depend dynamically on Spark

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002649#comment-14002649
 ] 

Dmitriy Lyubimov commented on MAHOUT-1544:
--

I would suggest closing it, given the uncertainty about the progress.

> make Mahout DSL shell depend dynamically on Spark
> -
>
> Key: MAHOUT-1544
> URL: https://issues.apache.org/jira/browse/MAHOUT-1544
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Anand Avati
> Fix For: 1.0
>
> Attachments: 0001-spark-shell-rename-to-shell.patch, 
> 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 
> 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 
> 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch
>
>
> Today Mahout's Scala shell depends on Spark.
> Create a cleaner separation between the shell and Spark. For example, the 
> in-core scalabindings and operators do not need Spark, so make Spark a runtime 
> "addon" to the shell. Similarly, in the future, new distributed backend engines 
> can transparently (dynamically) be made available through the DSL shell.
> The new shell works, looks and feels exactly like the shell before, but has a 
> cleaner modular architecture.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-19 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002587#comment-14002587
 ] 

Anand Avati commented on MAHOUT-1490:
-

[~dlyubimov] I will try my best to explain what's happening. Disclaimer: I was 
neither the designer nor the implementer, and only have an understanding of 
how it works from reading the source and asking questions.

It first helps to understand, as background, how data is laid out in an H2O 
cloud. A huge matrix (hundreds of thousands of columns by billions of rows) of 
numbers (a Frame) can be imagined as consisting of columns (Vectors). A Vector 
(column) is sliced into chunks, and entire chunks are hashed/distributed 
across the cluster. Chunks which hold elements of a given row of the Frame 
(matrix) are guaranteed to be on the same server (i.e. "similarly partitioned" 
in Spark lingo). A chunk is typically a few MBs in size, i.e. expect it to 
store a few hundred thousand to a few million adjacent elements of a given 
column. The reason for such a "columnar" orientation is that elements in a 
column are expected to be "similar" (compared to elements in a row), and 
therefore better compression can be applied. The size of a Chunk should be 
large enough to make compression meaningful (e.g. a chunk of 8 or 16 elements 
is too small for compression) and small enough that there is not "too much 
variance" (see next).

Given this background, it now makes sense to see how chunk compression works. 
Compression of each Chunk is an independent process, and a different algorithm 
may be used to compress different Chunks, even if two Chunks belong to the 
same Vector/Column. The choice of compression algorithm is determined by 
inspecting the elements after they are all made available. For example, when a 
new large matrix is being read from disk, the elements are first read into a 
datatype called "NewChunk". You can only "append" elements in the NewChunk 
phase. Once a NewChunk is filled with enough elements, it is compressed into a 
Chunk. A Chunk itself is an abstract class. Based on the compression 
algorithm, there are many concrete implementations of Chunk (16 different 
compression algorithms/implementations as of now), available as the 
C*Chunk.java files in 
https://github.com/0xdata/h2o/tree/master/src/main/java/water/fvec. 
NewChunk.compress() (the link shared above) is the method which converts the 
inflated NewChunk into the most appropriate compressed Chunk; the selection is 
made by inspecting all the elements.

The various strategies of compression include things like (see the sketch 
after this list):
- Biasing: For example, if all elements in the Chunk are values between 
76861433640456465 and 76861433640456480, they can actually be represented as 
single bytes (i.e. a data type which can hold Xmax - Xmin) along with the bias 
base recorded in the Chunk header.
- Exponent scaling: For example, convert the set {1.2, 23, 0.34} by 
multiplying by 100 into {120, 2300, 34}, which can now be represented as 
2-byte (short) ints instead of floats.
- Counting the "types" of elements:
  -- if there are just two distinct values, no matter what they are (e.g. only 
23 and 37), the column can still be represented as a boolean bitvector
  -- if there is just one value throughout (e.g. only 129), then no 
per-element memory is consumed.
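
A minimal sketch of the bias-and-width selection idea (illustrative only, not 
H2O's actual code):
{code}
// Pick the narrowest per-element width that can hold (max - min), storing
// values as offsets from the minimum ("bias").
def pickWidth(xs: Array[Long]): (Long, Int) = {
  val bias  = xs.min
  val range = xs.max - bias
  val bytesPerElement =
    if (range == 0L) 0                 // constant column: no per-element data
    else if (range <= 0xFFL) 1         // e.g. the 76861433640456465..80 case
    else if (range <= 0xFFFFL) 2
    else if (range <= 0xFFFFFFFFL) 4
    else 8
  (bias, bytesPerElement)
}
{code}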

Combinations of techniques such as the above result in the 16 Chunk 
implementations (named by how the data is internally "viewed"):

- C0DChunk: All NAs/Strings/etc (trivial case)
- CXDChunk - Sparse doubles (floating point)
- C0LChunk - Constant column of longs
- C0DChunk - Constant column of doubles
- CX0Chunk - Sparse Boolean Bitvector without NAs
- CXIChunk - Sparse 1-byte (with NAs)
- CBSChunk - Dense Boolean Bitvector
- C1SChunk - Scaled to 1-byte (with bias)
- C2SChunk - Scaled to 2-byte (with bias)
- C4SChunk - Scaled to 4-byte (with bias)
- C1NChunk - Scaled to 1-byte (data readily fit into unsigned bytes)
- C1Chunk - Scaled to 1-byte (with NAs)
- C8Chunk - Scale to 8-byte (no bias - data readily fit)
- C2Chunk - Scale to 2-byte (no bias - data readily fit)
- C4Chunk - Scale to 4-byte (no bias - data readily fit)

And inflation involves at worst simple scaling, biasing, and a binary lookup - 
typically only a subset of those. These operations happen with no extra 
loads/stores, entirely in registers. Reading compressed floats and doubles is 
done through unsafe get()s. All this results in only compressed data 
travelling over the memory bus; the data expands in the core. For operations 
like adding all elements of a matrix, the job is typically memory-bandwidth 
bound.
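
To illustrate the read path, a hedged sketch of what reading a biased 1-byte 
chunk amounts to (hypothetical class, not the real C1SChunk):
{code}
// Decompression per element is a mask and an add, done in registers while
// streaming over the compressed bytes -- hence memory-bandwidth-bound sums.
final class BiasedByteChunk(bytes: Array[Byte], bias: Long) {
  def at(i: Int): Long = bias + (bytes(i) & 0xFF) // unsigned byte + bias
  def sum: Long = {
    var s = 0L
    var i = 0
    while (i < bytes.length) { s += at(i); i += 1 }
    s
  }
}
{code}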

As a side effect of this type of compression, set() can sometimes be 
expensive. If the new value passed to set() is "compatible" with the existing 
compression (i.e. it fits within the scale/bias and does not convert a sparse 
location to a filled location, etc.), it is not very expensive. But if the new 
value does not "fit", then the whole chunk is inflated back into a NewChunk 
and re-evaluated with the compression selection described above.

Re: [jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-19 Thread Pat Ferrel
Very nice!

On May 19, 2014, at 1:01 PM, Sebastian Schelter (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002271#comment-14002271
 ] 

Sebastian Schelter commented on MAHOUT-1542:


No, go ahead, that's a great idea.

> Tutorial for playing with Mahout's Spark shell
> --
> 
>Key: MAHOUT-1542
>URL: https://issues.apache.org/jira/browse/MAHOUT-1542
>Project: Mahout
> Issue Type: Improvement
> Components: Documentation, Math
>   Reporter: Sebastian Schelter
>   Assignee: Sebastian Schelter
>Fix For: 1.0
> 
> 
> I have created a tutorial for setting up the Spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website, could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



[jira] [Commented] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action

2014-05-19 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002405#comment-14002405
 ] 

Suneel Marthi commented on MAHOUT-1553:
---

Sergey, these issues have been addressed in Mahout 0.8 (can't recall the JIRA 
off the top of my head that addresses this).  0.7 is not supported anymore.

> Fix for run Mahout stuff as oozie java action
> -
>
> Key: MAHOUT-1553
> URL: https://issues.apache.org/jira/browse/MAHOUT-1553
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Attachments: MAHOUT-1553.patch
>
>
> Related to MAHOUT-1498, the problem is the same. mapred.job.classpath.files 
> property is not correctly pushed down to Mahout MR stuff because of new 
> Configuration usage
> at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action

2014-05-19 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1553:
---

Status: Patch Available  (was: Open)

Hm... looks like the 0.7 problems have been fixed and CanopyDriver is marked 
as deprecated. Anyway, feel free to reject the patch. 
My point is that I don't see a reason to instantiate a new Configuration() in 
a class which extends org.apache.hadoop.conf.Configured; Configured already 
provides access to the Configuration object.
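
A sketch of the point, assuming a Tool-style driver (hypothetical class, not 
the actual ClusterClassificationDriver):
{code}
import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.util.Tool

abstract class SomeDriver extends Configured with Tool {
  def classifyClusterMR(): Unit = {
    // val conf = new Configuration() // would drop caller-injected settings
    //                                // such as mapred.job.classpath.files
    val conf: Configuration = getConf // reuses the Configuration set by the
                                      // caller (e.g. ToolRunner / oozie)
    // ... configure and submit the MR job with `conf` ...
  }
}
{code}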

> Fix for run Mahout stuff as oozie java action
> -
>
> Key: MAHOUT-1553
> URL: https://issues.apache.org/jira/browse/MAHOUT-1553
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Attachments: MAHOUT-1553.patch
>
>
> Related to MAHOUT-1498, the problem is the same. mapred.job.classpath.files 
> property is not correctly pushed down to Mahout MR stuff because of new 
> Configuration usage
> at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action

2014-05-19 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated MAHOUT-1553:
---

Attachment: MAHOUT-1553.patch

> Fix for run Mahout stuff as oozie java action
> -
>
> Key: MAHOUT-1553
> URL: https://issues.apache.org/jira/browse/MAHOUT-1553
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Attachments: MAHOUT-1553.patch
>
>
> Related to MAHOUT-1498, the problem is the same. mapred.job.classpath.files 
> property is not correctly pushed down to Mahout MR stuff because of new 
> Configuration usage
> at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action

2014-05-19 Thread Sergey (JIRA)
Sergey created MAHOUT-1553:
--

 Summary: Fix for run Mahout stuff as oozie java action
 Key: MAHOUT-1553
 URL: https://issues.apache.org/jira/browse/MAHOUT-1553
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.7
 Environment: mahout-core-0.7-cdh4.4.0.jar
Reporter: Sergey


Related to MAHOUT-1498, the problem is the same. mapred.job.classpath.files 
property is not correctly pushed down to Mahout MR stuff because of new 
Configuration usage

at 
org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at 
org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-19 Thread Sergey (JIRA)
Sergey created MAHOUT-1552:
--

 Summary: Avoid new Configuration() instantiation
 Key: MAHOUT-1552
 URL: https://issues.apache.org/jira/browse/MAHOUT-1552
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.7
 Environment: CDH 4.4, CDH 4.6
Reporter: Sergey


Hi, it's related to MAHOUT-1498.
You run into trouble when running Mahout jobs from an Oozie Java action.
{code}
java.lang.InterruptedException: Cluster Classification Driver Job failed 
processing /tmp/sku/tfidf/90453
at 
org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at 
org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)  
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-19 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002271#comment-14002271
 ] 

Sebastian Schelter commented on MAHOUT-1542:


No, go ahead, that's a great idea.

> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have created a tutorial for setting up the Spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website, could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002266#comment-14002266
 ] 

Dmitriy Lyubimov commented on MAHOUT-1542:
--

[~ssc] do you mind if I rewrite the math symbols in latex/mathjax?

> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have created a tutorial for setting up the Spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website, could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-19 Thread Ted Dunning
On Mon, May 19, 2014 at 11:08 AM, Dmitriy Lyubimov (JIRA)
wrote:

> [~avati] do you think you could perhaps explain (or reference principled
> foundation publication) of the algorithm that is happening here?


One of the most commonly effective compression techniques is dictionary +
run-length.  For instance, the binary matrices that much of our software uses
would compress massively under these techniques.

For instance, a binary vector with 1 million elements and 0.01% sparsity would
compress to less than about 200 bytes using these techniques and a very naive
implementation.  Our current sparse representation requires about 1200 bytes.
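
A back-of-the-envelope sketch of that estimate (hypothetical encoder, just to 
check the arithmetic):
{code}
// Delta-encode the positions of set bits, storing each gap as a
// variable-length integer (1-5 bytes, 7 payload bits per byte).
def rleBytes(setBits: Seq[Int]): Int = {
  val gaps = setBits.zip(0 +: setBits).map { case (cur, prev) => cur - prev }
  gaps.map { g =>
    if (g < (1 << 7)) 1
    else if (g < (1 << 14)) 2
    else if (g < (1 << 21)) 3
    else if (g < (1 << 28)) 4
    else 5
  }.sum
}
// 100 set bits spread over 1,000,000 elements => average gap ~10,000,
// i.e. 2 bytes per gap => ~200 bytes total, versus ~12 bytes per entry
// (~1200 bytes) for an index+value sparse representation.
{code}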


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002082#comment-14002082
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--


bq.  Compression is implemented in 
https://github.com/0xdata/h2o/blob/master/src/main/java/water/fvec/NewChunk.java#L379.
 

[~avati] do you think you could perhaps explain (or reference principled 
foundation publication) of the algorithm that is happening here?

 thanks.

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: hosted ScalaDocs?

2014-05-19 Thread Dmitriy Lyubimov
It would seem Spark just publishes the release scaladocs, perhaps manually.



On Sun, May 18, 2014 at 11:01 AM, Pat Ferrel  wrote:

> Yeah, it can be done with maven to create a “javadoc” jar. I was just
> trying to get IDEA to do it and can’t make that work so was looking for
> some other examples. We may have a missing dependency. I’ll try to track
> that down first. Most of the googling explained how to do it with sbt. I’ll
> go dig into Spark a bit more.
>
> On May 17, 2014, at 8:11 PM, Dmitriy Lyubimov  wrote:
>
> Heh. Good question... scaladoc is the javadoc for Scala. I don't know, so
> unless it is automagically detected by Jenkins or whatever system is used
> to publish them, the answer is probably no. I am fairly sure I have set up
> maven to generate those, as this is something I always do for Scala code,
> but this is probably yet another area where we could look at what Spark does
> within the Apache infrastructure.
> On May 17, 2014 11:10 AM, "Pat Ferrel"  wrote:
>
> > Is there a scaladoc server for the mahout Scala code in the 1.0 snapshot?
> > It looks like we are using ScalaDoc conventions in Scala code.
>
>


Re: UploadedDRM dir in mahout-spark root

2014-05-19 Thread Dmitriy Lyubimov
I think the Scala tests don't have a common approach to temporary files,
hence they just write stuff into the default directory. This has been a bit
hasty. Do you want to fix it, or suggest a fix? I know the Java-side unit
tests have something for this, but not the Scala side.


On Sun, May 18, 2014 at 2:07 PM, Stevo Slavić  wrote:

> Hello team,
>
> Thought it was just me, but it's on Jenkins too (see
> https://builds.apache.org/view/All/job/mahout-nightly/ws/trunk/spark/ ) -
> running the build with tests creates an UploadedDRM directory in the root of
> the mahout-spark module. DrmLikeSuite might be the likely cause/producer.
>
> Kind regards,
> Stevo Slavic
>


Re: Build failed in Jenkins: Mahout-Quality #2608

2014-05-19 Thread Dmitriy Lyubimov
I am woefully uneducated on Jenkins stuff.


On Sun, May 18, 2014 at 12:07 AM, Sebastian Schelter  wrote:

> Does someone have to check why the build is still failing?
>
>
> On 05/13/2014 01:14 AM, Apache Jenkins Server wrote:
>
>> See 
>>
>> --
>> [...truncated 8432 lines...]
>> }
>>
>> Q=
>> {
>>0  =>{0:0.40273861426601687,1:-0.9153150324187648}
>>1  =>{0:0.9153150324227656,1:0.40273861426427493}
>> }
>> - C = A %*% B mapBlock {}
>> - C = A %*% B incompatible B keys
>> 36495 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings.blas.AtB$
>>  - A and B for A'B are not identically partitioned, performing inner join.
>> - C = At %*% B , join
>> 37989 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings.blas.AtB$
>>  - A and B for A'B are not identically partitioned, performing inner join.
>> - C = At %*% B , join, String-keyed
>> 39452 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings.blas.AtB$
>>  - A and B for A'B are identically distributed, performing row-wise zip.
>> - C = At %*% B , zippable, String-keyed
>> {
>>2  =>{0:62.0,1:86.0,3:132.0,2:115.0}
>>1  =>{0:50.0,1:69.0,3:105.0,2:92.0}
>>3  =>{0:74.0,1:103.0,3:159.0,2:138.0}
>>0  =>{0:26.0,1:35.0,3:51.0,2:46.0}
>> }
>> - C = A %*% inCoreB
>> {
>>0  =>{0:26.0,1:35.0,2:46.0,3:51.0}
>>1  =>{0:50.0,1:69.0,2:92.0,3:105.0}
>>2  =>{0:62.0,1:86.0,2:115.0,3:132.0}
>>3  =>{0:74.0,1:103.0,2:138.0,3:159.0}
>> }
>> - C = inCoreA %*%: B
>> 43683 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings.blas.AtA$
>>  - Applying slim A'A.
>> - C = A.t %*% A
>> 45370 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings.blas.AtA$
>>  - Applying non-slim non-graph A'A.
>> 70680 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings
>>  - test done.
>> - C = A.t %*% A fat non-graph
>> 71986 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
>> org.apache.mahout.sparkbindings.blas.AtA$
>>  - Applying slim A'A.
>> - C = A.t %*% A non-int key
>> - C = A + B
>> - C = A + B side test 1
>> - C = A + B side test 2
>> - C = A + B side test 3
>> ArrayBuffer(0, 1, 2, 3, 4)
>> ArrayBuffer(0, 1, 2, 3, 4)
>> - general side
>> - Ax
>> - A'x
>> - colSums, colMeans
>> Run completed in 1 minute, 31 seconds.
>> Total number of tests run: 38
>> Suites: completed 9, aborted 0
>> Tests: succeeded 38, failed 0, canceled 0, ignored 0, pending 0
>> All tests passed.
>> [INFO]
>> [INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact
>> (remove-old-mahout-artifacts) @ mahout-spark ---
>> [INFO] /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark
>> removed.
>> [INFO]
>> [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-spark ---
>> [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/
>> trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar
>> [INFO]
>> [INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-spark ---
>> [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/
>> trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar
>> [INFO]
>> [INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @
>> mahout-spark ---
>> [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/
>> trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar
>> [INFO]
>> [INFO] --- maven-install-plugin:2.5.1:install (default-install) @
>> mahout-spark ---
>> [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/
>> trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar to
>> /home/jenkins/.m2/repository/org/apache/mahout/mahout-
>> spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.jar
>> [INFO] Installing 
>> /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/pom.xml
>> to /home/jenkins/.m2/repository/org/apache/mahout/mahout-
>> spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.pom
>> [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/
>> trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar to
>> /home/jenkins/.m2/repository/org/apache/mahout/mahout-
>> spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-tests.jar
>> [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/
>> trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar to
>> /home/jenkins/.m2/repository/org/apache/mahout/mahout-
>> spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-sources.jar
>> [INFO]
>> [INFO] >>> maven-javadoc-plugin:2.9.1:javadoc (default-cli) @
>> mahout-spark >>>
>> [INFO]
>> [INFO] --- build-helper-maven-plugin:1.8:add-sou

[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002019#comment-14002019
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

Yes, please start off and do a pull request; it is a good way to have a 
concrete discussion. Thanks.

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1522) Handle logging levels via log4j.xml

2014-05-19 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001878#comment-14001878
 ] 

Andrew Musselman commented on MAHOUT-1522:
--

Haven't touched it; overbooked at work the past couple of months.

> Handle logging levels via log4j.xml
> ---
>
> Key: MAHOUT-1522
> URL: https://issues.apache.org/jira/browse/MAHOUT-1522
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 1.0
>
>
> We don't have a properties file to tell log4j what to do, so we inherit other 
> frameworks' settings.
> Suggestion is to add a log4j.xml file in a canonical place and set up logging 
> levels, maybe separating out components for ease of setting levels during 
> debugging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-19 Thread Stevo Slavić
+1


On Mon, May 19, 2014 at 5:21 PM, Grant Ingersoll wrote:

> +1
>
> On May 16, 2014, at 2:02 PM, Dmitriy Lyubimov  wrote:
>
> > Hi,
> >
> > I would like to initiate a procedural vote moving to git as our primary
> > commit system, and using github PRs as described in Jake Farrel's email
> to
> > @dev [1]
> >
> > [1]
> >
> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
> >
> > If voting succeeds, i will file a ticket with infra to commence necessary
> > changes and to move our project to git-wp as primary source for commits
> as
> > well as add github integration features [1]. (I assume pure git commits
> > will be required after that's done, with no svn commits allowed).
> >
> > The motivation is to engage GIT and github PR features as described, and
> > avoid git mirror history messes like we've seen associated with
> authors.txt
> > file fluctuations.
> >
> > PMC and committers have binding votes, so please vote. Lazy consensus
> with
> > minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra
> time
> > for weekend (i.e. Tuesday afternoon PST) .
> >
> > here is my +1
> >
> > -d
>
> 
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>
>


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-19 Thread Grant Ingersoll
+1

On May 16, 2014, at 2:02 PM, Dmitriy Lyubimov  wrote:

> Hi,
> 
> I would like to initiate a procedural vote moving to git as our primary
> commit system, and using github PRs as described in Jake Farrel's email to
> @dev [1]
> 
> [1]
> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
> 
> If voting succeeds, i will file a ticket with infra to commence necessary
> changes and to move our project to git-wp as primary source for commits as
> well as add github integration features [1]. (I assume pure git commits
> will be required after that's done, with no svn commits allowed).
> 
> The motivation is to engage GIT and github PR features as described, and
> avoid git mirror history messes like we've seen associated with authors.txt
> file fluctuations.
> 
> PMC and committers have binding votes, so please vote. Lazy consensus with
> minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
> for weekend (i.e. Tuesday afternoon PST) .
> 
> here is my +1
> 
> -d


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (MAHOUT-1470) Topic dump

2014-05-19 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001876#comment-14001876
 ] 

Andrew Musselman commented on MAHOUT-1470:
--

No progress; been overbooked at work.  If I don't get to it this week we could 
ask someone else to take it.

> Topic dump
> --
>
> Key: MAHOUT-1470
> URL: https://issues.apache.org/jira/browse/MAHOUT-1470
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 1.0
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Per 
> http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E
> > The script needs to be corrected to not call vectordump for LDA as
> > vectordump utility (or even clusterdump) are presently not capable of
> > displaying topics and relevant documents. I recall this issue was
> > previously reported by Peyman Faratin post 0.9 release.
> >
> > Mahout's missing a clusterdump utility that reads in LDA
> > topics, Document - DocumentId mapping and displays a report of the topics
> > and the documents that belong to a topic.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests

2014-05-19 Thread Sebastian Schelter
Can you create it in an svn-compatible way and check that it works with the
current trunk?

Thx, sebastian
Am 19.05.2014 12:01 schrieb "larryhu (JIRA)" :

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001554#comment-14001554]
>
> larryhu commented on MAHOUT-1543:
> -
>
> I'm so sorry for the trouble; this patch was created with git. I cloned it
> from github, tag: mahout-0.7.
>
> > JSON output format for classifying with random forests
> > --
> >
> > Key: MAHOUT-1543
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Classification
> >Affects Versions: 0.7, 0.8, 0.9
> >Reporter: larryhu
> >  Labels: patch
> > Fix For: 0.7
> >
> > Attachments: MAHOUT-1543.patch
> >
> >
> > This patch adds JSON output format to build random forests,
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests

2014-05-19 Thread larryhu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001554#comment-14001554
 ] 

larryhu commented on MAHOUT-1543:
-

I'm so sorry for the trouble; this patch was created with git. I cloned it 
from github, tag: mahout-0.7.

> JSON output format for classifying with random forests
> --
>
> Key: MAHOUT-1543
> URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: larryhu
>  Labels: patch
> Fix For: 0.7
>
> Attachments: MAHOUT-1543.patch
>
>
> This patch adds JSON output format to build random forests, 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1485) Clean up Recommender Overview page

2014-05-19 Thread Yash Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001542#comment-14001542
 ] 

Yash Sharma commented on MAHOUT-1485:
-

Roger that. Will update the doc soon.

> Clean up Recommender Overview page
> --
>
> Key: MAHOUT-1485
> URL: https://issues.apache.org/jira/browse/MAHOUT-1485
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> Clean up the recommender overview page, remove outdated content and make sure 
> the examples work.
> https://mahout.apache.org/users/recommender/recommender-documentation.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-19 Thread Sergey (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001434#comment-14001434
 ] 

Sergey commented on MAHOUT-1498:


Great, nice to hear it.
Looks like I have a similar problem here. It appears during execution as an 
Oozie Java action. I will investigate and create a separate ticket if I find 
the root cause of the problem. ClusterClassificationDriver is much more 
difficult to read than the previously patched modules. 
{code}
at 
org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at 
org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at 
org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{code}

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
>Assignee: Sebastian Schelter
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1498.patch
>
>
> Hi, I get exception 
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class 
> [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw 
> exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247
> {code}
> Looks like it happens because of the 
> DictionaryVectorizer.makePartialVectors method.
> It has code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overrides jars pushed with job by oozie:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>  String sfiles = StringUtils.uriToString(files);
>  conf.set("mapred.cache.files", sfiles);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)