[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004278#comment-14004278
 ] 

Anand Avati commented on MAHOUT-1490:
-

The "freeze" concept in H2O is orthogonal to compression/decompression 
(Compressed = Chunk, Decompressed = NewChunk. inflate() is for inflating Chunk 
to NewChunk, compress() is for compressing NewChunk to Chunk). Freezeable is 
H2O's in-house serialization framework (think of it as Kryo alternative) 
hardwired to deliver highest performance by "manually" hand rolling fields into 
a buffer, though it is not as "general purpose" as Serializable or Kryo.
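
To make the hand-rolled style concrete, here is a minimal Scala sketch of the 
general pattern; the CellStats type and its field layout are illustrative 
assumptions, not H2O's actual Freezable API:

{code}
import java.nio.ByteBuffer

// A hypothetical value with two fields, serialized by writing the fields
// directly into a buffer in a fixed order: no reflection, no class metadata.
case class CellStats(rows: Long, mean: Double) {
  def write(bb: ByteBuffer): ByteBuffer = { bb.putLong(rows); bb.putDouble(mean); bb }
}

object CellStats {
  // Reading must mirror the write order exactly; that is the price paid for
  // skipping a general-purpose framework like Serializable or Kryo.
  def read(bb: ByteBuffer): CellStats = CellStats(bb.getLong(), bb.getDouble())
}

object FreezeSketch extends App {
  val buf = CellStats(42L, 3.14).write(ByteBuffer.allocate(16))
  buf.flip()
  println(CellStats.read(buf)) // CellStats(42,3.14)
}
{code}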

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004275#comment-14004275
 ] 

Anand Avati commented on MAHOUT-1490:
-

[~dlyubimov] it is true that unless you iterate over the data multiple times, 
type-compression (scaling, biasing, reducing bit-width) does not give a lot of 
benefit. However, if random and mixed read/write is the expected access 
pattern, the overheads of inflation can be minimized by choosing a smaller 
Chunk size (which does not worsen the compression). It really depends on the 
use case of these R-like data frame bindings in Mahout (of which I do not know 
much). Type-compression apart, sparse compression is probably still applicable 
simply to scale to larger dimensions.

Naive question: are these "data frame" bindings really just for the interactive 
use case? Or do we expect ML algorithms to be implemented on top of data frames 
(instead of just DRM/matrix)?
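
For concreteness, a minimal Scala sketch of the kind of type-compression 
discussed above (scaling, biasing, reducing bit-width); the uniform byte-width 
layout here is an assumption for illustration, not H2O's actual Chunk format:

{code}
// Hypothetical type-compression of a chunk of doubles into bytes:
// if the values fit a small range, store value ~= bias + stored * scale.
case class CompressedChunk(bias: Double, scale: Double, data: Array[Byte]) {
  // Constant-time, lossy decompression of one element.
  def apply(i: Int): Double = bias + (data(i) & 0xff) * scale
}

object CompressedChunk {
  def compress(values: Array[Double]): CompressedChunk = {
    val lo = values.min
    val scale = if (values.max > lo) (values.max - lo) / 255.0 else 1.0
    val bytes = values.map(v => (((v - lo) / scale).round & 0xff).toByte)
    CompressedChunk(lo, scale, bytes)
  }
}

object TypeCompressionDemo extends App {
  val c = CompressedChunk.compress(Array(10.0, 10.5, 12.0))
  println(c(2)) // ~12.0, up to quantization error
}
{code}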

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Build failed in Jenkins: Mahout-Quality #2611

2014-05-20 Thread Apache Jenkins Server
See 

--
[...truncated 8426 lines...]
}

Q=
{
  0  => {0:0.40273861426601687,1:-0.9153150324187648}
  1  => {0:0.9153150324227656,1:0.40273861426427493}
}
- C = A %*% B mapBlock {}
- C = A %*% B incompatible B keys
36896 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- C = At %*% B , join
38363 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- C = At %*% B , join, String-keyed
39772 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are identically 
distributed, performing row-wise zip.
- C = At %*% B , zippable, String-keyed
{
  2  => {0:62.0,1:86.0,3:132.0,2:115.0}
  1  => {0:50.0,1:69.0,3:105.0,2:92.0}
  3  => {0:74.0,1:103.0,3:159.0,2:138.0}
  0  => {0:26.0,1:35.0,3:51.0,2:46.0}
}
- C = A %*% inCoreB
{
  0  => {0:26.0,1:35.0,2:46.0,3:51.0}
  1  => {0:50.0,1:69.0,2:92.0,3:105.0}
  2  => {0:62.0,1:86.0,2:115.0,3:132.0}
  3  => {0:74.0,1:103.0,2:138.0,3:159.0}
}
- C = inCoreA %*%: B
44036 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
- C = A.t %*% A
45580 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying non-slim non-graph A'A.
76801 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings  - test done.
- C = A.t %*% A fat non-graph
78062 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
- C = A.t %*% A non-int key
- C = A + B
- C = A + B side test 1
- C = A + B side test 2
- C = A + B side test 3
ArrayBuffer(0, 1, 2, 3, 4)
ArrayBuffer(0, 1, 2, 3, 4)
- general side
- Ax
- A'x
- colSums, colMeans
Run completed in 1 minute, 36 seconds.
Total number of tests run: 38
Suites: completed 9, aborted 0
Tests: succeeded 38, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact 
(remove-old-mahout-artifacts) @ mahout-spark ---
[INFO] /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark removed.
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-spark ---
[INFO] Building jar: 

[INFO] 
[INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-spark ---
[INFO] Building jar: 

[INFO] 
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ 
mahout-spark ---
[INFO] Building jar: 

[INFO] 
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ mahout-spark 
---
[INFO] Installing 

 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.jar
[INFO] Installing 
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.pom
[INFO] Installing 

 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-tests.jar
[INFO] Installing 

 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-sources.jar
[INFO] 
[INFO] >>> maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ mahout-spark >>>
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:add-source (add-source) @ mahout-spark 
---
[INFO] Source directory: 

 added.
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:add-test-source (add-test-source) @ 
mahout-spark ---
[INFO] Test Source directory: 

 added.
[INFO] 
[INFO] <<< maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ mahout-spark <<<
[INFO] 
[INFO] --- maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ ma

[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004160#comment-14004160
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

Also, in a realistic case we will be reading the frame blocks off media that 
does not internally use that compression (most likely, the media would be 
row-wise). So compression will stream in uncompressed data and will already 
have hit the memory bottleneck. So, in order to justify compression in these 
scenarios, we need to make sure that the compressed source will be iterated 
over more than once. Again, this is all just a programming model. 

For example, there might be an API that says "build a fast iterative source" 
explicitly, rather than always assuming it is a good thing. I kind of suspect 
that's what the H2O "freeze" concept encompasses.

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004149#comment-14004149
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

I did not say it was bad. I said the same thing: it is good for certain 
algorithms. Actually, a lot of algorithms, of an iterative nature.

I am just thinking about how to expose the cost to the algorithm layer so it 
doesn't do naive things. It is all about the programming model. Imagine an 
algorithm that does something like Gaussian elimination or Givens QR. Obviously 
compression doesn't help there, since the inflate/deflate cycle will cost more 
than any benefit of compressed reads; it would seem it would be faster with 
just uncompressed vectors.

Fortunately, we don't have to care about delayed updates, since we are doing 
100% in-core local operation here. 

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004141#comment-14004141
 ] 

Anand Avati commented on MAHOUT-1490:
-

Yes, I too think the inflate/deflate is not as bad as it sounds. Note that 
reads can still happen in inflated mode, both for the updated values and for 
other values, so a deflate is not a necessity that blocks other operations.


> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004130#comment-14004130
 ] 

Ted Dunning commented on MAHOUT-1490:
-

D,

Many algorithms can handle delayed updates.  In those cases, you can batch up 
the updates and thus amortize the inflate/update/deflate cycle.

Also, I don't think that random access is quite as bad as it sounds.  Certainly 
sequential access is preferable, but the effect of cache lines and grouped 
accesses can often mask the costs.
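
A minimal Scala sketch of the batching idea, assuming the inflate()/compress() 
chunk cycle discussed earlier in the thread (the Chunk and NewChunk traits here 
are hypothetical stand-ins, not H2O's classes):

{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for a compressed chunk and its writable form.
trait Chunk { def inflate(): NewChunk }
trait NewChunk { def set(i: Int, v: Double): Unit; def compress(): Chunk }

// Buffer updates, then apply them in one inflate -> update* -> compress
// pass, amortizing the expensive cycle over the whole batch.
final class BatchedUpdater(chunk: Chunk) {
  private val pending = ArrayBuffer.empty[(Int, Double)]
  def set(i: Int, v: Double): Unit = pending += ((i, v))
  def flush(): Chunk = {
    val nc = chunk.inflate()
    pending.foreach { case (i, v) => nc.set(i, v) }
    pending.clear()
    nc.compress()
  }
}
{code}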

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-20 Thread Dmitriy Lyubimov
Filed as INFRA-.


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-20 Thread Dmitriy Lyubimov
Vote has passed.

+1 : 10
others: 0



On Mon, May 19, 2014 at 8:22 AM, Stevo Slavić  wrote:

> +1
>
>
> On Mon, May 19, 2014 at 5:21 PM, Grant Ingersoll wrote:
>
> > +1
> >
> > On May 16, 2014, at 2:02 PM, Dmitriy Lyubimov  wrote:
> >
> > > Hi,
> > >
> > > I would like to initiate a procedural vote on moving to git as our
> > > primary commit system, and using github PRs as described in Jake
> > > Farrel's email to @dev [1].
> > >
> > > [1]
> > > https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
> > >
> > > If the voting succeeds, I will file a ticket with infra to commence the
> > > necessary changes and to move our project to git-wp as the primary
> > > source for commits, as well as add the github integration features [1].
> > > (I assume pure git commits will be required after that's done, with no
> > > svn commits allowed.)
> > >
> > > The motivation is to engage GIT and github PR features as described, and
> > > to avoid git mirror history messes like we've seen associated with
> > > authors.txt file fluctuations.
> > >
> > > PMC and committers have binding votes, so please vote. Lazy consensus,
> > > with a minimum of 3 +1 votes. The vote will conclude in 96 hours to
> > > allow some extra time for the weekend (i.e. Tuesday afternoon PST).
> > >
> > > Here is my +1
> > >
> > > -d
> >
> > 
> > Grant Ingersoll | @gsingers
> > http://www.lucidworks.com
> >
> >
> >
> >
> >
> >
>


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004037#comment-14004037
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

If an inflate/deflate cycle is needed to update an element, I take it that it 
changes the entire backing chunk representation, doesn't it? That's what I mean 
by immutable; Scala's immutable APIs are then mutable in the same sense (only 
by applying a functor).

This is important because blockwise data frame transformations may update their 
content (vector chunks) at random coordinates. Obviously, an inflate/deflate 
cycle for _each_ update makes that incredibly inefficient. You seem to imply 
the cycle inflate -> do all local task updates -> deflate again; this is far 
from the general algorithm pattern of random elementwise gets and sets (in-core 
operations with getQuick() and setQuick() in Mahout's sense). It also has 
further profound distributed-plan implications (determining the boundaries of a 
single map() fusion operation in order to avoid an inflate/deflate cycle 
between fused functors and monads, etc.). 

So the bottom line I am driving at is that if we consider a generic algorithm 
that does a random sequence of element reads and writes, it can't trivially 
capitalize on the reading speed, because it would essentially have to start 
working on the inflated representation at, potentially, the first random write. 
The only thing that more or less works is read-only access and, for the most 
part, sequential read-only access, which shines at compiling condensed 
summaries, but that's about it. In that, it seems awfully similar to 
SequentialAccessSparseVector (as opposed to RandomAccessSparseVector) in Mahout.

Sequential access usually implies a functor or monad, i.e. immutability of the 
source and sequential result construction. This is the happiest case for this 
approach and is also incredibly common.

Non-sequential access may imply both in-place random writes (i.e. writes to the 
source) and non-in-place random writes (i.e. writes to a separate output). This 
happens in some cases as well. 

The challenge here is to find a balance, or to somehow expose the costs of 
sequential access vs. random access vs. writes to the client algorithm for a 
particular backing implementation.
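
The analogy maps directly onto Mahout's in-core vectors; a small Scala sketch 
against the mahout-math API of this era (the SequentialAccessSparseVector copy 
constructor used below is an assumption about that API):

{code}
import org.apache.mahout.math.{RandomAccessSparseVector, SequentialAccessSparseVector}

object AccessPatterns extends App {
  val n = 1000

  // Random writes at arbitrary coordinates: hash-backed, cheap setQuick().
  val rasv = new RandomAccessSparseVector(n)
  rasv.setQuick(42, 1.0)
  rasv.setQuick(7, 2.0)

  // Sequential layout: fast ordered reads, costly random writes; the same
  // trade-off a compressed chunk makes.
  val sasv = new SequentialAccessSparseVector(rasv)
  var sum = 0.0
  var i = 0
  while (i < n) { sum += sasv.getQuick(i); i += 1 }
  println(sum) // 3.0
}
{code}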

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003903#comment-14003903
 ] 

Anand Avati commented on MAHOUT-1490:
-

[~dlyubimov], compression does not make it read-only, certainly not read-only 
like Spark's RDD. Data in a Frame is mutable. Depending on the type of update, 
the update is either cheap (if a new value can replace the old value in-place) 
or expensive (inflate, update1, update2, update3 ... deflate), but in any case 
it happens transparently behind the scenes; the user just calls set(). However, 
for the DSL backend I intend to _not_ mutate Frames and to treat them as 
read-only, to be compatible with the Spark RDD model (even though it might not 
be the most efficient in certain cases).

Data access is constant-time for dense compressed data, with negligible 
decompression overhead (one multiplication and one addition instruction with 
operands in registers). The chunk header knows the scale-down factor of the 
compression, so fetching the compressed value is a deterministic offset lookup 
as well. For sparse data, however, the worst case is a binary search to find 
the physical offset within a Chunk, though there are optimizations that make 
further accesses in the same vicinity happen in constant time.
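
A Scala sketch of both access paths described above (the concrete layout is an 
assumption for illustration, not H2O's actual Chunk format):

{code}
// Dense path: one multiply and one add per element.
final case class DenseChunk(bias: Double, scale: Double, data: Array[Short]) {
  def at(i: Int): Double = bias + data(i) * scale
}

// Sparse path: parallel arrays of row indices and values; the worst case is
// a binary search over the indices to locate the physical offset.
final case class SparseChunk(rows: Array[Int], values: Array[Double]) {
  def at(row: Int): Double = {
    val pos = java.util.Arrays.binarySearch(rows, row)
    if (pos >= 0) values(pos) else 0.0
  }
}

object ChunkAccessDemo extends App {
  val d = DenseChunk(bias = 10.0, scale = 0.5, data = Array[Short](0, 1, 4))
  println(d.at(2)) // 12.0
  val s = SparseChunk(rows = Array(3, 17, 99), values = Array(1.0, 2.0, 3.0))
  println(s.at(17)) // 2.0; absent rows decode to 0.0
}
{code}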

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #834

2014-05-20 Thread Apache Jenkins Server
See 


Changes:

[sslavic] Fixed Maven warnings (usage of deprecated pom.version, and missing 
plugin version for maven-scala-plugin)

[sslavic] Configured svn:ignore for mahout-spark and mahout-spark-shell modules

[ssc] MAHOUT-1388 Add command line support and logging for MLP

[ssc] MAHOUT-1498 DistributedCache.setCacheFiles in DictionaryVectorizer 
overwrites jars pushed using oozie

[ssc] MAHOUT-1385 Caching Encoders don't cache

[ssc] MAHOUT-1527 Fix wikipedia classifier example

[ssc] MAHOUT-1542 Tutorial for playing with Mahout's Spark shell

--
[...truncated 2189 lines...]
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/ShortCharProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/ShortIntProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/ShortShortProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/ShortLongProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/ShortFloatProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/ShortDoubleProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongByteProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongCharProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongIntProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongShortProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongLongProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongFloatProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/LongDoubleProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/FloatByteProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/FloatCharProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/FloatIntProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/FloatShortProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/FloatLongProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/function/FloatFloatProcedure.java
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/functi

[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003727#comment-14003727
 ] 

Dmitriy Lyubimov commented on MAHOUT-1529:
--

DRM is a legacy Mahout format inherited from all the MapReduce solvers. 

Perhaps one of the most popular commands, `seq2sparse`, produces string keys 
(the full document path name in the original corpus). A lot of solvers are 
agnostic propagators of the keys: SSVD -> U, both the MR and DSL versions, and 
so are DSPCA, thin QR, and (I think) current and future versions of factorizers 
such as ALS. For more examples of what a key can be, see "Mahout in Action" -- 
or bug the authors. Going forward, I am very likely to use more involved object 
structures internally as a key payload.

I honestly don't see value in a separate "local" backend, as Spark already 
provides one. It is very unlikely to be used.

Tuple definitions don't depend on Spark; at this point I don't see a reason to 
make them engine-specific.






> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003705#comment-14003705
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

@Anand: 

Am I correct that compression renders that data structure read-only? If so, it 
suggests a specific life cycle (such as compute -> deflate -> use 1 to N times 
-> relinquish). Right?

Also, what are the speed guarantees for accessing a random, ordinally indexed 
element?

Thank you.

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-20 Thread Dmitriy Lyubimov
You need to use something called a "pull request".


On Mon, May 19, 2014 at 9:58 PM, Saikat Kanjilal (JIRA) wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002807#comment-14002807]
>
> Saikat Kanjilal commented on MAHOUT-1490:
> -
>
> Dmitriy,
> I am not able to push my code changes to your repo; it says I don't have
> permissions. Can you add me to the list of contributors? I've introduced 4
> new classes for the 4 data types (String/Integer/Double/Long) that extend
> the DataFrameLike trait and contain an Unsafe data type as an internal
> variable. I need to push these changes, so let me know when the permissions
> are set or what the issue may be.
>
> Thanks in advance.
>
> > Data frame R-like bindings
> > --
> >
> > Key: MAHOUT-1490
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> > Project: Mahout
> >  Issue Type: New Feature
> >Reporter: Saikat Kanjilal
> >Assignee: Dmitriy Lyubimov
> > Fix For: 1.0
> >
> >   Original Estimate: 20h
> >  Remaining Estimate: 20h
> >
> > Create Data frame R-like bindings for spark
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


Re: consensus statement?

2014-05-20 Thread Dmitriy Lyubimov
inline


On Tue, May 20, 2014 at 12:42 AM, Sebastian Schelter  wrote:

>
>>
> Let's take the text from our homepage as a starting point. What should we
> add/remove/modify?
>
> 
> 
> The Mahout community decided to move its codebase onto modern data
> processing systems that offer a richer programming model and more efficient
> execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce
> algorithm implementations from now on. We will however keep our widely used
> MapReduce algorithms in the codebase and maintain them.
>
> We are building our future implementations on top of a

Scala

> DSL for linear algebraic operations which has been developed over the last
> months. Programs written in this DSL are automatically optimized and
> executed in parallel on Apache Spark.

More platforms to be added in the future.

>
> Furthermore, there is an experimental contribution underway which aims
> to integrate the H2O platform into Mahout.
> 
> 
>


Re: consensus statement?

2014-05-20 Thread Pat Ferrel
First, is there anything we can’t agree on in that statement? I see nothing to 
disagree with personally, though I see no need to talk about potential outside 
contributions here; I’ll let that slide.

If this is for the outside world, then it needs to clearly answer:
1) If I want to run the _latest_ Mahout code, what do I need to install/put 
into my lab or datacenter? The question is, “What am I buying into?”
2) If I want to contribute, what does it mean that Mahout accepts no new 
MapReduce code? What is the alternative? What new code would be acceptable? We 
have rejected a couple of proposed contributions because they were MapReduce. 

For #2
I’d change "Mahout will therefore reject new MapReduce algorithm 
implementations from now on.” to "Mahout will therefore reject new Hadoop 
MapReduce contributions--new Spark based contributions are welcome.”

For #1
Maybe the 'platform requirements' or 'installing on a cluster’ section is a 
better place to answer.

On May 20, 2014, at 12:42 AM, Sebastian Schelter  wrote:

On 05/18/2014 09:28 PM, Ted Dunning wrote:
> On Sun, May 18, 2014 at 11:33 AM, Sebastian Schelter  wrote:
> 
>> I suggest we start with a specific draft that someone prepares (maybe Ted
>> as he started the thread)
> 
> 
> This is a good strategy, and I am happy to start the discussion, but I
> wonder if it might help build consensus if somebody else started the ball
> rolling.
> 

Let's take the text from our homepage as a starting point. What should we 
add/remove/modify?


The Mahout community decided to move its codebase onto modern data processing 
systems that offer a richer programming model and more efficient execution than 
Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm 
implementations from now on. We will however keep our widely used MapReduce 
algorithms in the codebase and maintain them.

We are building our future implementations on top of a DSL for linear algebraic 
operations which has been developed over the last months. Programs written in 
this DSL are automatically optimized and executed in parallel on Apache Spark.

Furthermore, there is an experimental contribution underway which aims to 
integrate the H2O platform into Mahout.




[jira] [Commented] (MAHOUT-1385) Caching Encoders don't cache

2014-05-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003336#comment-14003336
 ] 

Hudson commented on MAHOUT-1385:


FAILURE: Integrated in Mahout-Quality #2610 (See 
[https://builds.apache.org/job/Mahout-Quality/2610/])
MAHOUT-1385 Caching Encoders don't cache (ssc: rev 1595634)
* /mahout/trunk/CHANGELOG
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/vectorizer/encoders/CachingContinuousValueEncoder.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/vectorizer/encoders/CachingStaticWordValueEncoder.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/vectorizer/encoders/CachingEncoderTest.java


> Caching Encoders don't cache
> 
>
> Key: MAHOUT-1385
> URL: https://issues.apache.org/jira/browse/MAHOUT-1385
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Johannes Schulte
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch
>
>
> The Caching... line of encoders contains code for caching the hash codes of 
> terms added to the vector. However, the method "hashForProbe" inside these 
> classes is never called, as its signature has String for the original-form 
> parameter (instead of byte[] like the other encoders).
> Changing this to byte[], however, would lose the Java String's internal 
> caching of its hash code, which is used as the key in the cache map, 
> triggering another hash code calculation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003337#comment-14003337
 ] 

Hudson commented on MAHOUT-1527:


FAILURE: Integrated in Mahout-Quality #2610 (See 
[https://builds.apache.org/job/Mahout-Quality/2610/])
MAHOUT-1527 Fix wikipedia classifier example (ssc: rev 1595627)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/examples/bin/classify-wiki.sh
* /mahout/trunk/examples/src/test/resources/country10.txt
* 
/mahout/trunk/integration/src/main/java/org/apache/mahout/text/WikipediaToSequenceFile.java
* 
/mahout/trunk/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaMapper.java


> Fix wikipedia classifier example
> 
>
> Key: MAHOUT-1527
> URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Documentation, Examples
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels 
> of wikipedia pages. Unfortunately, the example is totally broken:
> it relies on the old NB implementation, which has been removed; it suggests 
> using the whole of wikipedia as input, which will not work well on a single 
> machine; and the documentation uses commands that have long been removed from 
> bin/mahout. 
> The example needs to be updated to use the current naive bayes implementation, 
> and documentation on the website needs to be written.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003335#comment-14003335
 ] 

Hudson commented on MAHOUT-1542:


FAILURE: Integrated in Mahout-Quality #2610 (See 
[https://builds.apache.org/job/Mahout-Quality/2610/])
MAHOUT-1542 Tutorial for playing with Mahout's Spark shell (ssc: rev 1595595)
* /mahout/trunk/CHANGELOG
* 
/mahout/trunk/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/MahoutSparkILoop.scala


> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have created a tutorial for setting up the Spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website; could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP

2014-05-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1400#comment-1400
 ] 

Hudson commented on MAHOUT-1388:


FAILURE: Integrated in Mahout-Quality #2610 (See 
[https://builds.apache.org/job/Mahout-Quality/2610/])
MAHOUT-1388 Add command line support and logging for MLP (ssc: rev 1595684)
* /mahout/trunk/CHANGELOG
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/classifier/mlp/MultilayerPerceptron.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/classifier/mlp/NeuralNetwork.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/classifier/mlp/RunMultilayerPerceptron.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/classifier/mlp/TrainMultilayerPerceptron.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/classifier/mlp/Datasets.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/classifier/mlp/RunMultilayerPerceptronTest.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/classifier/mlp/TestMultilayerPerceptron.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/classifier/mlp/TestNeuralNetwork.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/classifier/mlp/TrainMultilayerPerceptronTest.java


> Add command line support and logging for MLP
> 
>
> Key: MAHOUT-1388
> URL: https://issues.apache.org/jira/browse/MAHOUT-1388
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 1.0
>Reporter: Yexi Jiang
>Assignee: Suneel Marthi
>  Labels: mlp, sgd
> Fix For: 1.0
>
> Attachments: Mahout-1388.patch, Mahout-1388.patch
>
>
> The user should have the ability to run the Perceptron from the command line.
> There are two programs to execute the MLP: training and labeling. The first 
> one takes the data as input and outputs the model; the second one takes the 
> model and unlabeled data as input and outputs the results.
> The parameters for training are as follows:
> 
> --input -i (input data)
> --skipHeader -sk // whether to skip the first row, this parameter is optional
> --labels -labels // the labels of the instances, separated by whitespace. 
> Take the iris dataset for example, the labels are 'setosa versicolor 
> virginica'.
> --model -mo  // in training mode, this is the location to store the model (if 
> the specified location has an existing model, it will update the model 
> through incremental learning), in labeling mode, this is the location to 
> store the result
> --update -u // whether to incrementally update the model; if this parameter 
> is not given, train the model from scratch
> --output -o   // this is only useful in labeling mode
> --layersize -ls (no. of units per hidden layer) // use whitespace separated 
> number to indicate the number of neurons in each layer (including input layer 
> and output layer), e.g. '5 3 2'.
> --squashingFunction -sf // currently only supports Sigmoid
> --momentum -m 
> --learningrate -l
> --regularizationweight -r
> --costfunction -cf   // the type of cost function,
> 
> For example, train a 3-layer (including input, hidden, and output) MLP with 
> 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the 
> parameter would be:
> mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o 
> /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01
> This command would read the training data from /tmp/training-data.csv and 
> write the trained model to /tmp/model.model.
> The parameters for labeling are as follows:
> -
> --input -i // input file path
> --columnRange -cr // the range of column used for feature, start from 0 and 
> separated by whitespace, e.g. 0 5
> --format -f // the format of input file, currently only supports csv
> --model -mo // the file path of the model
> --output -o // the output path for the results
> -
> If a user needs to use an existing model, they can use the following command:
> mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result
> Moreover, we should be providing default values if the user does not specify 
> any. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003334#comment-14003334
 ] 

Hudson commented on MAHOUT-1498:


FAILURE: Integrated in Mahout-Quality #2610 (See 
[https://builds.apache.org/job/Mahout-Quality/2610/])
MAHOUT-1498 DistributedCache.setCacheFiles in DictionaryVectorizer overwrites 
jars pushed using oozie (ssc: rev 1595643)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/vectorizer/term/TFPartialVectorReducer.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.java
* 
/mahout/trunk/mrlegacy/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.java
* 
/mahout/trunk/mrlegacy/src/test/java/org/apache/mahout/common/DistributedCacheFileLocationTest.java


> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
>Assignee: Sebastian Schelter
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1498.patch
>
>
> Hi, I get exception 
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class 
> [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw 
> exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247
> {code}
> It looks like it happens because of the 
> DictionaryVectorizer.makePartialVectors method.
> It has this code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overwrites the jars pushed with the job by Oozie:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>  String sfiles = StringUtils.uriToString(files);
>  conf.set("mapred.cache.files", sfiles);
> }
> {code}
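
The usual fix for this pattern is to append rather than replace; a hedged 
sketch (the attached patch may do this differently, but 
DistributedCache.addCacheFile is the standard Hadoop call that appends to 
mapred.cache.files instead of overwriting it):

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.filecache.DistributedCache

object DictionaryCacheSketch {
  // Append the dictionary to whatever is already cached (e.g. jars pushed
  // by Oozie) instead of replacing the whole mapred.cache.files list.
  def cacheDictionary(dictionaryUri: URI, conf: Configuration): Unit =
    DistributedCache.addCacheFile(dictionaryUri, conf)
}
{code}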



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Build failed in Jenkins: Mahout-Quality #2610

2014-05-20 Thread Apache Jenkins Server
See 

Changes:

[sslavic] Fixed Maven warnings (usage of deprecated pom.version, and missing 
plugin version for maven-scala-plugin)

[sslavic] Configured svn:ignore for mahout-spark and mahout-spark-shell modules

[ssc] MAHOUT-1388 Add command line support and logging for MLP

[ssc] MAHOUT-1498 DistributedCache.setCacheFiles in DictionaryVectorizer 
overwrites jars pushed using oozie

[ssc] MAHOUT-1385 Caching Encoders don't cache

[ssc] MAHOUT-1527 Fix wikipedia classifier example

[ssc] MAHOUT-1542 Tutorial for playing with Mahout's Spark shell

--
[...truncated 8427 lines...]
  1  => {0:0.9153150324227656,1:0.40273861426427493}
}
- C = A %*% B mapBlock {}
- C = A %*% B incompatible B keys
36682 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- C = At %*% B , join
38190 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- C = At %*% B , join, String-keyed
39661 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are identically 
distributed, performing row-wise zip.
- C = At %*% B , zippable, String-keyed
{
  2  => {0:62.0,1:86.0,3:132.0,2:115.0}
  1  => {0:50.0,1:69.0,3:105.0,2:92.0}
  3  => {0:74.0,1:103.0,3:159.0,2:138.0}
  0  => {0:26.0,1:35.0,3:51.0,2:46.0}
}
- C = A %*% inCoreB
{
  0  => {0:26.0,1:35.0,2:46.0,3:51.0}
  1  => {0:50.0,1:69.0,2:92.0,3:105.0}
  2  => {0:62.0,1:86.0,2:115.0,3:132.0}
  3  => {0:74.0,1:103.0,2:138.0,3:159.0}
}
- C = inCoreA %*%: B
43889 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
- C = A.t %*% A
45472 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying non-slim non-graph A'A.
70297 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings  - test done.
- C = A.t %*% A fat non-graph
71626 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
- C = A.t %*% A non-int key
- C = A + B
- C = A + B side test 1
- C = A + B side test 2
- C = A + B side test 3
ArrayBuffer(0, 1, 2, 3, 4)
ArrayBuffer(0, 1, 2, 3, 4)
- general side
- Ax
- A'x
- colSums, colMeans
Run completed in 1 minute, 31 seconds.
Total number of tests run: 38
Suites: completed 9, aborted 0
Tests: succeeded 38, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact 
(remove-old-mahout-artifacts) @ mahout-spark ---
[INFO] /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark removed.
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-spark ---
[INFO] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-spark ---
[INFO] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar
[INFO] 
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ 
mahout-spark ---
[INFO] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar
[INFO] 
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ mahout-spark 
---
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.jar
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/pom.xml to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.pom
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-tests.jar
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-sources.jar
[INFO] 
[INFO] >>> maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ mahout-spark >>>
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:add-source (add-source) @ mahout-spark 
---
[INFO] Source directory: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/targ

Re: consensus statement?

2014-05-20 Thread Sebastian Schelter

On 05/18/2014 09:28 PM, Ted Dunning wrote:

On Sun, May 18, 2014 at 11:33 AM, Sebastian Schelter  wrote:


I suggest we start with a specific draft that someone prepares (maybe Ted
as he started the thread)



This is a good strategy, and I am happy to start the discussion, but I
wonder if it might help build consensus if somebody else started the ball
rolling.



Let's take the text from our homepage as a starting point. What should we 
add/remove/modify?



The Mahout community decided to move its codebase onto modern data 
processing systems that offer a richer programming model and more 
efficient execution than Hadoop MapReduce. Mahout will therefore reject 
new MapReduce algorithm implementations from now on. We will however 
keep our widely used MapReduce algorithms in the codebase and maintain them.


We are building our future implementations on top of a DSL for linear 
algebraic operations which has been developed over the last months. 
Programs written in this DSL are automatically optimized and executed in 
parallel on Apache Spark.


Furthermore, there is an experimental contribution underway which aims 
to integrate the H2O platform into Mahout.




[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002743#comment-14002743
 ] 

Anand Avati edited comment on MAHOUT-1529 at 5/20/14 7:06 AM:
--

[~dlyubimov], I had a quick look at the commits, and the separation looks a lot 
cleaner now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The 
only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there 
some way we can restrict DrmLike to just Int and Double? Or fixate on just 
Double? While RDD supports arbitrary T, H2O supports only numeric types, which 
is sufficient for Mahout's needs.

UPDATE: I see that historically a DRM's row index need not necessarily be 
numerical. In practice, could this be anything other than a number or a string?

- I am toying around with the new separation to build a pure, from-scratch, 
local/in-memory "backend" which communicates through ByteArrayStream Java 
serialization. I am hoping this will not only serve as a reference for future 
backend implementors, but also help keep the test cases of the algorithms 
inside math-scala. Thoughts?

- 'type DrmTuple[K] = (K, Vector)' is probably better placed in 
spark/../package.scala, I think, as it is really an artifact of how the RDD is 
defined. However, BlockifiedDrmTuple[K] probably still belongs in math-scala.
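
One way to express the kind of key-type restriction asked about above; a 
hypothetical Scala sketch (Mahout's actual DrmLike does not use this 
sealed-evidence approach):

{code}
import scala.reflect.ClassTag

// Hypothetical type-class evidence limiting row keys to a closed set.
sealed trait DrmKey[K]
object DrmKey {
  implicit object IntKey    extends DrmKey[Int]
  implicit object LongKey   extends DrmKey[Long]
  implicit object StringKey extends DrmKey[String]
}

// A DRM handle whose key type must carry DrmKey evidence;
// RestrictedDrmLike[java.util.Date], say, would not find an implicit.
trait RestrictedDrmLike[K] {
  implicit def kTag: ClassTag[K]
  implicit def kEv: DrmKey[K]
  def nrow: Long
  def ncol: Int
}
{code}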


was (Author: avati):
[~dlyubimov], I had a quick look at the commits, and it looks a lot cleaner 
separation now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For e.g, it does not make sense to have DrmLike[String]. The only 
meaningful ones probably are DrmLike[Int] and DrmLike[Double]. Is there someway 
we can restrict DrmLike to just Int and Double? Or fixate on just Double? While 
RDD supports arbitrary T, H2O supports only numeric types which is sufficient 
for Mahout's needs.

UPDATE: I see that historically DRM's row index need not necessarily be 
numerical. In practice could this be anything other than a number or string?

- I am toying around with the new separation, to build a pure/from scratch 
local/in-memory "backend" which communicates through a ByteArrayStream Java 
serialization. I am hoping this will not only serve as a reference for future 
backend implementors, but also help to keep test cases of the algorithms inside 
math-scala. Thoughts?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

