[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-07-27 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
Currently failing to download deps.

```
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.4:process (default) on project zeppelin: Error downloading resources archive. Could not transfer artifact org.apache.apache.resources:apache-jar-resource-bundle:jar:1.5-SNAPSHOT from/to codehaus-snapshots (https://nexus.codehaus.org/snapshots/): nexus.codehaus.org: Name or service not known
```

Will try again in a little while. 






[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-07-26 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
All failures are due to Mahout not supporting Scala 2.11 yet.  Adding logic to 
detect this, similar to the existing detection of Spark versions < 1.5, and adding it to the testing suite.
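
Roughly the check I have in mind (a sketch only; `Properties.versionNumberString` is from the Scala standard library, and the skip logic itself is illustrative):

```scala
// Sketch: decide whether to run the Mahout tests based on the Scala
// binary version, since Mahout 0.12.x only ships _2.10 artifacts.
import scala.util.Properties

// e.g. "2.10.4" -> "2.10"
val scalaBinaryVersion: String =
  Properties.versionNumberString.split("\\.").take(2).mkString(".")

val mahoutSupported: Boolean = scalaBinaryVersion == "2.10"

if (!mahoutSupported)
  println(s"Skipping Mahout tests: no Mahout artifacts for Scala $scalaBinaryVersion")
```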

@bzz, pyspark issues seemed to have resolved themselves. 




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-07-14 Thread bzz
Github user bzz commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
Got it, thank you so much for digging into it!

Let me try to look into it more this week.




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-07-13 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
@bzz, I can't recreate the build failure.

I can say:
- Spark, pySpark, and Mahout notebooks and paragraphs run as expected.
- Spark and pySpark tests pass. Integration tests in `zeppelin-server` also pass. The only thing that fails is the Spark cluster test.
- The part of the Spark cluster test that fails is Python not being found when testing via the REST API.
- I can also confirm that all of the failing tests ALSO work as expected against a built Zeppelin (see the following Python script to recreate the tests).


``` python
# build zeppelin like this:
#
# mvn clean package -DskipTests -Psparkr -Ppyspark -Pspark-1.6

from requests import post, get, delete
from json import dumps

ZEPPELIN_SERVER = "localhost"
ZEPPELIN_PORT = 8080
base_url = "http://%s:%i" % (ZEPPELIN_SERVER, ZEPPELIN_PORT)


def create_notebook(name_of_new_notebook):
    payload = {"name": name_of_new_notebook}
    notebook_url = base_url + "/api/notebook"
    r = post(notebook_url, dumps(payload))
    return r.json()

def delete_notebook(notebook_id):
    target_url = base_url + "/api/notebook/%s" % notebook_id
    r = delete(target_url)
    return r

def create_paragraph(code, notebook_id, title=""):
    target_url = base_url + "/api/notebook/%s/paragraph" % notebook_id
    payload = {"title": title, "text": code}
    r = post(target_url, dumps(payload))
    return r.json()["body"]


notebook_id = create_notebook("test1")["body"]

test_codes = [
    "%spark print(sc.parallelize(1 to 10).reduce(_ + _))",
    "%r localDF <- data.frame(name=c(\"a\", \"b\", \"c\"), age=c(19, 23, 18))\n" +
    "df <- createDataFrame(sqlContext, localDF)\n" +
    "count(df)",
    "%pyspark print(sc.parallelize(range(1, 11)).reduce(lambda a, b: a + b))",
    "%pyspark print(sc.parallelize(range(1, 11)).reduce(lambda a, b: a + b))",
    "%pyspark\nfrom pyspark.sql.functions import *\n"
    + "print(sqlContext.range(0, 10).withColumn('uniform', rand(seed=10) * 3.14).count())",
    "%spark z.run(1)"
]

para_ids = [create_paragraph(c, notebook_id) for c in test_codes]

# run all paragraphs:
post(base_url + "/api/notebook/job/%s" % notebook_id)

# delete_notebook(notebook_id)
```

After two weeks of chasing dead ends and my own tail, I call this an issue 
with the testing env, not the Mahout interpreter.




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-17 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
@bzz and @Leemoonsoo 

A big part of the refactor was introducing no new dependencies: instead, 
Mahout is loaded from Maven or MAHOUT_HOME at interpreter startup via the 
dependency resolver.

This was done to get around version mismatches.
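
Roughly the shape of that logic (a simplified sketch, not the actual interpreter code; the type names here are made up for illustration):

```scala
// Simplified sketch of the resolution decision; not the actual interpreter
// code, and LocalJars/MavenArtifact are illustrative names only.
import java.io.File

sealed trait MahoutDeps
case class LocalJars(jars: Seq[File]) extends MahoutDeps        // from MAHOUT_HOME
case class MavenArtifact(coordinate: String) extends MahoutDeps // via dependency resolver

def mahoutDeps(): MahoutDeps = sys.env.get("MAHOUT_HOME") match {
  case Some(home) =>
    // A local Mahout install: pick up the pre-built jars it ships with.
    val jars = Option(new File(home).listFiles()).getOrElse(Array.empty[File])
    LocalJars(jars.filter(_.getName.endsWith(".jar")).toSeq)
  case None =>
    // No local install: let Zeppelin's dependency resolver pull the artifact
    // from Maven at interpreter startup (2.10 build; no 2.11 artifacts yet).
    MavenArtifact("org.apache.mahout:mahout-spark-shell_2.10:0.12.1")
}
```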

Here is the output from `mvn dependency:tree`. I'm not super familiar with it, so I'm 
not 100% sure there is nothing new, but I am like 97% sure.
```
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ zeppelin-mahout ---
[INFO] org.apache.zeppelin:zeppelin-mahout:jar:0.6.0-SNAPSHOT
[INFO] +- org.slf4j:slf4j-api:jar:1.7.10:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.10:compile
[INFO] |  \- log4j:log4j:jar:1.2.17:compile
[INFO] +- org.apache.zeppelin:zeppelin-interpreter:jar:0.6.0-SNAPSHOT:provided
[INFO] +- org.apache.zeppelin:zeppelin-spark:jar:0.6.0-SNAPSHOT:compile
[INFO] +- org.apache.zeppelin:zeppelin-spark-dependencies:jar:0.6.0-SNAPSHOT:compile
[INFO] +- com.google.code.gson:gson:jar:2.2:compile
[INFO] +- org.datanucleus:datanucleus-core:jar:3.2.10:compile
[INFO] +- org.datanucleus:datanucleus-api-jdo:jar:3.2.6:compile
[INFO] +- org.datanucleus:datanucleus-rdbms:jar:3.2.9:compile
[INFO] +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO] +- org.scala-lang:scala-compiler:jar:2.10.4:compile
[INFO] +- org.scala-lang:scala-reflect:jar:2.10.4:compile
[INFO] \- junit:junit:jar:4.11:test
[INFO]    \- org.hamcrest:hamcrest-core:jar:1.3:test
```




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
@bzz not quite done.  A little more testing and I realized jars aren't 
being properly loaded when Spark is in cluster mode.  Think you could take a 
peek and give me a hint as to why that might be?

Either way, I hope to have it all worked out by EOD tomorrow.  Will 
rebase when I get cluster mode working; then should be gtg!




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread bzz
Github user bzz commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
Great work, @rawkintrevo! It looks like a rebase on the latest master is 
needed now.




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
Per this:

http://stackoverflow.com/questions/32498891/spark-read-and-write-to-parquet-leads-to-outofmemoryerror-java-heap-space

and this:

http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization

I set `spark.kryoserializer.buffer` to 64k and `spark.kryoserializer.buffer.max` to 1g, 
and that seems to have solved it.
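
For anyone hitting the same thing, the equivalent settings expressed through `SparkConf` (these are the standard Spark 1.x property names from the config page linked above; where you set them, interpreter settings vs. spark-defaults.conf, is a deployment detail):

```scala
import org.apache.spark.SparkConf

// The two Kryo settings mentioned above: start the serializer buffer at
// 64k and let it grow to 1g before Kryo fails with a buffer/heap error.
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer", "64k")
  .set("spark.kryoserializer.buffer.max", "1g")
```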




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
As I am playing with this, things seem to stop/start working at random... 
I get the Thrift Server error in the Zeppelin context, with Java heap space errors 
related to the Kryo serializer in the logs.  Some sort of misconfiguration 
error (?)





[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
UPDATE: 

Sorry for the quick one-two punch, but the above error only occurs in Spark 
cluster mode, not in Spark local mode, leading me to believe the jars aren't 
getting loaded onto the executors.
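
My working theory: in cluster mode the executors never see the Mahout classes unless the jars are shipped explicitly, something like the sketch below (`sc.addJar` is the standard SparkContext API; the paths are hypothetical examples):

```scala
%spark

// Hypothetical example: ship the Mahout jars to the executors so that
// worker-side deserialization can find the classes. The paths below are
// illustrative only.
val mahoutJars = Seq(
  "/opt/mahout/mahout-math-0.12.1.jar",
  "/opt/mahout/mahout-math-scala_2.10-0.12.1.jar",
  "/opt/mahout/mahout-spark_2.10-0.12.1.jar"
)
mahoutJars.foreach(sc.addJar)
```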




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
Consider the code:
```scala
%mahout

val mxRnd = Matrices.symmetricUniformView(5000, 2, 1234)
val drmRand = drmParallelize(mxRnd)

val drmSin = drmRand.mapBlock() { case (keys, block) =>
  val blockB = block.like()
  for (i <- 0 until block.nrow) {
    blockB(i, 0) = block(i, 0)
    blockB(i, 1) = Math.sin(block(i, 0) * 8)
  }
  keys -> blockB
}

val drmRand = drmParallelize(mxRnd)
drmSampleToTSV(drmRand, 0.5)
```
This works fine; I can take the resulting string and do things with it.

However, when I run:
```scala
drmRand.collect(::, 1)
```

the error depends on whether I am using the Zeppelin dependency resolver (e.g. 
downloading jars from Maven) or pointing to `MAHOUT_HOME`.

If using the dependency resolver, the error is:

```
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    ...
```

^^ This is the standard error message when Zeppelin-Spark and the Spark on the 
cluster are version-mismatched; however, this was curious, as I was running Spark 
in local mode.

I added `provided` scope to the zeppelin-spark dependency in the mahout pom.xml; 
now both local and MAHOUT_HOME give the same error message, which follows. This 
leads me to believe (and I still do) that the problem lies in version mismatching:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.86.152): java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:207)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at
```

[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread dlyubimov
Github user dlyubimov commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
What's the message? 
`DRMLike.collect()` eventually translates to `RDD.collect()`.
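
Concretely, a minimal sketch of what that looks like from the notebook side (assuming the `%mahout` interpreter's usual imports and its implicit distributed context are in scope, as in the paragraphs above):

```scala
%mahout

// Minimal sketch, using the same API as the example earlier in the thread.
val drm = drmParallelize(Matrices.symmetricUniformView(100, 2, 1234))

// collect() materializes the DRM on the driver as an in-core Matrix by
// (eventually) calling RDD.collect() on the backing RDD, so a worker-side
// classpath or version mismatch surfaces exactly here.
val inCore = drm.collect
```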




[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-16 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
This appears to be working, but there is a bug when doing OLS regarding the 
Thrift server. It is the same error message one normally gets when trying to use 
an incompatible version of Spark. It is triggered by the `.collect` method on a DRM, 
but works fine for other operations.  Very confusing.





[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...

2016-06-14 Thread rawkintrevo
Github user rawkintrevo commented on the issue:

https://github.com/apache/zeppelin/pull/928
  
If someone could help me out, I'd appreciate it...

First of all, this works fine as expected in the notebooks (either way). 

In MahoutSparkInterpreter.java line 89, there is a jar I can load. 

If I don't load it (in testing only), I get the following error:
```
No Import : org.apache.mahout:mahout-spark-shell_2.10:0.12.1
<console>:35: error: type mismatch;
 found   : org.apache.spark.SparkContext
 required: org.apache.mahout.sparkbindings.SparkDistributedContext
       implicit val sdc: SparkDistributedContext = sc
                                                    ^
```
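
(For context: the implicit conversion that error is looking for is, roughly, what the mahout-spark-shell jar's `sparkbindings` package object provides; this is paraphrased, not copied verbatim.)

```scala
// Roughly the implicit the REPL needs in scope; paraphrased from the
// org.apache.mahout.sparkbindings package object, not copied verbatim.
import org.apache.spark.SparkContext
import org.apache.mahout.sparkbindings.SparkDistributedContext

implicit def sc2sdc(sc: SparkContext): SparkDistributedContext =
  new SparkDistributedContext(sc)
```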

If I do load it, again in testing only, I get the following errors 
(truncated):
```
With org.apache.mahout:mahout-spark-shell_2.10:0.12.1:

== Enclosing template or block ==

Import( // val : ImportType(org.apache.mahout.sparkbindings)
  "org"."apache"."mahout"."sparkbindings" // final package sparkbindings in package mahout, tree.tpe=org.apache.mahout.sparkbindings.type
  List(
    ImportSelector(_,1227,null,-1)
  )
)

uncaught exception during compilation: scala.reflect.internal.FatalError
<console>:4: error: illegal inheritance;
 self-type line2125238280$23.$read does not conform to scala.Serializable's selftype Serializable
class $read extends Serializable {
      ^
<console>:4: error: Serializable does not have a constructor
class $read extends Serializable {
      ^
<console>:6: error: illegal inheritance;
 self-type $read.this.$iwC does not conform to scala.Serializable's selftype Serializable
  class $iwC extends Serializable {
        ^
<console>:6: error: Serializable does not have a constructor
  class $iwC extends Serializable {
        ^
<console>:11: error: illegal inheritance;
 self-type $iwC does not conform to scala.Serializable's selftype Serializable
    class $iwC extends Serializable {
          ^
...
```

Since this works in the actual notebooks, I'm not sure what I'm doing wrong 
in setting up the testing env.  Any advice would be appreciated.

Thanks,

tg


