[GitHub] zeppelin issue #928: [ZEPPELIN-116][WIP] Add Mahout Support for Spark Interp...
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

Currently failing to download deps:

```
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.4:process (default) on project zeppelin: Error downloading resources archive.
Could not transfer artifact org.apache.apache.resources:apache-jar-resource-bundle:jar:1.5-SNAPSHOT from/to codehaus-snapshots (https://nexus.codehaus.org/snapshots/): nexus.codehaus.org: Name or service not known
```

Will try again in a little while.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
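The `Name or service not known` failure above means the build is trying to resolve a snapshot from the long-gone `nexus.codehaus.org` host. One hedged workaround sketch (not part of this PR, and the mirror URL below is a placeholder, not a known-good repository) is to redirect the dead `codehaus-snapshots` repository via `~/.m2/settings.xml` so Maven stops resolving against it:

```xml
<!-- ~/.m2/settings.xml: redirect the defunct codehaus-snapshots repository.
     The target URL is a placeholder; point it at a repository manager you
     control or at any reachable repository that carries the artifact. -->
<settings>
  <mirrors>
    <mirror>
      <id>codehaus-snapshots-redirect</id>
      <mirrorOf>codehaus-snapshots</mirrorOf>
      <url>http://repository.example.com/snapshots/</url>
    </mirror>
  </mirrors>
</settings>
```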
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

All failures are due to Mahout not yet supporting Scala 2.11. Adding logic to detect this, similar to the existing detection of Spark < 1.5, and adding it to the testing suite. @bzz, the pyspark issues seem to have resolved themselves.
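For illustration, the kind of version gate described above could look like the following sketch (hypothetical Python, not the actual Java interpreter code; the function names are made up):

```python
# Hypothetical sketch of the version gate described above -- not the actual
# Zeppelin interpreter code. Mahout at this point only builds against
# Scala 2.10.x, so support is switched off for anything newer.

def parse_version(version):
    """Turn a version string like '2.10.4' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def mahout_supported(scala_version):
    """Return True when the interpreter's Scala version can run Mahout."""
    return parse_version(scala_version)[:2] == (2, 10)
```

The tuple comparison mirrors the existing "Spark < 1.5" style check rather than string comparison, which would mis-order versions like `2.9` vs `2.10`.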
Github user bzz commented on the issue: https://github.com/apache/zeppelin/pull/928

Got it, thank you so much for digging into it! Let me try to look into it more this week.
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

@bzz, I can't recreate the build failure. I can say:
- Spark, pySpark, and Mahout notebooks and paragraphs run as expected.
- Spark and pySpark tests pass. Integration tests in `zeppelin-server` also pass. The only thing that fails is the Spark Cluster test.
- The part of the Spark Cluster test that fails is python not being found when testing via the REST API.
- I can also confirm that all of the failing tests ALSO work as expected against a built Zeppelin (see the following python script to recreate the tests).

```python
# build zeppelin like this:
#
#   mvn clean package -DskipTests -Psparkr -Ppyspark -Pspark-1.6

from requests import post, get, delete
from json import dumps

ZEPPELIN_SERVER = "localhost"
ZEPPELIN_PORT = 8080
base_url = "http://%s:%i" % (ZEPPELIN_SERVER, ZEPPELIN_PORT)

def create_notebook(name_of_new_notebook):
    payload = {"name": name_of_new_notebook}
    notebook_url = base_url + "/api/notebook"
    r = post(notebook_url, dumps(payload))
    return r.json()

def delete_notebook(notebook_id):
    target_url = base_url + "/api/notebook/%s" % notebook_id
    r = delete(target_url)
    return r

def create_paragraph(code, notebook_id, title=""):
    target_url = base_url + "/api/notebook/%s/paragraph" % notebook_id
    payload = {"title": title, "text": code}
    r = post(target_url, dumps(payload))
    return r.json()["body"]

notebook_id = create_notebook("test1")["body"]

test_codes = [
    "%spark print(sc.parallelize(1 to 10).reduce(_ + _))",
    "%r localDF <- data.frame(name=c(\"a\", \"b\", \"c\"), age=c(19, 23, 18))\n"
    + "df <- createDataFrame(sqlContext, localDF)\n"
    + "count(df)",
    "%pyspark print(sc.parallelize(range(1, 11)).reduce(lambda a, b: a + b))",
    "%pyspark print(sc.parallelize(range(1, 11)).reduce(lambda a, b: a + b))",
    "%pyspark\nfrom pyspark.sql.functions import *\n"
    + "print(sqlContext.range(0, 10).withColumn('uniform', rand(seed=10) * 3.14).count())",
    "%spark z.run(1)",
]

para_ids = [create_paragraph(c, notebook_id) for c in test_codes]

# run all paragraphs:
post(base_url + "/api/notebook/job/%s" % notebook_id)

# delete_notebook(notebook_id)
```

After two weeks of chasing dead ends and my tail, I call this an issue with the testing env, not the mahout interpreter.
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

@bzz and @Leemoonsoo A big part of the refactor was introducing no new dependencies: instead, the Mahout jars are loaded from Maven or `MAHOUT_HOME` at interpreter startup via the dependency resolver. This was done to get around version mismatches. Here is the output from `mvn dependency:tree`. I'm not super familiar with it, so I'm not 100% sure there is nothing new, but I am like 97% sure.

```
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ zeppelin-mahout ---
[INFO] org.apache.zeppelin:zeppelin-mahout:jar:0.6.0-SNAPSHOT
[INFO] +- org.slf4j:slf4j-api:jar:1.7.10:compile
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.10:compile
[INFO] |  \- log4j:log4j:jar:1.2.17:compile
[INFO] +- org.apache.zeppelin:zeppelin-interpreter:jar:0.6.0-SNAPSHOT:provided
[INFO] +- org.apache.zeppelin:zeppelin-spark:jar:0.6.0-SNAPSHOT:compile
[INFO] +- org.apache.zeppelin:zeppelin-spark-dependencies:jar:0.6.0-SNAPSHOT:compile
[INFO] +- com.google.code.gson:gson:jar:2.2:compile
[INFO] +- org.datanucleus:datanucleus-core:jar:3.2.10:compile
[INFO] +- org.datanucleus:datanucleus-api-jdo:jar:3.2.6:compile
[INFO] +- org.datanucleus:datanucleus-rdbms:jar:3.2.9:compile
[INFO] +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO] +- org.scala-lang:scala-compiler:jar:2.10.4:compile
[INFO] +- org.scala-lang:scala-reflect:jar:2.10.4:compile
[INFO] \- junit:junit:jar:4.11:test
[INFO]    \- org.hamcrest:hamcrest-core:jar:1.3:test
```
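As a rough illustration of the `MAHOUT_HOME` loading path described above (a sketch, not the actual Zeppelin resolver code; the jar-name patterns are assumptions):

```python
import os
from glob import glob

# Illustrative sketch: collect Mahout jars from MAHOUT_HOME at interpreter
# startup instead of declaring them as build-time dependencies, so the
# interpreter picks up whatever Mahout version is installed. Not the actual
# Zeppelin resolver; the jar name patterns below are assumptions.
def find_mahout_jars(mahout_home):
    """Return Mahout jars found under mahout_home, sorted for stable ordering."""
    patterns = ["mahout-math-*.jar", "mahout-math-scala_*.jar",
                "mahout-spark_*.jar", "mahout-hdfs-*.jar"]
    jars = []
    for pattern in patterns:
        jars.extend(glob(os.path.join(mahout_home, pattern)))
    return sorted(jars)
```

Resolving at startup is what keeps the `dependency:tree` output above free of Mahout artifacts.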
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

@bzz not quite done. A little more testing and I realized jars aren't being properly loaded when Spark is in cluster mode. Think you could take a peek and try to give me a hint why that might be? Either way, I hope to have it all worked out by EOD tomorrow. Will rebase when I get cluster mode working, then it should be gtg!
Github user bzz commented on the issue: https://github.com/apache/zeppelin/pull/928

Great work, @rawkintrevo! It looks like a rebase on the latest master is needed now.
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

Per this: http://stackoverflow.com/questions/32498891/spark-read-and-write-to-parquet-leads-to-outofmemoryerror-java-heap-space and this: http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization I set the Kryo buffer to 64k and buffer.max to 1g, and that seems to have solved it.
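For reference, the settings described above in `spark-defaults.conf` form (a sketch; the same keys can also be set in the Zeppelin interpreter settings):

```
spark.kryoserializer.buffer      64k
spark.kryoserializer.buffer.max  1g
```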
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

As I am playing with this, things seem to stop/start working at random... I get the Thrift Server error in the Zeppelin context, with Java heap space errors related to the Kryo serializer in the logs. Some sort of misconfiguration error (?)
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

UPDATE: Sorry for the quick one-two punch, but the above error only occurs in Spark cluster mode, not in Spark local mode, leading me to believe the jars aren't getting loaded up.
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

Consider the code:

```scala
%mahout
val mxRnd = Matrices.symmetricUniformView(5000, 2, 1234)
val drmRand = drmParallelize(mxRnd)

val drmSin = drmRand.mapBlock() { case (keys, block) =>
  val blockB = block.like()
  for (i <- 0 until block.nrow) {
    blockB(i, 0) = block(i, 0)
    blockB(i, 1) = Math.sin(block(i, 0) * 8)
  }
  keys -> blockB
}

drmSampleToTSV(drmRand, 0.5)
```

This works fine; I can take the resulting string and do things with it. However, when I run

```scala
drmRand.collect(::, 1)
```

the error depends on whether I am using the Zeppelin Dependency Resolver (e.g. downloading jars from Maven) or pointing to `MAHOUT_HOME`. If using the dependency resolver, the error is:

```
org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
	...
```

^^ This is the standard error message when zeppelin-spark and the Spark on the cluster are version-mismatched; however, this was curious, as I was running Spark in local mode. I added `provided` to the zeppelin-spark dependency in the mahout pom.xml; now both local and `MAHOUT_HOME` give the same error message, which follows. This led me to believe (and I still do) that the problem lies in version mismatching:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.86.152): java.lang.IllegalStateException: unread block data
	at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2431)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:207)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at
```
Github user dlyubimov commented on the issue: https://github.com/apache/zeppelin/pull/928

What's the message? `DRMLike.collect()`, eventually, is a translation to `RDD.collect()`.
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

This appears to be working, but there is a bug when doing OLS regarding the Thrift server. It is the same error message one normally gets when trying to use an incompatible version of Spark. It is triggered by the `.collect` method on a drm, but works fine for other operations. Very confusing.
Github user rawkintrevo commented on the issue: https://github.com/apache/zeppelin/pull/928

If someone could help me out I'd appreciate it... First of all, this works fine as expected in the notebooks (either way). In MahoutSparkInterpreter.java line 89, there is a jar I can load. If I don't load it, in testing only, I get the following error:

```
No Import : org.apache.mahout:mahout-spark-shell_2.10:0.12.1

:35: error: type mismatch;
 found   : org.apache.spark.SparkContext
 required: org.apache.mahout.sparkbindings.SparkDistributedContext
       implicit val sdc: SparkDistributedContext = sc
                                                   ^
```

If I do load it, again in testing only, I get the following errors (truncated):

```
With org.apache.mahout:mahout-spark-shell_2.10:0.12.1:

== Enclosing template or block ==

Import( // val : ImportType(org.apache.mahout.sparkbindings)
  "org"."apache"."mahout"."sparkbindings" // final package sparkbindings in package mahout, tree.tpe=org.apache.mahout.sparkbindings.type
  List(
    ImportSelector(_,1227,null,-1)
  )
)

uncaught exception during compilation: scala.reflect.internal.FatalError

:4: error: illegal inheritance;
 self-type line2125238280$23.$read does not conform to scala.Serializable's selftype Serializable
class $read extends Serializable {
      ^
:4: error: Serializable does not have a constructor
class $read extends Serializable {
      ^
:6: error: illegal inheritance;
 self-type $read.this.$iwC does not conform to scala.Serializable's selftype Serializable
class $iwC extends Serializable {
      ^
:6: error: Serializable does not have a constructor
class $iwC extends Serializable {
      ^
:11: error: illegal inheritance;
 self-type $iwC does not conform to scala.Serializable's selftype Serializable
class $iwC extends Serializable {
      ^
...
```

Since this works in the actual notebooks, I'm not sure what I'm doing wrong in setting up the testing env. Any advice would be appreciated.

Thanks,
tg