Re: Joining data using Latitude, Longitude

2015-03-12 Thread Andrew Musselman
Ted Dunning and Ellen Friedman's Time Series Databases has a section on
this with some approaches to geo-encoding:

https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf

On Tue, Mar 10, 2015 at 3:53 PM, John Meehan jnmee...@gmail.com wrote:

 There are some techniques you can use if you geohash
 http://en.wikipedia.org/wiki/Geohash the lat-lngs.  They will naturally
 be sorted by proximity (with some edge cases so watch out).  If you go the
 join route, either by trimming the lat-lngs or geohashing them, you’re
 essentially grouping nearby locations into buckets — but you have to
 consider the borders of the buckets since the nearest location may actually
 be in an adjacent bucket.  Here’s a paper that discusses an implementation:
 http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf
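
 For reference, a minimal sketch of that bucket-and-join idea on plain Spark RDDs, keying each query point by its own grid cell plus the eight neighboring cells so the adjacent-bucket border cases mentioned above are covered. The Place type, the cell size, and the crude in-cell distance are illustrative assumptions, not from the thread; swap in haversine for real use.

 import org.apache.spark.SparkContext._
 import org.apache.spark.rdd.RDD

 // Hypothetical record type for illustration.
 case class Place(id: String, lat: Double, lon: Double)

 object GridJoin {
   // Bucket a coordinate onto a coarse grid; cellSizeDeg is the bucket width in degrees.
   def cell(lat: Double, lon: Double, cellSizeDeg: Double): (Int, Int) =
     (math.floor(lat / cellSizeDeg).toInt, math.floor(lon / cellSizeDeg).toInt)

   // Key reference points by their own cell; key query points by their cell plus the
   // eight neighboring cells so a nearest neighbor in an adjacent bucket is not missed.
   def nearest(queries: RDD[Place], refs: RDD[Place], cellSizeDeg: Double): RDD[(Place, Place)] = {
     val refsByCell = refs.map(r => (cell(r.lat, r.lon, cellSizeDeg), r))
     val queriesByCell = queries.flatMap { q =>
       val (ci, cj) = cell(q.lat, q.lon, cellSizeDeg)
       for (di <- -1 to 1; dj <- -1 to 1) yield ((ci + di, cj + dj), q)
     }
     queriesByCell.join(refsByCell)
       .map { case (_, (q, r)) =>
         // Squared difference in degrees is a crude stand-in for distance; it is only
         // used to rank candidates that already fall within neighboring cells.
         val d = math.pow(q.lat - r.lat, 2) + math.pow(q.lon - r.lon, 2)
         (q.id, (q, r, d))
       }
       .reduceByKey((a, b) => if (a._3 <= b._3) a else b) // keep the closest candidate per query
       .map { case (_, (q, r, _)) => (q, r) }
   }
 }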

 On Mar 9, 2015, at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

 Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest co-ordinate. If you are using the normal Spark code (by creating key-pairs on lat,lon) you can apply certain logic like trimming the lat,lon, etc. If you want a more precise computation then you are better off using the haversine formula.
 http://www.movable-type.co.uk/scripts/latlong.html
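
 A self-contained version of the haversine formula from that page, as a sketch; the kilometer radius and function name are just for illustration.

 // Great-circle distance in kilometers between two lat/lon points (haversine formula).
 def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
   val R = 6371.0 // mean Earth radius in km
   val dLat = math.toRadians(lat2 - lat1)
   val dLon = math.toRadians(lon2 - lon1)
   val a = math.pow(math.sin(dLat / 2), 2) +
     math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
   2 * R * math.asin(math.sqrt(a))
 }

 // e.g. haversineKm(47.61, -122.33, 45.52, -122.68) is roughly 234 km (Seattle to Portland)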





Build error

2015-01-30 Thread Andrew Musselman
Off master, got this error; is that typical?

---
 T E S T S
---
Running org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.495 sec -
in org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite

Results :




Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (test) @
spark-streaming-mqtt_2.10 ---
Discovery starting.
Discovery completed in 498 milliseconds.
Run starting. Expected test count is: 1
MQTTStreamSuite:
- mqtt input stream *** FAILED ***
  org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in
progress
  at
org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:432)
  at
org.eclipse.paho.client.mqttv3.internal.ClientComms.internalSend(ClientComms.java:121)
  at
org.eclipse.paho.client.mqttv3.internal.ClientComms.sendNoWait(ClientComms.java:139)
  at org.eclipse.paho.client.mqttv3.MqttTopic.publish(MqttTopic.java:107)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$publishData$1.apply(MQTTStreamSuite.scala:125)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$publishData$1.apply(MQTTStreamSuite.scala:124)
  at scala.collection.immutable.Range.foreach(Range.scala:141)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite.publishData(MQTTStreamSuite.scala:124)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$3.apply$mcV$sp(MQTTStreamSuite.scala:78)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$3.apply(MQTTStreamSuite.scala:66)
  ...
Exception in thread "Thread-20" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:690)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:689)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:689)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1384)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1319)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1250)
at
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:510)
at
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:485)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply$mcV$sp(MQTTStreamSuite.scala:59)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply(MQTTStreamSuite.scala:57)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply(MQTTStreamSuite.scala:57)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:210)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite.runTest(MQTTStreamSuite.scala:38)
at
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org
$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org
$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at org.apache.spark.streaming.mqtt.MQTTStreamSuite.org
$scalatest$BeforeAndAfter$$super$run(MQTTStreamSuite.scala:38)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite.run(MQTTStreamSuite.scala:38)
at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
at
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
at
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 

Re: Row similarities

2015-01-17 Thread Andrew Musselman
Yeah okay, thanks.

 On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote:
 
 Pat, columnSimilarities is what that blog post is about, and is already part 
 of Spark 1.2.
 
 rowSimilarities in a RowMatrix is a little trickier because you can't transpose a RowMatrix easily; it is being tracked by this JIRA: 
 https://issues.apache.org/jira/browse/SPARK-4823
 
 Andrew, sometimes (not always) it's OK to transpose a RowMatrix: if, for example, the number of rows in your RowMatrix is less than 1m, you can transpose it and use rowSimilarities.
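
 For reference, a sketch of that transpose in Spark 1.2 MLlib: flip the nonzero (i, j) entries through a CoordinateMatrix and rebuild a RowMatrix; columnSimilarities of the transposed matrix then gives the row similarities of the original. Names here are illustrative, and this only makes sense when the original row count is modest, as Reza notes.

 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}

 // Transpose by flipping (i, j) on the nonzero entries, then rebuild a RowMatrix.
 // Row indices come from zipWithIndex and are arbitrary but consistent.
 def transpose(mat: RowMatrix): RowMatrix = {
   val entries = mat.rows.zipWithIndex.flatMap { case (row, i) =>
     row.toArray.zipWithIndex.collect { case (v, j) if v != 0.0 => MatrixEntry(j.toLong, i, v) }
   }
   new CoordinateMatrix(entries).toRowMatrix()
 }

 // Row similarities of mat are the column similarities of its transpose:
 // val rowSims = transpose(mat).columnSimilarities()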
 
 
 On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com wrote:
 BTW it looks like row and column similarities (cosine based) are coming to 
 MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the 
 master yet. Does anyone know the status?
 
 See: 
 https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
 
 Also the method for computation reduction (make it less than O(n^2)) seems 
 rooted in cosine. A different computation reduction method is used in the 
 Mahout code tied to LLR. Seems like we should get these together.
  
 On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Excellent, thanks Pat.
 
 On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Mahout’s Spark implementation of rowsimilarity is in the Scala 
 SimilarityAnalysis class. It actually does either row or column similarity 
 but only supports LLR at present. It does [AA’] for columns or [A’A] for 
 rows first then calculates the distance (LLR) for non-zero elements. This 
 is a major optimization for sparse matrices. As I recall the old hadoop 
 code only did this for half the matrix since it’s symmetric but that 
 optimization isn’t in the current code because the downsampling is done as 
 LLR is calculated, so the entire similarity matrix is never actually 
 calculated unless you disable downsampling. 
 
 The primary use is for recommenders but I’ve used it (in the test suite) 
 for row-wise text token similarity too.  
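
 For context, the LLR score itself is just Dunning's G-squared statistic computed from a 2x2 co-occurrence table. A rough standalone sketch of that calculation (not the actual Mahout code) looks like this:

 // k11 = rows where both features occur, k12 = first only, k21 = second only, k22 = neither.
 def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
 def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

 def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
   val rowEntropy = entropy(k11 + k12, k21 + k22)
   val colEntropy = entropy(k11 + k21, k12 + k22)
   val matEntropy = entropy(k11, k12, k21, k22)
   math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy)) // clamp rounding error
 }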
 
 On Jan 17, 2015, at 9:00 AM, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 
 and poking around to see how to do things.
 
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
 
 On Jan 17, 2015, at 8:35 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Andrew, u would be better off using Mahout's RowSimilarityJob for what u r 
 trying to accomplish.
 
  1.  It does give u pair-wise distances
  2.  U can specify the Distance measure u r looking to use
  3.  There's the old MapReduce impl and the Spark DSL impl per ur 
 preference.
 
 From: Andrew Musselman andrew.mussel...@gmail.com
 To: Reza Zadeh r...@databricks.com 
 Cc: user user@spark.apache.org 
 Sent: Saturday, January 17, 2015 11:29 AM
 Subject: Re: Row similarities
 
 Thanks Reza, interesting approach.  I think what I actually want is to 
 calculate pair-wise distance, on second thought.  Is there a pattern for 
 that?
 
 
 
 On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:
 
 You can use K-means with a suitably large k. Each cluster should 
 correspond to rows that are similar to one another.
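
 A minimal MLlib sketch of that suggestion; k and the iteration count here are placeholders to tune for your data.

 import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.clustering.KMeans
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD

 // Cluster the rows; rows assigned to the same cluster are treated as similar.
 def groupSimilarRows(rows: RDD[Vector], k: Int, numIterations: Int = 20) = {
   val model = KMeans.train(rows, k, numIterations)
   rows.map(v => (model.predict(v), v)).groupByKey() // clusterId -> rows in that cluster
 }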
 
 On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 What's a good way to calculate similarities between all vector-rows in a 
 matrix or RDD[Vector]?
 
 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm 
 going down a good path to transpose a matrix in order to run that.
 


Re: Row similarities

2015-01-17 Thread Andrew Musselman
Makes sense.

 On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote:
 
 We're focused on providing block matrices, which makes transposition simple: 
 https://issues.apache.org/jira/browse/SPARK-3434
 
 On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote:
 In the Mahout Spark R-like DSL [A’A] and [AA’] don't actually do a transpose—it’s optimized out. Mahout has had a stand-alone row matrix transpose since day 1 and supports it in the Spark version. Can’t really do matrix algebra without it, even though it’s often possible to optimize it away.
 
 Row similarity with LLR is much simpler than cosine since you only need non-zero sums for column, row, and matrix elements, so rowSimilarity is implemented in Mahout for Spark. Full-blown row similarity, including all the different similarity methods (long since implemented in Hadoop MapReduce), hasn’t been moved to Spark yet.
 
 Yep, rows are not covered in the blog, my mistake. Too bad; it has a lot of uses and can at the very least be optimized for output matrix symmetry.
 
 On Jan 17, 2015, at 11:44 AM, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Yeah okay, thanks.
 
 On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote:
 
 Pat, columnSimilarities is what that blog post is about, and is already 
 part of Spark 1.2.
 
 rowSimilarities in a RowMatrix is a little trickier because you can't transpose a RowMatrix easily; it is being tracked by this JIRA: 
 https://issues.apache.org/jira/browse/SPARK-4823
 
 Andrew, sometimes (not always) it's OK to transpose a RowMatrix: if, for example, the number of rows in your RowMatrix is less than 1m, you can transpose it and use rowSimilarities.
 
 
 On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com 
 wrote:
 BTW it looks like row and column similarities (cosine based) are coming to 
 MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the 
 master yet. Does anyone know the status?
 
 See: 
 https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
 
 Also the method for computation reduction (make it less than O(n^2)) seems 
 rooted in cosine. A different computation reduction method is used in the 
 Mahout code tied to LLR. Seems like we should get these together.
  
 On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Excellent, thanks Pat.
 
 On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Mahout’s Spark implementation of rowsimilarity is in the Scala 
 SimilarityAnalysis class. It actually does either row or column 
 similarity but only supports LLR at present. It does [AA’] for columns or 
 [A’A] for rows first then calculates the distance (LLR) for non-zero 
 elements. This is a major optimization for sparse matrices. As I recall 
 the old hadoop code only did this for half the matrix since it’s 
 symmetric but that optimization isn’t in the current code because the 
 downsampling is done as LLR is calculated, so the entire similarity 
 matrix is never actually calculated unless you disable downsampling. 
 
 The primary use is for recommenders but I’ve used it (in the test suite) 
 for row-wise text token similarity too. 
 
 On Jan 17, 2015, at 9:00 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 
 and poking around to see how to do things.
 
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
 
 On Jan 17, 2015, at 8:35 AM, Suneel Marthi suneel_mar...@yahoo.com 
 wrote:
 
 Andrew, u would be better off using Mahout's RowSimilarityJob for what u 
 r trying to accomplish.
 
  1.  It does give u pair-wise distances
  2.  U can specify the Distance measure u r looking to use
  3.  There's the old MapReduce impl and the Spark DSL impl per ur 
 preference.
 
 From: Andrew Musselman andrew.mussel...@gmail.com
 To: Reza Zadeh r...@databricks.com 
 Cc: user user@spark.apache.org 
 Sent: Saturday, January 17, 2015 11:29 AM
 Subject: Re: Row similarities
 
 Thanks Reza, interesting approach.  I think what I actually want is to 
 calculate pair-wise distance, on second thought.  Is there a pattern for 
 that?
 
 
 
 On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:
 
 You can use K-means with a suitably large k. Each cluster should 
 correspond to rows that are similar to one another.
 
 On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 What's a good way to calculate similarities between all vector-rows in 
 a matrix or RDD[Vector]?
 
 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure 
 I'm going down a good path to transpose a matrix in order to run that.
 


Re: Maven out of memory error

2015-01-17 Thread Andrew Musselman
Failing for me and another team member on the command line, for what it's worth.

 On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote:
 
 Hm, this test hangs for me in IntelliJ. It could be a real problem,
 and a combination of a) just recently actually enabling Java tests, b)
 recent updates to the complicated Guava shading situation.
 
 The manifestation of the error usually suggests that something totally
 failed to start (because of, say, class incompatibility errors, etc.)
 Thus things hang and time out waiting for the dead component. It's
 sometimes hard to get answers from the embedded component that dies
 though.
 
 That said, it seems to pass on the command line. For example my recent
 Jenkins job shows it passes:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25682/consoleFull
 
 I'll try to uncover more later this weekend. Thoughts welcome though.
 
 On Fri, Jan 16, 2015 at 8:26 PM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:
 Thanks Ted, got farther along but now have a failing test; is this a known
 issue?
 
 ---
 T E S T S
 ---
 Running org.apache.spark.JavaAPISuite
 Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec <<< FAILURE! - in org.apache.spark.JavaAPISuite
 testGuavaOptional(org.apache.spark.JavaAPISuite)  Time elapsed: 106.5 sec <<< ERROR!
 org.apache.spark.SparkException: Job aborted due to stage failure: Master
 removed our application: FAILED
at
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)
at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187)
at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187)
at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
 Running org.apache.spark.JavaJdbcRDDSuite
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec -
 in org.apache.spark.JavaJdbcRDDSuite
 
 Results :
 
 
 Tests in error:
  JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure:
 Maste...
 
 On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote:
 
 Can you try doing this before running mvn ?
 
 export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 
 What OS are you using ?
 
 Cheers
 
 On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:
 
 Just got the latest from Github and tried running `mvn test`; is this
 error common and do you have any advice on fixing it?
 
 Thanks!
 
 [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-core_2.10 ---
 [WARNING] Zinc server is not available at port 3030 - reverting to normal
 incremental compile
 [INFO] Using incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
 [INFO] Compiling 400 Scala sources and 34 Java sources to
 /home/akm/spark/core/target/scala-2.10/classes...
 [WARNING]
 /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22:
 imported `DataReadMethod' is permanently hidden by definition of object
 DataReadMethod in package executor
 [WARNING] import org.apache.spark.executor.DataReadMethod
 [WARNING]  ^
 [WARNING]
 /home/akm/spark/core/src/main/scala/org/apache

Re: Row similarities

2015-01-17 Thread Andrew Musselman
Thanks Reza, interesting approach.  I think what I actually want is to 
calculate pair-wise distance, on second thought.  Is there a pattern for that?

 On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:
 
 You can use K-means with a suitably large k. Each cluster should correspond 
 to rows that are similar to one another.
 
 On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 What's a good way to calculate similarities between all vector-rows in a 
 matrix or RDD[Vector]?
 
 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm 
 going down a good path to transpose a matrix in order to run that.
 


Re: Row similarities

2015-01-17 Thread Andrew Musselman
Excellent, thanks Pat.

 On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Mahout’s Spark implementation of rowsimilarity is in the Scala 
 SimilarityAnalysis class. It actually does either row or column similarity 
 but only supports LLR at present. It does [AA’] for columns or [A’A] for rows 
 first then calculates the distance (LLR) for non-zero elements. This is a 
 major optimization for sparse matrices. As I recall the old hadoop code only 
 did this for half the matrix since it’s symmetric but that optimization isn’t 
 in the current code because the downsampling is done as LLR is calculated, so 
 the entire similarity matrix is never actually calculated unless you disable 
 downsampling. 
 
 The primary use is for recommenders but I’ve used it (in the test suite) for 
 row-wise text token similarity too.  
 
 On Jan 17, 2015, at 9:00 AM, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:
 
 Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 and 
 poking around to see how to do things.
 
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
 
 On Jan 17, 2015, at 8:35 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Andrew, u would be better off using Mahout's RowSimilarityJob for what u r 
 trying to accomplish.
 
  1.  It does give u pair-wise distances
  2.  U can specify the Distance measure u r looking to use
  3.  There's the old MapReduce impl and the Spark DSL impl per ur preference.
 
 From: Andrew Musselman andrew.mussel...@gmail.com
 To: Reza Zadeh r...@databricks.com 
 Cc: user user@spark.apache.org 
 Sent: Saturday, January 17, 2015 11:29 AM
 Subject: Re: Row similarities
 
 Thanks Reza, interesting approach.  I think what I actually want is to 
 calculate pair-wise distance, on second thought.  Is there a pattern for 
 that?
 
 
 
 On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:
 
 You can use K-means with a suitably large k. Each cluster should correspond 
 to rows that are similar to one another.
 
 On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 What's a good way to calculate similarities between all vector-rows in a 
 matrix or RDD[Vector]?
 
 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm 
 going down a good path to transpose a matrix in order to run that.
 


Subscribe

2015-01-16 Thread Andrew Musselman



Maven out of memory error

2015-01-16 Thread Andrew Musselman
Just got the latest from Github and tried running `mvn test`; is this error
common and do you have any advice on fixing it?

Thanks!

[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal
incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin:
BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 400 Scala sources and 34 Java sources to
/home/akm/spark/core/target/scala-2.10/classes...
[WARNING]
/home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22:
imported `DataReadMethod' is permanently hidden by definition of object
DataReadMethod in package executor
[WARNING] import org.apache.spark.executor.DataReadMethod
[WARNING]  ^
[WARNING]
/home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41:
match may not be exhaustive.
It would fail on the following input: TASK_ERROR
[WARNING]   def fromMesos(mesosState: MesosTaskState): TaskState =
mesosState match {
[WARNING]  ^
[WARNING]
/home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89:
method isDirectory in class FileSystem is deprecated: see corresponding
Javadoc for more information.
[WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) {
[WARNING] ^
[ERROR] PermGen space - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError


Re: Maven out of memory error

2015-01-16 Thread Andrew Musselman
Thanks Ted, got farther along but now have a failing test; is this a known
issue?

---
 T E S T S
---
Running org.apache.spark.JavaAPISuite
Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec <<< FAILURE! - in org.apache.spark.JavaAPISuite
testGuavaOptional(org.apache.spark.JavaAPISuite)  Time elapsed: 106.5 sec <<< ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec -
in org.apache.spark.JavaJdbcRDDSuite

Results :


Tests in error:
  JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure:
Maste...

On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you try doing this before running mvn ?

 export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

 What OS are you using ?

 Cheers

 On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:

 Just got the latest from Github and tried running `mvn test`; is this
 error common and do you have any advice on fixing it?

 Thanks!

 [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-core_2.10 ---
 [WARNING] Zinc server is not available at port 3030 - reverting to normal
 incremental compile
 [INFO] Using incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
 [INFO] Compiling 400 Scala sources and 34 Java sources to
 /home/akm/spark/core/target/scala-2.10/classes...
 [WARNING]
 /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22:
 imported `DataReadMethod' is permanently hidden by definition of object
 DataReadMethod in package executor
 [WARNING] import org.apache.spark.executor.DataReadMethod
 [WARNING]  ^
 [WARNING]
 /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41:
 match may not be exhaustive.
 It would fail on the following input: TASK_ERROR
 [WARNING]   def fromMesos(mesosState: MesosTaskState): TaskState =
 mesosState match {
 [WARNING]  ^
 [WARNING]
 /home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89:
 method isDirectory in class FileSystem is deprecated: see corresponding
 Javadoc for more information.
 [WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) {
 [WARNING] ^
 [ERROR] PermGen space - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the
 -e switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions,
 please read the following articles:
 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError





Re: Maven out of memory error

2015-01-16 Thread Andrew Musselman
Thanks Sean

On Fri, Jan 16, 2015 at 12:06 PM, Sean Owen so...@cloudera.com wrote:

 Hey Andrew, you'll want to have a look at the Spark docs on building:
 http://spark.apache.org/docs/latest/building-spark.html

 It's the first thing covered there.

 The warnings are normal as you are probably building with newer Hadoop
 profiles and so old-Hadoop support code shows deprecation warnings on
 its use of old APIs.

 On Fri, Jan 16, 2015 at 8:03 PM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:
  Just got the latest from Github and tried running `mvn test`; is this
 error
  common and do you have any advice on fixing it?
 
  Thanks!
 
  [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
  spark-core_2.10 ---
  [WARNING] Zinc server is not available at port 3030 - reverting to normal
  incremental compile
  [INFO] Using incremental compilation
  [INFO] compiler plugin:
  BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
  [INFO] Compiling 400 Scala sources and 34 Java sources to
  /home/akm/spark/core/target/scala-2.10/classes...
  [WARNING]
 
 /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22:
  imported `DataReadMethod' is permanently hidden by definition of object
  DataReadMethod in package executor
  [WARNING] import org.apache.spark.executor.DataReadMethod
  [WARNING]  ^
  [WARNING]
  /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41:
  match may not be exhaustive.
  It would fail on the following input: TASK_ERROR
  [WARNING]   def fromMesos(mesosState: MesosTaskState): TaskState =
  mesosState match {
  [WARNING]  ^
  [WARNING]
 
 /home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89:
  method isDirectory in class FileSystem is deprecated: see corresponding
  Javadoc for more information.
  [WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) {
  [WARNING] ^
  [ERROR] PermGen space - [Help 1]
  [ERROR]
  [ERROR] To see the full stack trace of the errors, re-run Maven with the
 -e
  switch.
  [ERROR] Re-run Maven using the -X switch to enable full debug logging.
  [ERROR]
  [ERROR] For more information about the errors and possible solutions,
 please
  read the following articles:
  [ERROR] [Help 1]
  http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
 



Row similarities

2015-01-16 Thread Andrew Musselman
What's a good way to calculate similarities between all vector-rows in a
matrix or RDD[Vector]?

I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
going down a good path to transpose a matrix in order to run that.
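
For later readers, a minimal sketch of the columnSimilarities call in question; as the replies above note, the matrix has to be transposed first if row similarities are what you want. Function and argument names here are illustrative.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

// Pairwise cosine similarity between the *columns* of the matrix built from `vectors`.
def columnCosineSims(vectors: RDD[Vector], threshold: Double = 0.0): CoordinateMatrix = {
  val mat = new RowMatrix(vectors)
  if (threshold > 0.0) mat.columnSimilarities(threshold) // approximate, via DIMSUM sampling
  else mat.columnSimilarities()                          // exact all-pairs
}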