Re: Row similarities

2015-01-17 Thread Reza Zadeh
Pat, columnSimilarities is what that blog post is about, and is already part of Spark 1.2. rowSimilarities in a RowMatrix is a little more tricky because you can't transpose a RowMatrix easily, and is being tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Andrew, sometimes
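For reference, a minimal sketch of the API Reza mentions, assuming a live SparkContext sc; the thresholded overload is the DIMSUM variant from the blog post:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Cosine similarities between the *columns* of a RowMatrix (Spark 1.2 MLlib).
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0)))
    val mat = new RowMatrix(rows)
    val exact  = mat.columnSimilarities()     // brute-force, exact
    val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling, threshold 0.1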

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Michael Armbrust
1) The fields in the SELECT clause are not pushed down to the predicate pushdown API. I have many optimizations that allow fields to be filtered out before the resulting object is serialized on the Accumulo tablet server. How can I get the selection information from the execution plan? I'm a
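For context, a minimal sketch against the Spark 1.2 sources API under discussion (the relation name and its data are made up for illustration): the projection arrives in requiredColumns and the pushed-down predicates in filters.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

    case class ToyRelation(sqlContext: SQLContext) extends PrunedFilteredScan {
      override val schema = StructType(Seq(
        StructField("key", StringType, nullable = false),
        StructField("value", IntegerType, nullable = true)))

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // A real implementation would prune to requiredColumns and apply the
        // filters before building rows; here we just log what arrives.
        println(s"columns=${requiredColumns.mkString(",")} filters=${filters.mkString(",")}")
        sqlContext.sparkContext.parallelize(Seq(Row("a", 1), Row("b", 2)))
      }
    }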

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-17 Thread Su She
Thanks Akhil and Sean for the responses. I will try shutting down spark, then storage and then the instances. Initially, when hdfs was in safe mode, I waited for 1 hour and the problem still persisted. I will try this new method. Thanks! On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
I see now. It optimizes the selection semantics so that fewer things need to be included just to do a count(). Very nice. I did a collect() instead of a count just to see what would happen, and it looks like all the expected select fields were propagated down. Thanks. On Sat,

Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread Nathan Murthy
Originally posted here: http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

Re: Join DStream With Other Datasets

2015-01-17 Thread Jörn Franke
Can't you send a special event through spark streaming once the list is updated? So you have your normal events and a special reload event. On 17 Jan 2015 15:06, Ji ZHANG zhangj...@gmail.com wrote: Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam

Re: Performance issue

2015-01-17 Thread TJ Klein
I suspect that putting a function into a shared variable incurs additional overhead. Any suggestions on how to avoid that? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-tp21194p21210.html Sent from the Apache Spark User List mailing list

Re: Row similarities

2015-01-17 Thread Andrew Musselman
Yeah okay, thanks. On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote: Pat, columnSimilarities is what that blog post is about, and is already part of Spark 1.2. rowSimilarities in a RowMatrix is a little more tricky because you can't transpose a RowMatrix easily, and

maven doesn't build dependencies with Scala 2.11

2015-01-17 Thread Walrus theCat
Hi, When I run this: dev/change-version-to-2.11.sh and then mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package, as per https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211, maven doesn't build Spark's dependencies. Only when I run:

Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread gen tang
Hi, This is because 'ssh-ready' in the ec2 script means that all the instances are in the 'running' state and all the instance status checks are OK. In other words, the instances are ready to download and install software, just as EMR is ready for bootstrap actions. Before, the script just

Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-17 Thread Sunita Arvind
Hi, My spark jobs suddenly started hanging and here is the debug leading up to it: Following the program, it seems to be stuck whenever I do any collect(), count() or rdd.saveAsParquetFile(). AFAIK, any operation that requires data to flow back to the master causes this. I increased the memory to 5 MB.

Directory / File Reading Patterns

2015-01-17 Thread Steve Nunez
Hello Users, I've got a real-world use case that seems common enough that its pattern would be documented somewhere, but I can't find any references to a simple solution. The challenge is that data is getting dumped into a directory structure, and that directory structure itself contains
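One common approach (a sketch only; the layout below is hypothetical, not Steve's actual structure) is to read with a glob and recover the metadata encoded in the directory names from each file's path:

    // Read (path, content) pairs, then parse path components into fields.
    val files = sc.wholeTextFiles("hdfs:///data/*/*/*.log")
    val records = files.flatMap { case (path, content) =>
      // e.g. .../data/<customer>/<date>/part.log
      val parts = path.split("/")
      val (customer, date) = (parts(parts.length - 3), parts(parts.length - 2))
      content.split("\n").map(line => (customer, date, line))
    }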

Re: Row similarities

2015-01-17 Thread Pat Ferrel
BTW it looks like row and column similarities (cosine based) are coming to MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the master yet. Does anyone know the status? See: https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being pushed down to the DataRelation are not the product of the SELECT clause but rather just the columns explicitly included in the WHERE clause. Examples from my testing: SELECT * FROM myTable -- The required columns are

Re: Row similarities

2015-01-17 Thread Pat Ferrel
In the Mahout Spark R-like DSL, [A’A] and [AA’] don’t actually do a transpose; it’s optimized out. Mahout has had a standalone row matrix transpose since day 1 and supports it in the Spark version. Can’t really do matrix algebra without it even though it’s often possible to optimize it away.
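A sketch of the Samsara DSL behavior Pat describes, assuming an implicit Mahout DistributedContext is in scope (e.g. from mahoutSparkContext):

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // A small distributed row matrix; dense() builds the in-core seed.
    val drmA = drmParallelize(dense((1, 2, 3), (4, 5, 6)), numPartitions = 2)
    // The optimizer rewrites A' %*% A into a fused AtA operator, so no
    // physical transpose of A is ever materialized.
    val gram = (drmA.t %*% drmA).checkpoint()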

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Michael Armbrust
How are you running your test here? Are you perhaps doing a .count()? On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet cjno...@gmail.com wrote: Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being pushed down to the DataRelation are not the product of the SELECT clause

Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil, Those are handled by ASF infrastructure, not anyone in the Spark project. So this list is not the appropriate place to ask for help. - Patrick On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das ak...@sigmoidanalytics.com wrote: My mails to the mailing list are getting rejected, have opened a

Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Soumya Simanta
I'm deploying Spark using the Click to Deploy Hadoop - Install Apache Spark option on Google Compute Engine. I can run Spark jobs on the REPL and read data from Google storage. However, I'm not sure how to access the Spark UI in this deployment. Can anyone help? Also, it deploys Spark 1.1. Is there an

Re: maven doesn't build dependencies with Scala 2.11

2015-01-17 Thread Ted Yu
There're 3 jars under the lib_managed/jars directory with and without the -Dscala-2.11 flag. The difference between the scala-2.10 and scala-2.11 profiles is that the scala-2.10 profile has the following: <modules> <module>external/kafka</module> </modules> FYI On Sat, Jan 17, 2015 at 4:07 PM, Ted Yu

Re: Bouncing Mails

2015-01-17 Thread Akhil Das
Yep. They have sorted it out it seems. On 18 Jan 2015 03:58, Patrick Wendell pwend...@gmail.com wrote: Akhil, Those are handled by ASF infrastructure, not anyone in the Spark project. So this list is not the appropriate place to ask for help. - Patrick On Sat, Jan 17, 2015 at 12:56 AM,

Re: maven doesn't build dependencies with Scala 2.11

2015-01-17 Thread Ted Yu
I did the following: dev/change-version-to-2.11.sh followed by mvn -DHADOOP_PROFILE=hadoop-2.4 -Pyarn,hive -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package and the mvn command passed. Did you see any cross-compilation errors? Cheers BTW the two links you mentioned are consistent in terms of

How to get the master URL at runtime inside driver program?

2015-01-17 Thread guxiaobo1982
Hi, Driver programs submitted by the spark-submit script will get the runtime spark master URL, but how can the main method get that URL when creating the SparkConf object? Regards,
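A minimal sketch of one answer: leave the master unset in code and spark-submit will supply it as the spark.master property, readable once the context is up.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyApp")  // no setMaster(...) here
    val sc = new SparkContext(conf)
    val masterUrl = sc.master  // or conf.get("spark.master") after submit
    println(s"running against $masterUrl")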

Re: Row similarities

2015-01-17 Thread Andrew Musselman
Makes sense. On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote: We're focused on providing block matrices, which makes transposition simple: https://issues.apache.org/jira/browse/SPARK-3434 On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote: In the

Re: Row similarities

2015-01-17 Thread Reza Zadeh
We're focused on providing block matrices, which makes transposition simple: https://issues.apache.org/jira/browse/SPARK-3434 On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote: In the Mahout Spark R-like DSL [A’A] and [AA’] doesn’t actually do a transpose—it’s optimized
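For reference, a sketch against the block-matrix API tracked in SPARK-3434 (it shipped after this thread, in Spark 1.3, so it won't compile on 1.2):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries = sc.parallelize(Seq(
      MatrixEntry(0, 1, 2.0), MatrixEntry(1, 0, 3.0)))
    val blockMat = new CoordinateMatrix(entries).toBlockMatrix().cache()
    val transposed = blockMat.transpose  // cheap: block indices are swapped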

Re: Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Matei Zaharia
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest asking in the GCE support forum. You could also try to launch a Spark cluster by hand on nodes in there. Sigmoid Analytics published a package for this here: http://spark-packages.org/package/9 Matei On Jan 17,

Re: Multiple Spark Streaming receiver model

2015-01-17 Thread aglowik
I'm new to Spark. From my experience, when I use a single StreamingContext to create different input streams from different sources I get multiple errors and problems downstream. This seems like it is not the way to go. From what I read, creating multiple StreamingContexts is not advised. It appears
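For reference, the usual single-context pattern: several receivers, one StreamingContext, streams combined with union (a sketch; host and ports are placeholders, and local[4] leaves a core free beyond the three receivers):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("multi-receiver").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val streams = (1 to 3).map(i => ssc.socketTextStream("localhost", 9000 + i))
    val unified = ssc.union(streams)
    unified.print()
    ssc.start()
    ssc.awaitTermination()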

Not able to run spark job from code on EC2 with spark 1.2.0

2015-01-17 Thread rahulkumar-aws
Hi, I am trying to run a simple count on an S3 bucket, but with Spark 1.2.0 on EC2 it fails to run. I started my cluster using the ec2 script that came with Spark 1.2.0. Some part of the code: It works with Spark 1.1.1, but not with 1.2.0

Re: SparkSQL 1.2.0 sources API error

2015-01-17 Thread Walrus theCat
I'm getting this also, with Scala 2.11 and Scala 2.10: 15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/01/18 07:34:51 INFO Remoting: Starting remoting 15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread

Spark attempts to de/serialize using JavaSerializer despite being configured to use Kryo

2015-01-17 Thread waymost
I'm new to Spark and have run into issues using Kryo for serialization instead of Java. I have my SparkConf configured as such: val conf = new SparkConf().setMaster("local").setAppName("test") .set("spark.kryo.registrationRequired", "false") .set("spark.serializer",
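The setting the snippet truncates is presumably the Kryo serializer class; a sketch of the complete configuration, with one caveat that often explains this symptom:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "false")
    val sc = new SparkContext(conf)
    // Caveat: spark.serializer governs shuffle/cache data only; task closures
    // are always serialized with the Java serializer, so JavaSerializer can
    // still legitimately appear in stack traces.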

spark-submit --py-files remote: Only local additional python files are supported

2015-01-17 Thread voukka
Hi all! I ran into this problem when I tried running a Python application on Amazon's EMR YARN cluster. It is possible to run the bundled example applications on EMR, but I cannot figure out how to run a slightly more complex Python application that depends on some other Python scripts. I tried adding

Re: Is spark suitable for large scale pagerank, such as 200 millionnodes, 2 billion edges?

2015-01-17 Thread txw
I’ve read these pages. In the paper "GraphX: Graph Processing in a Distributed Dataflow Framework", the authors claim that it only takes 400 seconds for the uk-2007-05 dataset, which is of similar size to my dataset. Is the current GraphX the same version as the GraphX in that paper? And how many

Bouncing Mails

2015-01-17 Thread Akhil Das
My mails to the mailing list are getting rejected, have opened a Jira issue, can someone take a look at it? https://issues.apache.org/jira/browse/INFRA-9032 Thanks Best Regards

Re: Futures timed out during unpersist

2015-01-17 Thread Akhil Das
What is the data size? Have you tried increasing the driver memory? Thanks Best Regards On Sat, Jan 17, 2015 at 1:01 PM, Kevin (Sangwoo) Kim kevin...@apache.org wrote: Hi experts, I got an error during unpersist of an RDD. Any ideas? java.util.concurrent.TimeoutException: Futures timed out

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-17 Thread Akhil Das
The safest way would be to first shut down HDFS and then shut down Spark (calling stop-all.sh would do), and then shut down the machines. You can execute the following command to disable safe mode: hadoop dfsadmin -safemode leave Thanks Best Regards On Sat, Jan 17, 2015 at 8:31 AM, Su She

Re: Problem with File Streams

2015-01-17 Thread Akhil Das
Try: JavaPairDStream<String, String> foo = ssc.<String, String, SequenceFileInputFormat>fileStream("/sigmoid/foo"); Thanks Best Regards On Sat, Jan 17, 2015 at 4:24 AM, Leonidas Fegaras fega...@cse.uta.edu wrote: Dear Spark users, I have a problem using File Streams in Java on Spark 1.2.0. I can

No Output

2015-01-17 Thread Deep Pradhan
Hi, I am using Spark 1.0.0 on a single-node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone

Re: remote Akka client disassociated - some timeout?

2015-01-17 Thread Akhil Das
Try setting the following property: .set("spark.akka.frameSize", "50") Also make sure that Spark is able to read from HBase (you can try it with a small amount of data). Thanks Best Regards On Fri, Jan 16, 2015 at 11:30 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I believe this is some

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-17 Thread Cheng, Hao
Wow, glad to know that it works well, and sorry, the Jira is another issue, which is not the same case here. From: Bagmeet Behera [mailto:bagme...@gmail.com] Sent: Saturday, January 17, 2015 12:47 AM To: Cheng, Hao Subject: Re: using hiveContext to select a nested Map-data-type from an

Re: No Output

2015-01-17 Thread Akhil Das
Can you paste the code? Also you can try updating your spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 in a single node cluster. When I run a job with small data set it runs perfectly but when I use

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-17 Thread Sean Owen
You would not want to turn off storage underneath Spark. Shut down Spark first, then storage, then shut down the instances. Reverse the order when restarting. HDFS will be in safe mode for a short time after being started before it becomes writeable. I would first check that it's not just that.

Spark Streaming

2015-01-17 Thread Rohit Pujari
Hello Folks: I'm running into the following error while executing relatively straightforward spark-streaming code. Am I missing anything? Exception in thread "main" java.lang.AssertionError: assertion failed: No output streams registered, so nothing to execute Code: val conf = new

Re: Spark Streaming

2015-01-17 Thread Akhil Das
You need to trigger some action (stream.print(), stream.foreachRDD, stream.saveAs*) over the stream that you created for the entire pipeline to execute. In your code, add the following line: unifiedStream.print() Thanks Best Regards On Sat, Jan 17, 2015 at 3:35 PM, Rohit Pujari
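A minimal sketch of the fix: without an output operation such as print(), a streaming job fails with exactly this assertion.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()  // the output operation that registers the stream
    ssc.start()
    ssc.awaitTermination()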

Re: Spark Streaming

2015-01-17 Thread Rohit Pujari
Hi Francois: I tried using print(kafkaStream) as the output operator but no luck. It throws the same error. Any other thoughts? Thanks, Rohit From: francois.garil...@typesafe.com Date:

Re: Spark Streaming

2015-01-17 Thread Sean Owen
Not print(kafkaStream), which would just print some String description of the stream to the console, but kafkaStream.print(), which actually invokes the print operation on the stream. On Sat, Jan 17, 2015 at 10:17 AM, Rohit Pujari rpuj...@hortonworks.com wrote: Hi Francois: I tried using

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-17 Thread Sean Owen
I'm not sure how you are setting these values though. Where is spark.yarn.executor.memoryOverhead=6144 ? Env variables aren't the best way to set configuration either. Again have a look at http://spark.apache.org/docs/latest/running-on-yarn.html ... --executor-memory 22g --conf

Re: Spark Streaming

2015-01-17 Thread Rohit Pujari
That was it. Thanks Akhil and Owen for your quick response. On Sat, Jan 17, 2015 at 4:27 AM, Sean Owen so...@cloudera.com wrote: Not print(kafkaStream), which would just print some String description of the stream to the console, but kafkaStream.print(), which actually invokes the print

Re: Maven out of memory error

2015-01-17 Thread Sean Owen
Hm, this test hangs for me in IntelliJ. It could be a real problem, and a combination of a) just recently actually enabling Java tests, b) recent updates to the complicated Guava shading situation. The manifestation of the error usually suggests that something totally failed to start (because of,

Re: spark 1.2 compatibility

2015-01-17 Thread Chitturi Padma
Yes. I built Spark 1.2 with Apache Hadoop 2.2. No compatibility issues. On Sat, Jan 17, 2015 at 4:47 AM, bhavyateja [via Apache Spark User List] ml-node+s1001560n21197...@n3.nabble.com wrote: Is Spark 1.2 compatible with HDP 2.1? -- If you reply to this email,

Re: remote Akka client disassociated - some timeout?

2015-01-17 Thread Ted Yu
Antony: Please check hbase master log to see if there was something noticeable in that period of time. If the hbase cluster is not big, check region server log as well. Cheers On Jan 16, 2015, at 10:00 AM, Antony Mayi antonym...@yahoo.com.INVALID wrote: Hi, I believe this is some

Re: Discourse: A proposed alternative to the Spark User list

2015-01-17 Thread pzecevic
Hi, guys! I'm reviving this old question from Nick Chammas with a new proposal: what do you think about creating a separate Stack Exchange 'Apache Spark' site (like 'philosophy' and 'English' etc.)? I'm not sure what would be the best way to deal with user and dev lists, though - to merge them

Error occurs when running Spark SQL example

2015-01-17 Thread bit1...@163.com
When I run the following Spark SQL example within IDEA, I get a StackOverflowError; it looks like the scala.util.parsing.combinator.Parsers are recursing infinitely. Has anyone encountered this? package spark.examples import org.apache.spark.{SparkContext, SparkConf} import

spark error in yarn-client mode

2015-01-17 Thread Kyounghyun Park
Hi, I'm running Spark 1.2 in yarn-client mode (using Hadoop 2.6.0). On VirtualBox, I can run spark-shell --master yarn-client without any error. However, on a physical machine, I get the following error. Does anyone know why this happens? Any help would be appreciated. Thanks, Kyounghyun

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-17 Thread Antony Mayi
the values are for sure applied as expected - confirmed using the Spark UI Environment page... it comes from my defaults configured using 'spark.yarn.executor.memoryOverhead=8192' (yes, now increased even more) in /etc/spark/conf/spark-defaults.conf and 'export SPARK_EXECUTOR_MEMORY=24G' in

Re: Futures timed out during unpersist

2015-01-17 Thread Kevin (Sangwoo) Kim
The data size is about 300-400 GB; I'm using an 800 GB cluster and set driver memory to 50 GB. On Sat Jan 17 2015 at 6:01:46 PM Akhil Das ak...@sigmoidanalytics.com wrote: What is the data size? Have you tried increasing the driver memory? Thanks Best Regards On Sat, Jan 17, 2015 at 1:01 PM, Kevin

Is cluster mode is supported by the submit command for standalone clusters?

2015-01-17 Thread guxiaobo1982
Hi, The submitting applications guide at http://spark.apache.org/docs/latest/submitting-applications.html says: Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network

Re: Discourse: A proposed alternative to the Spark User list

2015-01-17 Thread Andrew Ash
People can continue using the stack exchange sites as is with no additional work from the Spark team. I would not support migrating our mailing lists yet again to another system like Discourse because I fear fragmentation of the community between the many sites. On Sat, Jan 17, 2015 at 6:24 AM,

Join DStream With Other Datasets

2015-01-17 Thread Ji ZHANG
Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other is to use the transform operation as described in the manual. But the problem is the spam IP list will be updated
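A sketch of the transform-based option (paths and names are illustrative): because the transform closure is re-evaluated on every batch, a reference it reads can be swapped to a freshly loaded RDD between batches.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    @volatile var spamIps: RDD[(String, Boolean)] =
      sc.textFile("hdfs:///spam/ips").map(ip => (ip, true))

    def clean(clicks: DStream[(String, String)]): DStream[(String, String)] =
      clicks.transform { rdd =>
        rdd.leftOuterJoin(spamIps)
           .filter { case (_, (_, spam)) => spam.isEmpty }  // keep non-spam IPs
           .map { case (ip, (click, _)) => (ip, click) }
      }
    // A timer thread can periodically reassign spamIps with a reloaded RDD.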

Re: spark 1.2 compatibility

2015-01-17 Thread bhavyateja
Hi, Did you try using Spark 1.2 on HDP 2.1 YARN? Can you please go through the thread http://apache-spark-user-list.1001560.n3.nabble.com/Troubleshooting-Spark-tt21189.html and check where I am going wrong? My word count program errors out when using Spark 1.2 on YARN, but it's getting

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-17 Thread Yana Kadiyska
Just wondering if you've made any progress on this -- I'm having the same issue. My attempts to help myself are documented here http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E . I don't believe I have the

Re: Discourse: A proposed alternative to the Spark User list

2015-01-17 Thread Nicholas Chammas
The Stack Exchange community will not support creating a whole new site just for Spark (otherwise you’d see dedicated sites for much larger topics like “Python”). Their tagging system works well enough to separate questions about different topics, and the apache-spark

Re: Problem with File Streams

2015-01-17 Thread Leonidas Fegaras
My key/value classes are custom serializable classes. It looks like a bug, so I filed it on JIRA as SPARK-5297. Thanks Leonidas On 01/17/2015 03:07 AM, Akhil Das wrote: Try: JavaPairDStream<String, String> foo = ssc.<String, String, SequenceFileInputFormat>fileStream("/sigmoid/foo");

Re: spark 1.2 compatibility

2015-01-17 Thread bhavyateja
Hi all, Thanks for your contributions. We have checked and confirmed that HDP 2.1 YARN doesn't work with Spark 1.2. On Sat, Jan 17, 2015 at 9:11 AM, bhavya teja potineni bhavyateja.potin...@gmail.com wrote: Hi Did you try using spark 1.2 on hdp 2.1 YARN Can you please go thru the thread

Re: spark 1.2 compatibility

2015-01-17 Thread Chitturi Padma
It worked for me. spark 1.2.0 with hadoop 2.2.0 On Sat, Jan 17, 2015 at 9:39 PM, bhavyateja [via Apache Spark User List] ml-node+s1001560n21207...@n3.nabble.com wrote: Hi all, Thanks for your contribution. We have checked and confirmed that HDP 2.1 YARN don't work with Spark 1.2 On Sat,

Re: Maven out of memory error

2015-01-17 Thread Andrew Musselman
Failing for me and another team member on the command line, for what it's worth. On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote: Hm, this test hangs for me in IntelliJ. It could be a real problem, and a combination of a) just recently actually enabling Java tests, b) recent

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
I did an initial implementation. There are two assumptions I had from the start that I was very surprised were not part of the predicate pushdown API: 1) The fields in the SELECT clause are not pushed down to the predicate pushdown API. I have many optimizations that allow fields to be filtered

Re: Row similarities

2015-01-17 Thread Andrew Musselman
Thanks Reza, interesting approach. I think what I actually want is to calculate pair-wise distance, on second thought. Is there a pattern for that? On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote: You can use K-means with a suitably large k. Each cluster should correspond
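One pattern for pair-wise distances (a sketch; cartesian is quadratic in the row count, so this only suits modest matrices):

    import org.apache.spark.rdd.RDD

    def pairwiseDistances(rows: RDD[Array[Double]]): RDD[((Long, Long), Double)] = {
      val indexed = rows.zipWithIndex().map(_.swap)  // (rowId, vector)
      indexed.cartesian(indexed)
        .filter { case ((i, _), (j, _)) => i < j }   // each unordered pair once
        .map { case ((i, u), (j, v)) =>
          // Euclidean distance between rows i and j
          val d = math.sqrt(u.zip(v).map { case (a, b) => (a - b) * (a - b) }.sum)
          ((i, j), d)
        }
    }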

Re: Row similarities

2015-01-17 Thread Suneel Marthi
Andrew, you would be better off using Mahout's RowSimilarityJob for what you are trying to accomplish. 1. It does give you pair-wise distances. 2. You can specify the distance measure you are looking to use. 3. There's the old MapReduce impl and the Spark DSL impl, per your preference. From: Andrew

Re: spark 1.2 compatibility

2015-01-17 Thread bhavyateja
Yes it works with 2.2 but we are trying to use spark 1.2 on HDP 2.1 On Sat, Jan 17, 2015, 11:18 AM Chitturi Padma [via Apache Spark User List] ml-node+s1001560n21208...@n3.nabble.com wrote: It worked for me. spark 1.2.0 with hadoop 2.2.0 On Sat, Jan 17, 2015 at 9:39 PM, bhavyateja [via

Re: Maven out of memory error

2015-01-17 Thread Ted Yu
The test passed here: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1215/consoleFull It passed locally with the following command: mvn -DHADOOP_PROFILE=hadoop-2.4 -Phadoop-2.4 -Pyarn -Phive test -Dtest=JavaAPISuite FYI

Re: Row similarities

2015-01-17 Thread Pat Ferrel
Mahout’s Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [A’A] for columns or [AA’] for rows first, then calculates the distance (LLR) for non-zero elements. This is a major

Re: Row similarities

2015-01-17 Thread Andrew Musselman
Excellent, thanks Pat. On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote: Mahout’s Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [A’A] for