Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi again, On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote: Now I'm wondering where this comes from (I haven't touched this component in a while, nor upgraded Spark etc.) [...] So the reason that the error is showing up now is that suddenly data from a different

RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
Hi Tobias, Can you provide how you create the JsonRDD? Thanks, Daoyuan From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Friday, January 16, 2015 4:01 PM To: user Subject: Re: MatchError in JsonRDD.toLong Hi again, On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer

Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Can you provide how you create the JsonRDD? This should be reproducible in the Spark shell: - import org.apache.spark.sql._ val sqlc = new SQLContext(sc)
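
For illustration, a minimal spark-shell sketch of the kind of reproduction being discussed; the field name and values below are made up, and the idea is simply that one record's number fits in a Long while another does not, so the inferred type and the incoming data can disagree:

    import org.apache.spark.sql._
    val sqlc = new SQLContext(sc)
    // one value fits in a Long, the other exceeds Long.MaxValue
    val json = sc.parallelize(Seq(
      """{"value": 1}""",
      """{"value": 92233720368547758070}"""))
    val table = sqlc.jsonRDD(json)
    table.printSchema()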

RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
The second parameter of jsonRDD is the sampling ratio when we infer schema. Thanks, Daoyuan From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Friday, January 16, 2015 5:11 PM To: Wang, Daoyuan Cc: user Subject: Re: MatchError in JsonRDD.toLong Hi, On Fri, Jan 16, 2015 at 5:55 PM, Wang,

RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
And you can use jsonRDD(json: RDD[String], schema: StructType) to explicitly specify your schema. For numbers larger than Long, we can use DecimalType. Thanks, Daoyuan From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Friday, January 16, 2015 5:14 PM To: Tobias Pfeiffer Cc: user Subject: RE:
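
A sketch of what passing an explicit schema could look like; the field names are illustrative, jsonStrings is an assumed RDD[String], and the exact DecimalType constructor has changed across Spark versions:

    import org.apache.spark.sql._
    val sqlc = new SQLContext(sc)
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("value", DecimalType.Unlimited, nullable = true)))  // can hold numbers larger than Long
    val table = sqlc.jsonRDD(jsonStrings, schema)   // jsonStrings: RDD[String], assumed to exist
    table.printSchema()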

MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Zork Sail
I am trying to use Spark MLib ALS with implicit feedback for collaborative filtering. Input data has only two fields `userId` and `productId`. I have **no product ratings**, just info on what products users have bought, that's all. So to train ALS I use: def trainImplicit(ratings:
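
A hedged sketch of what the training call can look like when every purchase is treated as an implicit preference of 1.0; the `purchases` RDD and all parameter values below are made up:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    // purchases: RDD[(Int, Int)] of (userId, productId), assumed to exist
    val ratings = purchases.map { case (userId, productId) => Rating(userId, productId, 1.0) }
    val model = ALS.trainImplicit(ratings, rank = 10, iterations = 10, lambda = 0.01, alpha = 40.0)
    val score = model.predict(42, 17)   // predicted preference for a hypothetical user/product pair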

Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-16 Thread Xiaoyu Wang
Hi all! In Spark SQL 1.2.0 I create a Hive table with a custom Parquet inputformat and outputformat, like this: CREATE TABLE test( id string, msg string) CLUSTERED BY ( id) SORTED BY ( id ASC) INTO 10 BUCKETS ROW FORMAT SERDE 'com.a.MyParquetHiveSerDe' STORED AS INPUTFORMAT

terasort on spark

2015-01-16 Thread lonely Feb
Hi all, I tried to run a terasort benchmark on my Spark cluster, but I found it is hard to find a standard Spark terasort program except a PR from rxin and Ewan Higgs: https://github.com/apache/spark/pull/1242 https://github.com/ehiggs/spark/tree/terasort The example which rxin provided without

Re: How to define SparkContext with Cassandra connection for spark-jobserver?

2015-01-16 Thread Sasi
Thank you Abhishek. The code works. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-define-SparkContext-with-Cassandra-connection-for-spark-jobserver-tp21119p21184.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Julian Ricardo
I am trying to use Spark MLib ALS with implicit feedback for collaborative filtering. Input data has only two fields `userId` and `productId`. I have **no product ratings**, just info on what products users have bought, that's all. So to train ALS I use: def trainImplicit(ratings:

MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Zork
I am trying to use Spark MLib ALS with implicit feedback for collaborative filtering. Input data has only two fields `userId` and `productId`. I have **no product ratings**, just info on what products users have bought, that's all. So to train ALS I use: def

Fwd: Problems with TeraValidate

2015-01-16 Thread lonely Feb
+spark-user -- Forwarded message -- From: lonely Feb lonely8...@gmail.com Date: 2015-01-16 19:09 GMT+08:00 Subject: Re: Problems with TeraValidate To: Ewan Higgs ewan.hi...@ugent.be thx a lot. btw, here is my output: 1. when dataset is 1000g: num records: 100 checksum:

Re: Spark SQL Parquet - data are reading very very slow

2015-01-16 Thread Mick Davies
I have seen similar results. I have looked at the code and I think there are a couple of contributors: Encoding/decoding java Strings to UTF8 bytes is quite expensive. I'm not sure what you can do about that. But there are options for optimization due to the repeated decoding of the same String

Re: MissingRequirementError with spark

2015-01-16 Thread sarsol
Created JIRA issue https://issues.apache.org/jira/browse/SPARK-5281 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21188.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executor vs Mapper in Hadoop

2015-01-16 Thread Shuai Zheng
Thanks a lot for clarifying it. Then there are some follow-up questions: 1. If normally we have 1 executor per machine, and we have a cluster with different hardware capacity, for example one 8-core worker and one 4-core worker (ignore the driver machine), then if we set executor-cores=4, then

RE: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-16 Thread yana
I think you might need to set spark.sql.hive.convertMetastoreParquet to false if I understand that flag correctly. Sent on the new Sprint Network from my Samsung Galaxy S®4. Original message From: Xiaoyu Wang wangxy...@gmail.com Date: 01/16/2015 5:09 AM
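
For reference, a minimal sketch of how that flag could be set from a HiveContext; the query reuses the table name from the question above:

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    // ask Spark SQL not to replace the Hive SerDe path with its native ParquetTableScan
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
    hiveContext.sql("SELECT id, msg FROM test").collect()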

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-16 Thread Xiaoyu Wang
Thanks yana! I will try it! On Jan 16, 2015, at 20:51, yana yana.kadiy...@gmail.com wrote: I think you might need to set spark.sql.hive.convertMetastoreParquet to false if I understand that flag correctly. Sent on the new Sprint Network from my Samsung Galaxy S®4.

RE: How to force parallel processing of RDD using multiple thread

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
Does parallel processing mean it is executed in multiple workers, or executed in one worker but with multiple threads? For example, if I have only one worker but my RDD has 4 partitions, will it be executed in parallel in 4 threads? The reason I am asking is to try to decide whether I need to configure spark
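
As a rough sketch of the single-JVM case being asked about (master and parameter values are illustrative): one executor or local JVM runs one task per available core, so 4 partitions can run in 4 threads at once:

    import org.apache.spark.{SparkConf, SparkContext}
    // 4 threads in one JVM; a standalone worker behaves similarly, one task per executor core
    val conf = new SparkConf().setMaster("local[4]").setAppName("parallelism-demo")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 1000, 4)   // 4 partitions -> up to 4 concurrent tasks
    rdd.map(_ * 2).count()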

Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
I have asked this question before but got no answer. Asking again. Can I save an RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile("file:///home/data/rdd1") val rdd2 =

RE: How to force parallel processing of RDD using multiple thread

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
So one worker is enough and it will use all 4 cores? In what situation shall I configure more workers in my single node cluster? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent:

How to 'Pipe' Binary Data in Apache Spark

2015-01-16 Thread Nick Allen
I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program. This code is representative

Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Sean Owen
On Fri, Jan 16, 2015 at 9:58 AM, Zork Sail zorks...@gmail.com wrote: And then train ALS: val model = ALS.trainImplicit(ratings, rank, numIter) I get RMSE 0.9, which is a big error in case of preferences taking 0 or 1 value: This is likely the problem. RMSE is not an appropriate

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-16 Thread Sean Owen
Well it looks like you're reading some kind of binary file as text. That isn't going to work, in Spark or elsewhere, as binary data is not even necessarily the valid encoding of a string. There are no line breaks to delimit lines and thus elements of the RDD. Your input has some record structure

Re: Scala vs Python performance differences

2015-01-16 Thread philpearl
I was interested in this as I had some Spark code in Python that was too slow and wanted to know whether Scala would fix it for me. So I re-wrote my code in Scala. In my particular case the Scala version was 10 times faster. But I think that is because I did an awful lot of computation in my

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-16 Thread Nick Allen
Per your last comment, it appears I need something like this: https://github.com/RIPE-NCC/hadoop-pcap Thanks a ton. That got me oriented in the right direction. On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen so...@cloudera.com wrote: Well it looks like you're reading some kind of binary file

kinesis multiple records adding into stream

2015-01-16 Thread Hafiz Mujadid
Hi Experts! I am using the kinesis dependency as follows: groupId = org.apache.spark artifactId = spark-streaming-kinesis-asl_2.10 version = 1.2.0 In this, aws sdk version 1.8.3 is being used, and in this sdk multiple records cannot be put in a single request. Is it possible to put multiple records in a

Re: kinesis multiple records adding into stream

2015-01-16 Thread Aniket Bhatnagar
Sorry. I couldn't understand the issue. Are you trying to send data to kinesis from a spark batch/real time job? - Aniket On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi Experts! I am using kinesis dependency as follow groupId = org.apache.spark artifactId =

Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Imran Rashid
I'm not positive, but I think this is very unlikely to work. First, when you call sc.objectFile(...), I think the *driver* will need to know something about the file, eg to know how many tasks to create. But it won't even be able to see the file, since it only lives on the local filesystem of

remote Akka client disassociated - some timeout?

2015-01-16 Thread Antony Mayi
Hi, I believe this is some kind of timeout problem but I can't figure out how to increase it. I am running spark 1.2.0 on yarn (all from cdh 5.3.0). I submit a python task which first loads a big RDD from hbase - I can see in the screen output that all executors fire up, then no more logging output for

RE: Determine number of running executors

2015-01-16 Thread Shuai Zheng
Hi Tobias, Can you share more information about how you do that? I also have a similar question about this. Thanks a lot, Regards, Shuai From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Wednesday, November 26, 2014 12:25 AM To: Sandy Ryza Cc: Yanbo Liang; user Subject:

Re: kinesis multiple records adding into stream

2015-01-16 Thread Kelly, Jonathan
Are you referring to the PutRecords method, which was added in 1.9.9? (See http://aws.amazon.com/releasenotes/1369906126177804) If so, can't you just depend upon this later version of the SDK in your app even though spark-streaming-kinesis-asl is depending upon this earlier 1.9.3 version that
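
A hedged sketch of what a PutRecords call could look like from driver code, assuming the newer AWS SDK suggested above is on the classpath; the stream name, partition keys, and record contents are made up:

    import java.nio.ByteBuffer
    import scala.collection.JavaConverters._
    import com.amazonaws.services.kinesis.AmazonKinesisClient
    import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}

    val client = new AmazonKinesisClient()   // credentials from the default provider chain
    val entries = (1 to 10).map { i =>
      new PutRecordsRequestEntry()
        .withData(ByteBuffer.wrap(s"record-$i".getBytes("UTF-8")))
        .withPartitionKey(s"key-$i")
    }.asJava
    val request = new PutRecordsRequest()
      .withStreamName("my-stream")
      .withRecords(entries)
    client.putRecords(request)   // several records in one request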

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
I need to save an RDD to the file system and then restore my RDD from the file system in the future. I don’t have any HDFS file system and don’t want to go through the hassle of setting up an HDFS system. So how can I achieve this? The application needs to be run on a cluster with multiple nodes. Regards,
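
For illustration, a sketch of the save/restore pair under the assumption that the path resolves to the same files on every node (for example an NFS mount); the path and element type are made up, and as noted in the earlier reply this is unlikely to work with purely node-local disks:

    // rdd: RDD[String], assumed to exist
    rdd.saveAsObjectFile("file:///shared/data/rdd1")
    val restored = sc.objectFile[String]("file:///shared/data/rdd1")
    restored.count()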

Re: Failing jobs runs twice

2015-01-16 Thread Anders Arpteg
FYI, I just confirmed with the latest Spark 1.3 snapshot that the spark.yarn.maxAppAttempts setting that SPARK-2165 refers to works perfectly. Great to finally get rid of this problem. It also caused an issue when the eventLogs were enabled, since the spark-events/appXXX folder already exists; the
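
A one-line sketch of the setting being referred to, usable with the Spark 1.3 snapshot mentioned above; the value is illustrative:

    // ask YARN for a single application attempt so a failed job is not rerun
    val conf = new org.apache.spark.SparkConf().set("spark.yarn.maxAppAttempts", "1")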

If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-16 Thread Peng Cheng
I'm talking about RDD1 (not persisted or checkpointed) in this situation:

    ...(somewhere) - RDD1 - RDD2
                      |      |
                      V      V
                     RDD3 - RDD4 - Action!

To my experience the chance RDD1 get
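
A sketch of the scenario with made-up transformations; persisting RDD1 is the usual way to guarantee it is evaluated only once even though two branches of the DAG depend on it:

    val rdd1 = input.map(expensive)      // input and expensive are hypothetical
    rdd1.cache()                         // without this, both branches may recompute rdd1
    val rdd2 = rdd1.map(f)               // f and g are hypothetical functions
    val rdd3 = rdd1.map(g)
    val rdd4 = rdd2.zip(rdd3).map { case (a, b) => (a, b) }
    rdd4.count()                         // single action triggering the whole DAG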

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-16 Thread Nick Allen
I just wanted to reiterate the solution for the benefit of the community. The problem is not from my use of 'pipe', but that 'textFile' cannot be used to read in binary data. (Doh) There are a couple options to move forward. 1. Implement a custom 'InputFormat' that understands the binary input
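
One related approach (not necessarily the exact options the post goes on to list) is sc.binaryFiles, available as of Spark 1.2, which yields whole files as byte streams instead of decoding them as text; the path here is made up:

    // each element is (path, PortableDataStream); nothing is interpreted as text
    val files = sc.binaryFiles("hdfs:///data/pcap/")
    val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }
    sizes.collect().foreach(println)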

Re: Scala vs Python performance differences

2015-01-16 Thread Davies Liu
Hey Phil, Thank you for sharing this. The result didn't surprise me a lot; it's normal to do the prototype in Python, and once it gets stable and you really need the performance, rewriting part of it in C or all of it in another language does make sense, and it will not cost you much time. Davies On

spark java options

2015-01-16 Thread Kane Kim
I want to add some java options when submitting an application: --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder But it looks like it doesn't get set. Where can I add it to make it work? Thanks.

Re: spark java options

2015-01-16 Thread Marcelo Vanzin
Hi Kane, What's the complete command line you're using to submit the app? Where do you expect these options to appear? On Fri, Jan 16, 2015 at 11:12 AM, Kane Kim kane.ist...@gmail.com wrote: I want to add some java options when submitting an application: --conf

Re: spark java options

2015-01-16 Thread Marcelo Vanzin
Hi Kane, Here's the command line you sent me privately: ./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class SimpleApp --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder --master local simpleapp.jar ./test.log You're running the app in local mode. In that

Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread olegshirokikh
The question is about the ways to create a Windows desktop-based and/or web-based application client that is able to connect and talk to the server containing a Spark application (either local or on-premise cloud distributions) at run time. Any language/architecture may work. So far, I've seen

Re: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Corey Nolet
There's also an example of running a SparkContext in a java servlet container from Calrissian: https://github.com/calrissian/spark-jetty-server On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote: The question is about the ways to create a Windows desktop-based and/or

Subscribe

2015-01-16 Thread Andrew Musselman

Re: Subscribe

2015-01-16 Thread Ted Yu
Send email to user-subscr...@spark.apache.org Cheers On Fri, Jan 16, 2015 at 11:51 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

Maven out of memory error

2015-01-16 Thread Andrew Musselman
Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks! [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 --- [WARNING] Zinc server is not available at port 3030 - reverting to normal

Re: Maven out of memory error

2015-01-16 Thread Ted Yu
Can you try doing this before running mvn? export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" What OS are you using? Cheers On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got the latest from Github and tried running `mvn

Re: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Robert C Senkbeil
Hi, You can take a look at the Spark Kernel project: https://github.com/ibm-et/spark-kernel The Spark Kernel's goal is to serve as the foundation for interactive applications. The project provides a client library in Scala that abstracts connecting to the kernel (containing a Spark Context),

Re: Maven out of memory error

2015-01-16 Thread Andrew Musselman
Thanks Ted, got farther along but now have a failing test; is this a known issue? --- T E S T S --- Running org.apache.spark.JavaAPISuite Tests run: 72, Failures: 0, Errors: 1, Skipped: 0,

Re: Maven out of memory error

2015-01-16 Thread Andrew Musselman
Thanks Sean On Fri, Jan 16, 2015 at 12:06 PM, Sean Owen so...@cloudera.com wrote: Hey Andrew, you'll want to have a look at the Spark docs on building: http://spark.apache.org/docs/latest/building-spark.html It's the first thing covered there. The warnings are normal as you are probably

Performance issue

2015-01-16 Thread TJ Klein
Hi, I observed a weird performance issue using Spark in combination with Theano, and I have no real explanation for it. To exemplify the issue I am using the pi.py example of Spark that computes pi: When I modify the function from the example: # unmodified code def f(_): x =

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-16 Thread Michael Armbrust
You can get the internal RDD using: schemaRDD.queryExecution.toRDD. This is used internally and does not copy. This is an unstable developer API. On Thu, Jan 15, 2015 at 11:26 PM, Nathan McCarthy nathan.mccar...@quantium.com.au wrote: Thanks Cheng! Is there any API I can get access to

Re: Maven out of memory error

2015-01-16 Thread Ted Yu
I got the same error: testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 261.111 sec ERROR! org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org

Cluster Aware Custom RDD

2015-01-16 Thread Jim Carroll
Hello all, I have a custom RDD for fast loading of data from a non-partitioned source. The partitioning happens in the RDD implementation by pushing data from the source into queues picked up by the current active partitions in worker threads. This works great on a multi-threaded single host

Spark random forest - string data

2015-01-16 Thread Asaf Lahav
Hi, I have been playing around with the new version of the Spark MLlib Random forest implementation, and while in the process, tried it with a file with String features. While training, it fails with: java.lang.NumberFormatException: For input string. Is the MLlib Random forest adapted to run on top of

RE: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Oleg Shirokikh
Thanks a lot, Robert – I’ll definitely investigate this and probably would come back with questions. P.S. I’m new to this Spark forum. I’m getting responses through emails but they are not appearing as “replies” in the thread – it’s kind of inconvenient. Is it something that I should tweak?

Re: Spark random forest - string data

2015-01-16 Thread Sean Owen
The implementation accepts an RDD of LabeledPoint only, so you couldn't feed in strings from a text file directly. LabeledPoint is a wrapper around double values rather than strings. How were you trying to create the input then? No, it only accepts numeric values, although you can encode
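
A hedged sketch of the numeric-encoding approach: map each string category to an index, build LabeledPoints, and tell the trainer which feature is categorical. The input format, category names, and parameter values are all made up:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    val categories = Map("red" -> 0.0, "green" -> 1.0, "blue" -> 2.0)
    // lines: RDD[String] of "label,color" records, assumed to exist
    val data = lines.map { line =>
      val Array(label, color) = line.split(",")
      LabeledPoint(label.toDouble, Vectors.dense(categories(color)))
    }
    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map(0 -> 3),   // feature 0 takes 3 distinct values
      numTrees = 10,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32)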

Re: Using Spark SQL with multiple (avro) files

2015-01-16 Thread Michael Armbrust
I'd open an issue on GitHub to ask us to allow you to use Hadoop's glob file format for the path. On Thu, Jan 15, 2015 at 4:57 AM, David Jones letsnumsperi...@gmail.com wrote: I've tried this now. Spark can load multiple avro files from the same directory by passing a path to a directory.

Re: Spark random forest - string data

2015-01-16 Thread Nick Allen
An alternative approach would be to translate your categorical variables into dummy variables. If your strings represent N classes/categories you would generate N-1 dummy variables containing 0/1 values. Auto-magically creating dummy variables from categorical data definitely comes in handy. I

Re: Spark random forest - string data

2015-01-16 Thread Andy Twigg
Hi Asaf, featurestream [1] is an internal project I'm playing with that includes support for some of this, in particular: * 1-pass random forest construction * schema inference * native support for text fields Would this be of interest? It's not open source, but if there's sufficient demand I

spark 1.2 compatibility

2015-01-16 Thread bhavyateja
Is Spark 1.2 compatible with HDP 2.1? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-compatibility-tp21197.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Processing .wav files in PySpark

2015-01-16 Thread Davies Liu
I think you cannot use textFile() or binaryFile() or pickleFile() here; wav is a different format than these expect. You could get a list of paths for all the files, then sc.parallelize(), and foreach(): def process(path): # use subprocess to launch a process to do the job, read the stdout as result

Row similarities

2015-01-16 Thread Andrew Musselman
What's a good way to calculate similarities between all vector-rows in a matrix or RDD[Vector]? I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm going down a good path to transpose a matrix in order to run that.
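
For context, a short sketch of what columnSimilarities computes; because it compares columns, the data would have to be laid out with rows as columns (or a different method used entirely), so treat this as illustrative only, with made-up vectors:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.5, 1.5, 0.0)))
    val mat = new RowMatrix(rows)
    // cosine similarities between COLUMNS, returned as a CoordinateMatrix
    val sims = mat.columnSimilarities()
    sims.entries.take(5).foreach(println)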

RE: spark 1.2 compatibility

2015-01-16 Thread Judy Nash
Yes. It's compatible with HDP 2.1. -Original Message- From: bhavyateja [mailto:bhavyateja.potin...@gmail.com] Sent: Friday, January 16, 2015 3:17 PM To: user@spark.apache.org Subject: spark 1.2 compatibility Is Spark 1.2 compatible with HDP 2.1? -- View this message in context:

RE: spark 1.2 compatibility

2015-01-16 Thread Judy Nash
Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have not seen a problem. However officially HDP 2.1 + Spark 1.2 is not a supported scenario. -Original Message- From: Judy Nash Sent: Friday, January 16, 2015 5:35 PM To: 'bhavyateja'; user@spark.apache.org

Re: spark 1.2 compatibility

2015-01-16 Thread Matei Zaharia
The Apache Spark project should work with it, but I'm not sure you can get support from HDP (if you have that). Matei On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have not seen a

HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-16 Thread Su She
Hello Everyone, I am encountering trouble running Spark applications when I shut down my EC2 instances. Everything else seems to work except Spark. When I try running a simple Spark application, like sc.parallelize(), I get the message that the HDFS NameNode is in safe mode. Has anyone else had this

Re: Spark SQL Custom Predicate Pushdown

2015-01-16 Thread Corey Nolet
Hao, Thanks so much for the links! This is exactly what I'm looking for. If I understand correctly, I can extend PrunedFilteredScan, PrunedScan, and TableScan and I should be able to support all the sql semantics? I'm a little confused about the Array[Filter] that is used with the Filtered scan.

Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Debasish Das
Hi Dib, For our use case I want my spark job1 to read from hdfs/cache and write to kafka queues. Similarly spark job2 should read from kafka queues and write to kafka queues. Is writing to kafka queues from a spark job supported in your code? Thanks Deb On Jan 15, 2015 11:21 PM, Akhil Das

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-16 Thread Antony Mayi
although this helped to improve it significantly I still run into this problem despite increasing the spark.yarn.executor.memoryOverhead vastly: export SPARK_EXECUTOR_MEMORY=24G spark.yarn.executor.memoryOverhead=6144 yet getting this: 2015-01-17 04:47:40,389 WARN

Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Dibyendu Bhattacharya
My code handles the Kafka Consumer part. But writing to Kafka may not be a big challenge; you can easily do it in your driver code. dibyendu On Sat, Jan 17, 2015 at 9:43 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Dib, For our use case I want my spark job1 to read from hdfs/cache

Re: Maven out of memory error

2015-01-16 Thread Ted Yu
I tried the following but still didn't see test output :-( diff --git a/pom.xml b/pom.xml index f4466e5..dae2ae8 100644 --- a/pom.xml +++ b/pom.xml @@ -1131,6 +1131,7 @@ <spark.driver.allowMultipleContexts>true</spark.driver.allowMultipleContexts> </systemProperties>

Re: Row similarities

2015-01-16 Thread Reza Zadeh
You can use K-means https://spark.apache.org/docs/latest/mllib-clustering.html with a suitably large k. Each cluster should correspond to rows that are similar to one another. On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: What's a good way to calculate
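
A minimal sketch of the k-means suggestion, with an arbitrary k; rows assigned to the same cluster can be treated as mutually similar:

    import org.apache.spark.mllib.clustering.KMeans
    // rows: RDD[Vector] of the matrix rows, assumed to exist
    val model = KMeans.train(rows, k = 50, maxIterations = 20)
    val byCluster = rows.map(v => (model.predict(v), v)).groupByKey()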

Futures timed out during unpersist

2015-01-16 Thread Kevin (Sangwoo) Kim
Hi experts, I got an error during unpersist RDD. Any ideas? java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at