Spark over YARN

2013-12-04 Thread Pranay Tonpay
Hi, which release is Spark over YARN targeted for? I presume it's in progress at the moment. Please correct me if my info is outdated. Thanks, Pranay

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ashish Rangole
I am not sure if 32 partitions is a hard limit that you have. Unless you have a strong reason to use only 32 partitions, please try providing the second optional argument (numPartitions) to the reduceByKey and sortByKey methods, which will parallelize these reduce operations. A number around 3x the total number of cores in your cluster is a reasonable starting point.
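For illustration, a minimal sketch of this suggestion against 0.8-era APIs (the input path and partition count are examples only):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey/sortByKey

    val sc = new SparkContext("local[4]", "partitions-example")
    val pairs = sc.textFile("hdfs:///input").map(line => (line, 1))

    val numPartitions = 96  // illustrative: roughly 3x the total cores in the cluster
    val counts = pairs
      .reduceByKey(_ + _, numPartitions)  // run the reduce stage with numPartitions tasks
      .sortByKey(true, numPartitions)     // ascending sort at the same parallelism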

Persisting MatrixFactorizationModel

2013-12-04 Thread Aslan Bekirov
Hi All, I am creating a model by calling the train method of ALS: val model = ALS.train(ratings, ...). I need to persist this model, use it from different clients, and enable those clients to make predictions with it. In other words: persist and reload this model. Any suggestions, please? BR,

Re: Persisting MatrixFactorizationModel

2013-12-04 Thread Evan R. Sparks
The model is serializable, so you should be able to write it out to disk and load it up in another program. See, e.g., https://gist.github.com/ramn/5566596 (note: I haven't tested this particular example, but it looks alright). Spark makes use of this type of Scala (and Kryo, etc.) serialization internally.
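A minimal sketch of that approach with plain Java serialization (the helper names and path are illustrative; note the follow-up below for an important caveat about the RDDs inside the model):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Write the model out with standard Java serialization (assumes it is Serializable).
    def saveModel(model: MatrixFactorizationModel, path: String) {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try { out.writeObject(model) } finally { out.close() }
    }

    // Read it back, e.g. from a different program.
    def loadModel(path: String): MatrixFactorizationModel = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try { in.readObject().asInstanceOf[MatrixFactorizationModel] } finally { in.close() }
    }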

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ceriel Jacobs
Thanks for your answer. But the problem is that I only want to sort the 32 partitions individually, not the complete input. So yes, the output has to consist of 32 partitions, each sorted. Ceriel Jacobs

Re: Persisting MatrixFactorizationModel

2013-12-04 Thread Aslan Bekirov
I thought of converting the model to an RDD, saving it to HDFS, and then loading it. I will try your method. Thanks a lot.

Re: Persisting MatrixFactorizationModel

2013-12-04 Thread Evan R. Sparks
Ah, actually, I just remembered that the user and product features of the model are RDDs, so you might be better off saving those components to HDFS, reading them back in at load time, and creating a new MatrixFactorizationModel from them. Sorry for the confusion! Note, the above solution only works if the model fits comfortably on a single machine.
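A sketch of that revised approach, assuming the 0.8-era MatrixFactorizationModel constructor is accessible and that the rank is recorded out of band (the directory layout is illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Save the model's two feature RDDs to HDFS as object files.
    def save(model: MatrixFactorizationModel, dir: String) {
      model.userFeatures.saveAsObjectFile(dir + "/userFeatures")
      model.productFeatures.saveAsObjectFile(dir + "/productFeatures")
      // model.rank is a plain Int; store it wherever convenient (config, a small file, ...)
    }

    // Reload the components and rebuild the model around them.
    def load(sc: SparkContext, dir: String, rank: Int): MatrixFactorizationModel = {
      val userFeatures = sc.objectFile[(Int, Array[Double])](dir + "/userFeatures")
      val productFeatures = sc.objectFile[(Int, Array[Double])](dir + "/productFeatures")
      new MatrixFactorizationModel(rank, userFeatures, productFeatures)
    }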

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matei Zaharia
Yes, check out the Shark paper for example: https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/ The numbers on that benchmark are for Shark. Matei

Fwd: GroupingComparator in Spark.

2013-12-04 Thread Archit Thakur
Hi, I was just curious. In Hadoop, you have the flexibility to choose your own classes for the SortComparator and GroupingComparator. I have figured out that there are functions like sortByKey and reduceByKey. But what if I want to customize which part of the key is used for sorting and which part for grouping?

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matt Cheah
I'm reading the paper now, thanks. It states that 100-node clusters were used. Is it typical in the field to have 100-node clusters at the 1 TB scale? We were expecting to use ~10 nodes. I'm still pretty new to cluster computing, so I'm just not sure how people set these up. -Matt Cheah

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matei Zaharia
These were EC2 clusters, so the machines were smaller than modern machines. You can definitely have 1 TB datasets on 10 nodes too. Actually, if you're curious about hardware configuration, take a look at http://spark.incubator.apache.org/docs/latest/hardware-provisioning.html.

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ceriel Jacobs
Thanks for your answer. The partitioning function is not that important; what is important is that I only sort the partitions, not the complete RDD. Your suggestion to use rdd.distinct.coalesce(32).mapPartitions(p => sorted(p)) sounds nice, and I had indeed seen the coalesce method.
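Spelled out, the idea might look like the sketch below; note it materializes each partition to sort it, so this assumes individual partitions fit in worker memory:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "sort-partitions")
    val rdd = sc.textFile("hdfs:///input")

    // Sort each of the 32 partitions locally; no global sort of the whole RDD.
    val sortedPartitions = rdd
      .distinct()
      .coalesce(32)
      .mapPartitions(iter => iter.toArray.sorted.iterator)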

Memory configuration of standalone clusters

2013-12-04 Thread Andrew Ash
Hello, I have a few questions about configuring memory usage on standalone clusters. Can someone help me out? 1) The terms "slave" in ./bin/start-slaves.sh and "worker" in the docs seem to be used interchangeably. Are they the same? 2) On a worker/slave, is there only one JVM running?
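On the memory side, a sketch of the 0.8-era configuration path: SPARK_WORKER_MEMORY in conf/spark-env.sh caps what a worker can hand out to executors, and each application requests its share via the spark.executor.memory property, set before the context is created (values are examples):

    import org.apache.spark.SparkContext

    // Illustrative: ask for 4 GB per executor for this application.
    System.setProperty("spark.executor.memory", "4g")
    val sc = new SparkContext("spark://master:7077", "memory-config-example")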

Re: GroupingComparator in Spark.

2013-12-04 Thread Josh Rosen
It looks like OrderedRDDFunctions (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.OrderedRDDFunctions), which defines sortByKey(), is constructed with an implicit Ordered[K], so you could explicitly construct an OrderedRDDFunctions with your own Ordered.
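A sketch of that idea using a composite key; the key class and field names below are hypothetical. Sorting uses the compare() implementation, and Hadoop-style grouping on only part of the key can be had by re-keying before groupByKey:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Hypothetical composite key: sort by (primary, secondary), group by primary only.
    case class CompositeKey(primary: String, secondary: Long) extends Ordered[CompositeKey] {
      def compare(that: CompositeKey): Int = {
        val c = primary.compareTo(that.primary)          // the part of the key compared first
        if (c != 0) c else secondary.compare(that.secondary)
      }
    }

    val sc = new SparkContext("local[4]", "grouping-example")
    val records = sc.parallelize(Seq(
      (CompositeKey("a", 2L), "x"), (CompositeKey("a", 1L), "y"), (CompositeKey("b", 3L), "z")))

    val sorted = records.sortByKey()  // picks up the compare() above

    // Group on only part of the key by re-keying on that part first.
    val grouped = records.map { case (k, v) => (k.primary, (k.secondary, v)) }.groupByKey()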

Spark build/standalone on Windows 8.1

2013-12-04 Thread Adrian Bonar
I am trying to bring up a standalone cluster on Windows 8.1/Windows Server 2013 and I am having trouble getting the make-distribution script to complete successfully. It seems to change a bunch of permissions in /dist and then tries to write to it, unsuccessfully. I assume this is not expected behavior.

Re: Pre-built Spark for Windows 8.1

2013-12-04 Thread Matei Zaharia
Hey Adrian, ideally you shouldn't use Cygwin to run on Windows; use the .cmd scripts we provide instead. Cygwin might be made to work, but we haven't tried to do this so far, so it's not supported. If you can fix it, that would of course be welcome. Also, the deploy scripts don't work on Windows.

Re: Removing broadcasts

2013-12-04 Thread Matei Zaharia
Hey Roman, it looks like that pull request was never migrated to the Apache GitHub, but I like the idea. If you migrate it over, we can merge in something like this. In terms of the API, I'd just add an unpersist() method on each Broadcast object. Matei
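As a sketch, usage of the proposed API would presumably read like this (unpersist() on Broadcast did not exist at the time of this thread):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "broadcast-example")
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))  // read-only copy shipped to each node

    val resolved = sc.parallelize(Seq("a", "b", "c"))
      .map(x => lookup.value.getOrElse(x, 0))
      .collect()

    lookup.unpersist()  // the proposed call: release the broadcast's storage once done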

Question about saveAsTextFile on DStream

2013-12-04 Thread Parth Patil
Hi Friends, I am new to Spark and Spark Streaming. I am trying to save a DStream to files but can't figure out how to do it with the provided methods on DStream (saveAsTextFiles). Here is what I am trying to do: given a DStream of type DStream[(String, String)] containing (file1.txt, msg_a), (file1.txt, msg_b), and so on.
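For reference, two common approaches, sketched with 0.8-era streaming APIs (host, port, and paths are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[4]", "dstream-save", Seconds(10))
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(" ")(0), line))  // illustrative (fileName, message) pairs

    // Option 1: one output directory per batch, named prefix-TIME.suffix.
    pairs.saveAsTextFiles("hdfs:///out/messages", "txt")

    // Option 2: drop down to the RDD of each batch for full control over the write.
    pairs.foreach(rdd => rdd.saveAsTextFile("hdfs:///out/batch-" + System.currentTimeMillis))

    ssc.start()

Writing each key to its own named file (file1.txt, file2.txt, ...) is not something saveAsTextFiles does; that usually means option 2 combined with a custom Hadoop OutputFormat such as MultipleTextOutputFormat.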

RE: Spark over YARN

2013-12-04 Thread Liu, Raymond
YARN alpha API support is already there. If you mean the YARN stable API in Hadoop 2.2, it will probably be in 0.8.1. Best Regards, Raymond Liu

Re: Using Cassandra as an input stream from Java

2013-12-04 Thread Lucas Fernandes Brunialti
Hi all, this should work:

    JavaPairRDD<ByteBuffer, SortedMap<ByteBuffer, IColumn>> casRdd =
        context.newAPIHadoopRDD(
            job.getConfiguration(),
            ColumnFamilyInputFormat.class.asSubclass(org.apache.hadoop.mapreduce.InputFormat.class),
            ByteBuffer.class,
            SortedMap.class);

I have translated the word count example.