Hi,
Is there a release that Spark over YARN is targeted for? I presume it's in
progress at the moment..
Please correct me if my info is outdated.
Thx
pranay
I am not sure if 32 partitions is a hard limit that you have.
Unless you have a strong reason to use only 32 partitions, please try
providing the second optional
argument (numPartitions) to the reduceByKey and sortByKey methods, which will
parallelize these reduce operations.
A number 3x the number of
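As a hedged sketch of that suggestion (it assumes an existing `pairs: RDD[(String, Int)]` and, say, a 32-core cluster, so 96 = 3 x 32; names and numbers are illustrative and untested here):

```scala
// Spread the shuffle across 96 reduce tasks instead of inheriting
// the input's 32 partitions (96 assumes a hypothetical 32 cores x 3).
val counts = pairs.reduceByKey((a, b) => a + b, 96)
val ordered = counts.sortByKey(true, 96)
```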
Hi All,
I am creating a model by calling train method of ALS.
val model = ALS.train(ratings, ...)
I need to persist this model. Use it from different clients, enable
clients to make predictions using this model. In other words, persist and
reload this model.
Any suggestions, please?
BR,
The model is serializable - so you should be able to write it out to disk
and load it up in another program.
See, e.g. - https://gist.github.com/ramn/5566596 (Note, I haven't tested
this particular example, but it looks alright).
Spark makes use of this type of Scala serialization (and Kryo, etc.)
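A minimal sketch of that approach using plain Java serialization (the Map here is just a hypothetical stand-in for the model object; I haven't tested this against a real MatrixFactorizationModel):

```scala
import java.io._

// A plain (Serializable) Map stands in for the trained model.
val model: Map[Int, Double] = Map(1 -> 0.5, 2 -> 1.5)

// Write any Serializable object to disk.
def saveObject(obj: AnyRef, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(obj) finally out.close()
}

// Read it back, casting to the expected type.
def loadObject[T](path: String): T = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[T] finally in.close()
}
```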
Thanks for your answer. But the problem is that I only want to sort the 32
partitions individually,
not the complete input. So yes, the output has to consist of 32 partitions,
each sorted.
Ceriel Jacobs
On 12/04/2013 06:30 PM, Ashish Rangole wrote:
I am not sure if 32 partitions is a hard
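That per-partition sort can be sketched like this (assuming `rdd: RDD[String]`; a sketch only, not tested):

```scala
// Sort inside each of the 32 partitions independently; no global shuffle,
// so the output is 32 partitions, each sorted within itself.
val perPartitionSorted = rdd.coalesce(32)
  .mapPartitions(iter => iter.toArray.sorted.iterator, preservesPartitioning = true)
```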
I thought to convert model to RDD and save to HDFS, and then load it.
I will try your method. Thanks a lot.
On Wed, Dec 4, 2013 at 7:41 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
The model is serializable - so you should be able to write it out to disk
and load it up in another program.
Ah, actually - I just remembered that the user and product features of the
model are RDDs, so - you might be better off saving those components to
HDFS and then at load time reading them back in and creating a new
MatrixFactorizationModel. Sorry for the confusion!
Note, the above solution only
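A hedged sketch of that component-wise approach (paths are illustrative, and it assumes the MatrixFactorizationModel constructor is accessible in your Spark version; untested):

```scala
// Save the two feature RDDs (paths are illustrative).
model.userFeatures.saveAsObjectFile("hdfs:///tmp/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///tmp/als/productFeatures")

// Later, in another program with a SparkContext `sc`:
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///tmp/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///tmp/als/productFeatures")
val reloaded = new org.apache.spark.mllib.recommendation.MatrixFactorizationModel(
  model.rank, userFeatures, productFeatures)
```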
Yes, check out the Shark paper for example:
https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
The numbers on that benchmark are for Shark.
Matei
On Dec 3, 2013, at 3:50 PM, Matt Cheah mch...@palantir.com wrote:
Hi everyone,
I notice the benchmark page for
Hi,
Was just curious. In Hadoop, you have the flexibility to choose your own
classes for SortComparator and GroupingComparator. I have figured out that
there are functions like sortByKey and reduceByKey.
But what if, I want to customize what part of key I want to use for sorting
and which part
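One hedged sketch of the Spark-side analogue: make the sort look at only part of a composite key via a custom implicit Ordering (the names here are hypothetical). An RDD keyed by `LogKey` would then pick this Ordering up in sortByKey; below it is demonstrated on a plain collection.

```scala
// Composite key; we sort by the timestamp part only, ignoring the user part
// (analogous to a Hadoop SortComparator that reads part of the key).
case class LogKey(user: String, ts: Long)

implicit val byTimestampOnly: Ordering[LogKey] = Ordering.by((k: LogKey) => k.ts)

// Plain-Scala demonstration of the ordering:
val keys = List(LogKey("b", 3L), LogKey("a", 1L), LogKey("c", 2L))
val sortedKeys = keys.sorted
```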
I'm reading the paper now, thanks. It states 100-node clusters were used. Is
it typical in the field to have 100-node clusters at the 1 TB scale? We were
expecting to use ~10 nodes.
I'm still pretty new to cluster computing, so just not sure how people have set
these up.
-Matt Cheah
These were EC2 clusters, so the machines were smaller than modern machines. You
can definitely have 1 TB datasets on 10 nodes too. Actually if you’re curious
about hardware configuration, take a look at
http://spark.incubator.apache.org/docs/latest/hardware-provisioning.html.
Also, regarding
Thanks for your answer.
The partitioning function is not that important. What is important is that I only
sort the partitions,
not the complete RDD. Your suggestion to use
rdd.distinct.coalesce(32).mapPartitions(p => sorted(p))
sounds nice, and I had indeed seen the coalesce method and the
Hello,
I have a few questions about configuring memory usage on standalone
clusters. Can someone help me out?
1) The terms slave in ./bin/start-slaves.sh and worker in the docs seem
to be used interchangeably. Are they the same?
2) On a worker/slave, is there only one JVM running that has all
It looks like OrderedRDDFunctions (
https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.OrderedRDDFunctions),
which defines sortByKey(), is constructed with an implicit Ordered[K], so you
could explicitly construct an OrderedRDDFunctions with your own Ordered.
You
I am trying to bring up a standalone cluster on Windows 8.1/Windows Server 2013
and I am having troubles getting the make-distribution script to complete
successfully. It seems to change a bunch of permissions in /dist and then tries
to write to it unsuccessfully. I assume this is not expected
Hey Adrian,
Ideally you shouldn’t use Cygwin to run on Windows — use the .cmd scripts we
provide instead. Cygwin might be made to work but we haven’t tried to do this
so far so it’s not supported. If you can fix it, that would of course be
welcome.
Also, the deploy scripts don’t work on
Hey Roman,
It looks like that pull request was never migrated to the Apache GitHub, but I
like the idea. If you migrate it over, we can merge in something like this. In
terms of the API, I’d just add an unpersist() method on each Broadcast object.
Matei
On Dec 3, 2013, at 6:00 AM, Roman
Hi Friends,
I am new to Spark and Spark streaming. I am trying to save a DStream to
file but can't figure out how to do it with provided methods on DStream
(saveAsTextFiles). Following is what I am trying to do
Eg. DStream of type DStream[(String, String)]
(file1.txt, msg_a), (file1.txt, msg_b),
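For reference, a hedged sketch of the built-in call being tried (note that, as I understand it, it writes one output directory per batch interval, named from the prefix and suffix, rather than one file per key; `stream` and the path are illustrative):

```scala
// stream: DStream[(String, String)] — produces directories like
// hdfs:///out/messages-<batchTimeMs>.txt, one per batch interval.
stream.saveAsTextFiles("hdfs:///out/messages", "txt")
```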
YARN alpha API support is already there. If you mean the YARN stable API in Hadoop
2.2, it will probably be in 0.8.1.
Best Regards,
Raymond Liu
From: Pranay Tonpay [mailto:pranay.ton...@impetus.co.in]
Sent: Thursday, December 05, 2013 12:53 AM
To: user@spark.incubator.apache.org
Subject: Spark over
Hi all,
This should work:
JavaPairRDD<ByteBuffer, SortedMap<ByteBuffer, IColumn>> casRdd =
context.newAPIHadoopRDD(job.getConfiguration(),
ColumnFamilyInputFormat.class.asSubclass(org.apache.hadoop.mapreduce.InputFormat.class),
ByteBuffer.class, SortedMap.class);
I have translated the word count