) for the settings to take effect.
Allen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-on-Data-size-larger-than-Memory-size-tp6589p7435.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Aaron, thank you for your response and for clarifying things.
-Vibhor
On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson ilike...@gmail.com wrote:
There is no fundamental issue if you're running on data that is larger
than cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size.
Andrew,
Thank you. I'm using mapPartitions(), but, as you say, it requires that
every partition fit in memory. This will work for now but may not always,
so I was wondering about another way.
Thanks,
Roger
On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash and...@andrewash.com wrote:
Hi Roger,
Hi Aaron,
When you say that sorting is being worked on, can you elaborate a little
more please?
In particular, I want to sort the items within each partition (not
globally) without necessarily bringing them all into memory at once.
Thanks,
Roger
On Sat, May 31, 2014 at 11:10 PM, Aaron
I think it would be very handy to be able to specify that you want sorting
during a partitioning stage.
On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover roger.hoo...@gmail.com wrote:
Hi Aaron,
When you say that sorting is being worked on, can you elaborate a little
more please?
In particular, I want to sort the items within each partition (not
globally) without necessarily bringing them all into memory at once.
Hi Roger,
You should be able to sort within partitions using the rdd.mapPartitions()
method, and that shouldn't require holding all data in memory at once. It
does, however, require holding an entire partition in memory. Do you need
to avoid ever holding a whole partition in memory at once?
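The mapPartitions() pattern Andrew describes can be sketched in plain Python (not the actual Spark API; the helper names and sample partitions here are illustrative). Each partition is sorted independently, which is why only one partition at a time needs to fit in memory:

```python
# Plain-Python sketch of per-partition sorting a la rdd.mapPartitions().
# The function receives each partition as an iterator and returns a new
# iterator; sorted() is what materializes the partition in memory.

def map_partitions(partitions, fn):
    """Apply fn to each partition's iterator, like RDD.mapPartitions."""
    for part in partitions:
        yield list(fn(iter(part)))

def sort_within_partition(it):
    # sorted() pulls the whole partition into memory -- this is the
    # per-partition memory requirement discussed in the thread.
    return iter(sorted(it))

partitions = [[3, 1, 2], [9, 7, 8], [5, 4, 6]]
result = list(map_partitions(partitions, sort_within_partition))
print(result)  # each partition sorted; no global order across partitions
```

Note that the output is sorted within each partition but not globally, which matches what Roger is asking for.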
As far as
There is no fundamental issue if you're running on data that is larger than
cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size. Certain operations require
an entire *partition* (not the full dataset) to fit in memory, but there
are not many such operations.
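Aaron's point about streaming operations can be illustrated with a minimal generator pipeline in plain Python (synthetic data, not Spark): because records are produced and consumed one at a time, peak memory use is independent of input size.

```python
# Sketch of "streaming data through": a map-then-filter pipeline built
# from generators, processing one record at a time regardless of how
# many records the source produces.

def records(n):
    for i in range(n):
        yield i  # produced lazily; the full sequence never exists in memory

def pipeline(source):
    doubled = (x * 2 for x in source)          # map step
    kept = (x for x in doubled if x % 3 == 0)  # filter step
    return kept

# One million input records stream through in constant memory.
total = sum(pipeline(records(1_000_000)))
print(total)
```

The same shape applies to Spark's map/filter transformations, which is why dataset size alone is not a problem; only operations that must materialize a whole partition (such as sorting it) carry the per-partition memory requirement.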
Some inputs will be really helpful.
Thanks,
-Vibhor
On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote:
Hi all,
I am planning to use Spark with HBase, where I generate an RDD by reading
data from an HBase table.
I want to know: in the case where the size of the HBase table grows larger
than the RAM available in the cluster, will the application fail, or will
there only be an impact on performance?
Clearly there will be an impact on performance, but frankly it depends on
what you are trying to achieve with the dataset.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga
Hi all,
I am planning to use Spark with HBase, where I generate an RDD by reading
data from an HBase table.
I want to know: in the case where the size of the HBase table grows larger
than the RAM available in the cluster, will the application fail, or will
there only be an impact on performance?