Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to
work around them by setting the following JVM options:

-Dspark.akka.askTimeout=300
-Dspark.akka.timeout=300
-Dspark.worker.timeout=300

NOTE: these JVM options *must* be set on the worker nodes (and not just the
driver/master) for the settings to take effect.
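
For reference, a minimal sketch of the same timeouts applied through SparkConf
on the driver (the app name is illustrative); as noted above, this alone is
not enough -- the worker JVMs still need the -D options in their own launch
configuration:

import org.apache.spark.{SparkConf, SparkContext}

// Driver-side settings only; the worker JVMs must also be started with the
// corresponding -D options for the timeouts to take effect cluster-wide.
val conf = new SparkConf()
  .setAppName("timeout-tuning") // illustrative name
  .set("spark.akka.askTimeout", "300")
  .set("spark.akka.timeout", "300")
  .set("spark.worker.timeout", "300")
val sc = new SparkContext(conf)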

Allen







Re: Using Spark on Data size larger than Memory size

2014-06-10 Thread Allen Chang
Thanks for the clarification.

What is the proper way to configure RDDs when your aggregate data size
exceeds your available working memory size? In particular, in addition to
typical operations, I'm performing cogroups, joins, and coalesces/shuffles.

I see that the default storage level for RDDs is MEMORY_ONLY. Do I just need
to set the storage level for all of my RDDs to something like
MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior in
the presence of coalesces/shuffles, cogroups, and joins?
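
As a point of reference, a minimal sketch of the kind of setup being asked
about here (paths, key parsing, and partition counts are all illustrative):
persist the inputs with MEMORY_AND_DISK and give the wide operations a
generous partition count.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

// Sketch only: paths, delimiters, and partition counts are made up.
def cogroupLargerThanMemory(sc: SparkContext): Unit = {
  val left  = sc.textFile("hdfs:///data/left").map(line => (line.split(",")(0), line))
  val right = sc.textFile("hdfs:///data/right").map(line => (line.split(",")(0), line))

  // MEMORY_AND_DISK lets cached partitions that do not fit in RAM spill to
  // disk instead of being dropped and recomputed.
  left.persist(StorageLevel.MEMORY_AND_DISK)
  right.persist(StorageLevel.MEMORY_AND_DISK)

  // A generous partition count keeps each shuffle partition small.
  val grouped = left.cogroup(right, numPartitions = 512)
  grouped.saveAsTextFile("hdfs:///data/cogrouped")
}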

Thanks,
Allen





Re: Using Spark on Data size larger than Memory size

2014-06-07 Thread Vibhor Banga
Aaron, thank you for your response and for clarifying things.

-Vibhor


On Sun, Jun 1, 2014 at 11:40 AM, Aaron Davidson ilike...@gmail.com wrote:

 There is no fundamental issue if you're running on data that is larger
 than cluster memory size. Many operations can stream data through, and thus
 memory usage is independent of input data size. Certain operations require
 an entire *partition* (not dataset) to fit in memory, but there are not
 many instances of this left (sorting comes to mind, and this is being
 worked on).

 In general, one problem with Spark today is that you *can* OOM under
 certain configurations, and it's possible you'll need to change from the
 default configuration if you're doing very memory-intensive jobs.
 However, there are very few cases where Spark would simply fail as a matter
 of course -- for instance, you can always increase the number of
 partitions to decrease the size of any given one, or repartition data to
 eliminate skew.

 Regarding impact on performance, as Mayur said, there may absolutely be an
 impact depending on your jobs. If you're doing a join on a very large
 amount of data with few partitions, then we'll have to spill to disk. If
 you can't cache your working set of data in memory, you will also see a
 performance degradation. Spark enables the use of memory to make things
 fast, but if you just don't have enough memory, it won't be terribly fast.


 On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore






Re: Using Spark on Data size larger than Memory size

2014-06-06 Thread Roger Hoover
Andrew,

Thank you.  I'm using mapPartitions() but, as you say, it requires that
every partition fit in memory.  This will work for now but may not always
work, so I was wondering about another way.

Thanks,

Roger


On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Roger,

 You should be able to sort within partitions using the rdd.mapPartitions()
 method, and that shouldn't require holding all data in memory at once.  It
 does require holding the entire partition in memory though.  Do you need
 the partition to never be held in memory all at once?

 As for the work that Aaron mentioned, I think he might be referring to the
 discussion and code surrounding
 https://issues.apache.org/jira/browse/SPARK-983

 Cheers!
 Andrew


 On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover roger.hoo...@gmail.com
 wrote:

 I think it would be very handy to be able to specify that you want sorting
 during a partitioning stage.


 On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover roger.hoo...@gmail.com
 wrote:

 Hi Aaron,

 When you say that sorting is being worked on, can you elaborate a little
 more please?

 In particular, I want to sort the items within each partition (not
 globally) without necessarily bringing them all into memory at once.

 Thanks,

 Roger


 On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 There is no fundamental issue if you're running on data that is larger
 than cluster memory size. Many operations can stream data through, and thus
 memory usage is independent of input data size. Certain operations require
 an entire *partition* (not dataset) to fit in memory, but there are not
 many instances of this left (sorting comes to mind, and this is being
 worked on).

 In general, one problem with Spark today is that you *can* OOM under
 certain configurations, and it's possible you'll need to change from the
 default configuration if you're doing very memory-intensive jobs.
 However, there are very few cases where Spark would simply fail as a matter
 of course -- for instance, you can always increase the number of
 partitions to decrease the size of any given one, or repartition data to
 eliminate skew.

 Regarding impact on performance, as Mayur said, there may absolutely be
 an impact depending on your jobs. If you're doing a join on a very large
 amount of data with few partitions, then we'll have to spill to disk. If
 you can't cache your working set of data in memory, you will also see a
 performance degradation. Spark enables the use of memory to make things
 fast, but if you just don't have enough memory, it won't be terribly fast.


 On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore









Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
Hi Aaron,

When you say that sorting is being worked on, can you elaborate a little
more please?

In particular, I want to sort the items within each partition (not
globally) without necessarily bringing them all into memory at once.

Thanks,

Roger


On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson ilike...@gmail.com wrote:

 There is no fundamental issue if you're running on data that is larger
 than cluster memory size. Many operations can stream data through, and thus
 memory usage is independent of input data size. Certain operations require
 an entire *partition* (not dataset) to fit in memory, but there are not
 many instances of this left (sorting comes to mind, and this is being
 worked on).

 In general, one problem with Spark today is that you *can* OOM under
 certain configurations, and it's possible you'll need to change from the
 default configuration if you're doing very memory-intensive jobs.
 However, there are very few cases where Spark would simply fail as a matter
 of course -- for instance, you can always increase the number of
 partitions to decrease the size of any given one, or repartition data to
 eliminate skew.

 Regarding impact on performance, as Mayur said, there may absolutely be an
 impact depending on your jobs. If you're doing a join on a very large
 amount of data with few partitions, then we'll have to spill to disk. If
 you can't cache your working set of data in memory, you will also see a
 performance degradation. Spark enables the use of memory to make things
 fast, but if you just don't have enough memory, it won't be terribly fast.


 On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore






Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
I think it would be very handy to be able to specify that you want sorting
during a partitioning stage.
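
(Later Spark releases added exactly this as repartitionAndSortWithinPartitions
on pair RDDs -- it was not available when this thread was written. A minimal
sketch, assuming a key type with an ordering and illustrative sample data:)

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object PartitionAndSort {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-and-sort"))
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)), numSlices = 4)
    // A single shuffle both repartitions by key and sorts records by key
    // within each output partition, pushing the sort into the shuffle
    // machinery instead of user code.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))
    sorted.collect().foreach(println)
    sc.stop()
  }
}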


On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover roger.hoo...@gmail.com wrote:

 Hi Aaron,

 When you say that sorting is being worked on, can you elaborate a little
 more please?

 In particular, I want to sort the items within each partition (not
 globally) without necessarily bringing them all into memory at once.

 Thanks,

 Roger


 On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 There is no fundamental issue if you're running on data that is larger
 than cluster memory size. Many operations can stream data through, and thus
 memory usage is independent of input data size. Certain operations require
 an entire *partition* (not dataset) to fit in memory, but there are not
 many instances of this left (sorting comes to mind, and this is being
 worked on).

 In general, one problem with Spark today is that you *can* OOM under
 certain configurations, and it's possible you'll need to change from the
 default configuration if you're doing very memory-intensive jobs.
 However, there are very few cases where Spark would simply fail as a matter
 of course -- for instance, you can always increase the number of
 partitions to decrease the size of any given one, or repartition data to
 eliminate skew.

 Regarding impact on performance, as Mayur said, there may absolutely be
 an impact depending on your jobs. If you're doing a join on a very large
 amount of data with few partitions, then we'll have to spill to disk. If
 you can't cache your working set of data in memory, you will also see a
 performance degradation. Spark enables the use of memory to make things
 fast, but if you just don't have enough memory, it won't be terribly fast.


 On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore







Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
Hi Roger,

You should be able to sort within partitions using the rdd.mapPartitions()
method, and that shouldn't require holding all data in memory at once.  It
does require holding the entire partition in memory though.  Do you need
the partition to never be held in memory all at once?
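
Concretely, a minimal sketch of the in-partition sort described above (input
data and output path are illustrative); the toArray call is what materializes
the whole partition in memory:

import org.apache.spark.{SparkConf, SparkContext}

object SortWithinPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-within-partitions"))
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    // Sort each partition independently of the others. toArray pulls the
    // entire partition into memory, which is the limitation noted above.
    val sorted = rdd.mapPartitions(iter => iter.toArray.sorted.iterator)
    sorted.saveAsTextFile("/tmp/sorted-partitions") // illustrative output path
    sc.stop()
  }
}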

As for the work that Aaron mentioned, I think he might be referring to the
discussion and code surrounding
https://issues.apache.org/jira/browse/SPARK-983

Cheers!
Andrew


On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover roger.hoo...@gmail.com wrote:

 I think it would be very handy to be able to specify that you want sorting
 during a partitioning stage.


 On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover roger.hoo...@gmail.com
 wrote:

 Hi Aaron,

 When you say that sorting is being worked on, can you elaborate a little
 more please?

 In particular, I want to sort the items within each partition (not
 globally) without necessarily bringing them all into memory at once.

 Thanks,

 Roger


 On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 There is no fundamental issue if you're running on data that is larger
 than cluster memory size. Many operations can stream data through, and thus
 memory usage is independent of input data size. Certain operations require
 an entire *partition* (not dataset) to fit in memory, but there are not
 many instances of this left (sorting comes to mind, and this is being
 worked on).

 In general, one problem with Spark today is that you *can* OOM under
 certain configurations, and it's possible you'll need to change from the
 default configuration if you're doing very memory-intensive jobs.
 However, there are very few cases where Spark would simply fail as a matter
 of course -- for instance, you can always increase the number of
 partitions to decrease the size of any given one, or repartition data to
 eliminate skew.

 Regarding impact on performance, as Mayur said, there may absolutely be
 an impact depending on your jobs. If you're doing a join on a very large
 amount of data with few partitions, then we'll have to spill to disk. If
 you can't cache your working set of data in memory, you will also see a
 performance degradation. Spark enables the use of memory to make things
 fast, but if you just don't have enough memory, it won't be terribly fast.


 On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
  wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore








Re: Using Spark on Data size larger than Memory size

2014-06-01 Thread Aaron Davidson
There is no fundamental issue if you're running on data that is larger than
cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size. Certain operations require
an entire *partition* (not dataset) to fit in memory, but there are not
many instances of this left (sorting comes to mind, and this is being
worked on).

In general, one problem with Spark today is that you *can* OOM under
certain configurations, and it's possible you'll need to change from the
default configuration if you're doing very memory-intensive jobs.
However, there are very few cases where Spark would simply fail as a matter
of course -- for instance, you can always increase the number of
partitions to decrease the size of any given one, or repartition data to
eliminate skew.
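
In code, those two knobs look roughly like this (a sketch only -- paths, key
extraction, and partition counts are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

def joinWithManySmallPartitions(sc: SparkContext): Unit = {
  val users  = sc.textFile("hdfs:///data/users").map(l => (l.split("\t")(0), l))
  val events = sc.textFile("hdfs:///data/events").map(l => (l.split("\t")(0), l))

  // Spread a lopsided input evenly across more partitions before the join.
  val balancedEvents = events.repartition(1024)

  // Ask the join itself for many output partitions so each one stays small.
  val joined = users.join(balancedEvents, numPartitions = 1024)
  joined.saveAsTextFile("hdfs:///data/joined")
}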

Regarding impact on performance, as Mayur said, there may absolutely be an
impact depending on your jobs. If you're doing a join on a very large
amount of data with few partitions, then we'll have to spill to disk. If
you can't cache your working set of data in memory, you will also see a
performance degradation. Spark enables the use of memory to make things
fast, but if you just don't have enough memory, it won't be terribly fast.


On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore





Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Vibhor Banga
Some inputs will be really helpful.

Thanks,
-Vibhor


On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor
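
For reference, the usual way to build an RDD from an HBase table is Spark's
newAPIHadoopRDD with HBase's TableInputFormat; a minimal sketch (the table
name is illustrative, and this is the standard pattern rather than code from
this thread):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan"))
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // illustrative table name
    // Each record is a (row key, Result) pair scanned from the table; Spark
    // streams these through the count, so the table need not fit in memory.
    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(rows.count())
    sc.stop()
  }
}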




-- 
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore


Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on
what you are trying to achieve with the dataset.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case when the size of the HBase table grows larger
 than the size of the RAM available in the cluster, will the application
 fail, or will there be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore