Re: Comparative study
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the Hadoop ecosystem. Impala really shines when your (It was not built to replace Hive. It's purpose-built to make interactive use with a BI tool feasible -- single-digit-second queries on huge data sets. It's very memory hungry. Hive's architecture choices and legacy code have been throughput-oriented and can't really get below minutes at scale, but it remains the right choice when you are in fact doing ETL!)
Re: Comparative study
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: (It was not built to replace Hive. It's purpose-built to make interactive use with a BI tool feasible -- single-digit-second queries on huge data sets. It's very memory hungry. Hive's architecture choices and legacy code have been throughput-oriented and can't really get below minutes at scale, but it remains the right choice when you are in fact doing ETL!)
Re: Comparative study
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it is Scala only (it doesn't wrap a Java framework). All three have fairly similar APIs and aren't too different from Spark. For example, instead of RDD you have DList (distributed list) or PCollection (parallel collection) - or, in Scalding's case, Pipe, because Cascading had to get cute with its names. On Mon, Jul 7, 2014 at 8:12 PM, Sean Owen so...@cloudera.com wrote: On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote: For a Scala API on map/reduce (the Hadoop engine), there's a library called Scalding. It's built on top of Cascading. If you have a huge dataset, or if you're considering the map/reduce engine for your job for any reason, you can try Scalding. PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs on Spark too, not just M/R. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
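For readers comparing these APIs, here is a minimal Spark sketch of the kind of pipeline being discussed; the object name and input path are placeholders, not taken from the thread. The same shape would be expressed on a Pipe/TypedPipe in Scalding, a DList in Scoobi, or a PCollection in (S)Crunch.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local mode is enough to see the API shape being compared above.
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
    try {
      val counts = sc.textFile("input.txt") // hypothetical input path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1L))
        .reduceByKey(_ + _)
      counts.take(10).foreach(println)
    } finally {
      sc.stop()
    }
  }
}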
Re: Comparative study
I don't have those numbers off-hand, though the shuffle spill to disk was coming to several gigabytes per node, if I recall correctly. The MapReduce pipeline takes about 2-3 hours I think for the full 60-day data set. Spark chugs along fine for a while and then hangs. We restructured the flow a few times, but in the last iteration it was hanging when trying to save the feature profiles with just a couple of tasks remaining (those tasks ran for 10+ hours before we killed it). In a previous iteration we did get it to run through. We broke our flow into two parts though - first saving the raw profiles out to disk, then reading them back in for scoring. That was on just 10 days of data, by the way - one sixth of what the MapReduce flow normally runs through on the same cluster. I haven't tracked down the cause. YMMV

On Mon, Jul 7, 2014 at 8:14 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Daniel, Do you mind sharing the size of your cluster and the production data volumes? Thanks Soumya

On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development are also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell, and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.)

On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala, Spark vs MapReduce. Is it worth migrating from existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh

-- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy.
__ www.accenture.com

-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
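As a concrete illustration of the local-context testing pattern Daniel describes above, here is a minimal sketch assuming ScalaTest; the class name, transformation, and test data are hypothetical, not taken from the thread.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.FunSuite

class FlowSpec extends FunSuite {

  // The transformation under test; in a real project this would live in
  // production code and be reused unchanged by the cluster job and this test.
  def score(events: RDD[(String, Int)]): RDD[(String, Int)] =
    events.reduceByKey(_ + _).filter { case (_, total) => total > 1 }

  test("flow produces expected totals on a tiny in-memory data set") {
    val sc = new SparkContext("local[2]", "flow-test")
    try {
      val input = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))
      val result = score(input).collect().toMap
      assert(result === Map("a" -> 2))
    } finally {
      sc.stop()
    }
  }
}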
Re: Comparative study
When you say "large data sets", how large? Thanks

On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
Re: Comparative study
I'll respond for Dan. Our test dataset was a total of 10 GB of input data (the full production dataset for this particular dataflow would be roughly 60 GB). I'm not sure what the size of the final output data was, but I think it was on the order of 20 GB for the given 10 GB of input data. Also, I can say that when we were experimenting with persist(DISK_ONLY), the size of all RDDs on disk was around 200 GB, which gives a sense of overall transient memory usage with no persistence. In terms of our test cluster, we had 15 nodes. Each node had 24 cores and 2 workers each. Each executor got 14 GB of memory. -Suren

On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote: When you say "large data sets", how large? Thanks

-- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hira...@velos.io W: www.velos.io
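For context, this is roughly what the persist(DISK_ONLY) experiment looks like in code; the input path, key parsing, and settings below are illustrative assumptions, not taken from the actual flow.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskOnlyPersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("disk-only-experiment")
      .setMaster("local[*]") // on a real cluster the master comes from spark-submit
      .set("spark.executor.memory", "14g") // per-executor memory, as in the cluster described above

    val sc = new SparkContext(conf)

    val profiles = sc.textFile("hdfs:///data/profiles") // hypothetical input
      .map(line => (line.takeWhile(_ != ','), line))

    // DISK_ONLY writes each computed partition to local disk instead of
    // caching it on the JVM heap; summing the on-disk sizes of all persisted
    // RDDs is what produced the ~200 GB figure mentioned above.
    profiles.persist(StorageLevel.DISK_ONLY)

    println(profiles.count())
    sc.stop()
  }
}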
Re: Comparative study
I believe our full 60 days of data contains over ten million unique entities. Across 10 days I'm not sure, but it should be in the millions. I haven't verified that myself though. So that's the scale of the RDD we're writing to disk (each entry is entityId -> profile). I think it's hard to know how Spark will hold up without trying it yourself, on your own flow. Also, keep in mind this was with a Spark Standalone cluster - perhaps Mesos or YARN would hold up better.

On Tue, Jul 8, 2014 at 1:04 PM, Surendranauth Hiraman suren.hira...@velos.io wrote:
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
Re: Comparative study
It seems to me that you're not taking full advantage of the lazy evaluation, especially persisting to disk only. While it might be true that the cumulative size of the RDDs looks like it's 300GB, only a small portion of that should be resident at any one time. We've evaluated data sets much greater than 10GB in Spark using the Spark master and Spark with Yarn (cluster -- formerly standalone -- mode). Nice thing about using Yarn is that it reports the actual memory demand, not just the memory requested for driver and workers. Processing a 60GB data set through thousands of stages in a rather complex set of analytics and transformations consumed a total cluster resource (divided among all workers and driver) of only 9GB. We were somewhat startled at first by this result, thinking that it would be much greater, but realized that it is a consequence of Spark's lazy evaluation model. This is even with several intermediate computations being cached as input to multiple evaluation paths. Good luck. Kevin

On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
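A small sketch of the pattern Kevin describes: only the one intermediate RDD that feeds multiple downstream paths is cached, and nothing is computed until an action runs, so most of the lineage never needs to be fully resident at once. The input path and field layout are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-eval").setMaster("local[*]"))

    // These transformations only build a recipe; nothing is read or computed yet.
    val parsed = sc.textFile("events.log") // hypothetical input
      .map(_.split('\t'))
      .filter(_.length > 2)

    // Cache only the intermediate result that is input to multiple
    // evaluation paths; everything else streams through partition by partition.
    parsed.cache()

    val byUser = parsed.map(fields => (fields(0), 1L)).reduceByKey(_ + _)
    val byType = parsed.map(fields => (fields(1), 1L)).reduceByKey(_ + _)

    // Only these actions trigger execution.
    println(byUser.count())
    println(byType.count())

    sc.stop()
  }
}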
Re: Comparative study
To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time. We agree with what you're saying. Your results are what we were hoping for and expecting. :-) Unfortunately we still haven't gotten the flow to run end to end on this relatively small dataset. It must be something related to our cluster, standalone mode or our flow but as far as we can tell, we are not doing anything unusual. Did you do any custom configuration? Any advice would be appreciated. -Suren

On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com wrote:
Re: Comparative study
Nothing particularly custom. We've tested with small (4 node) development clusters, single-node pseudoclusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, Spark Yarn (client and cluster) modes, with total memory resources ranging from 4GB to 256GB+. K

On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
Re: Comparative study
How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow doesn't complete. We banged our heads against it for a few weeks. -Suren

On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com wrote:
Re: Comparative study
Also, our exact same flow but with 1 GB of input data completed fine. -Suren

On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote:
Re: Comparative study
We kind of hijacked Santosh's original thread, so apologies for that, and let me try to get back to Santosh's original question on Map/Reduce versus Spark. I would say it's worth migrating from M/R, with the following thoughts. Just my opinion, but I would summarize the latest emails in this thread as "Spark can scale to datasets in the 10s and 100s of GBs." I've seen some companies talk about TBs of data, but I'm unclear if that is for a single flow. At the same time, some folks (like my team) that I've seen on the user group have a lot of difficulty with the same-sized datasets, which points to either environmental issues (machines, cluster mode, etc.), the nature of the data or of the transforms/flow complexity (though Kevin's experience runs counter to the latter, which is very positive), or we are just doing something subtly wrong. My overall opinion right now is that Map/Reduce is easier to get working in general on very large, heterogeneous datasets, but the programming model for Spark is the right way to go and worth the effort. Libraries like Scoobi, Scrunch and Scalding (and their associated Java versions) provide a Spark-like wrapper around Map/Reduce, but my guess is that, since they are limited to Map/Reduce under the covers, they cannot do some of the optimizations that Spark can, such as collapsing several transforms into a single stage. In addition, my company believes that having batch, streaming and SQL (ad hoc querying) on a single platform has worthwhile benefits. We're still relatively new with Spark (a few months), so would also love to hear more from others in the community. -Suren

On Tue, Jul 8, 2014 at 2:17 PM, Surendranauth Hiraman suren.hira...@velos.io wrote:
Re: Comparative study
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Libraries like Scoobi, Scrunch and Scalding (and their associated Java versions) provide a Spark-like wrapper around Map/Reduce, but my guess is that, since they are limited to Map/Reduce under the covers, they cannot do some of the optimizations that Spark can, such as collapsing several transforms into a single stage. Just wanted to reiterate that this is not true. For example, (S)Crunch does optimizations of this sort too, and can execute on Spark.
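To make the "collapsing transforms into a single stage" point concrete, here is a small Spark sketch: the chained narrow transformations are pipelined into one stage, and the shuffle introduced by reduceByKey is the only stage boundary. The data and names are illustrative; as Sean notes, (S)Crunch does comparable planning whether it executes on MapReduce or on Spark.

import org.apache.spark.{SparkConf, SparkContext}

object StagePipeliningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stages").setMaster("local[*]"))

    val counts = sc.parallelize(1 to 1000000)
      .map(_ * 2)             // narrow
      .filter(_ % 3 == 0)     // narrow
      .map(n => (n % 10, 1L)) // narrow: these three run pipelined in a single stage
      .reduceByKey(_ + _)     // the shuffle here is the only stage boundary

    // The lineage (and where stage boundaries fall) can be inspected here.
    println(counts.toDebugString)
    println(counts.count())

    sc.stop()
  }
}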
Re: Comparative study
Not sure exactly what is happening, but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5 TB/node and stuffed my disks so full that the file system was close to breaking. We can definitely do a better job in Spark to make it output more meaningful diagnostics and be more robust with partitions of data that don't fit in memory, though. A lot of the work in the next few releases will be on that.

On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman suren.hira...@velos.io wrote:
Re: Comparative study
I think we're missing the point a bit. Everything was actually flowing through smoothly and in a reasonable time, until it reached the last two tasks (out of over a thousand in the final stage alone), at which point it just fell into a coma. Not so much as a cranky message in the logs. I don't know *why* that happened. Maybe it isn't the overall amount of data, but something I'm doing wrong with my flow. In any case, improvements to diagnostic info would probably be helpful. I look forward to the next release. :-)

On Tue, Jul 8, 2014 at 3:47 PM, Reynold Xin r...@databricks.com wrote:
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
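Not advice given in the thread, but one common first experiment when only the last couple of tasks in a stage drag on is to check how evenly records are spread across partitions and to repartition before the expensive final step. A sketch with synthetic, deliberately skewed data; all names and counts are made up.

import org.apache.spark.{SparkConf, SparkContext}

object SkewCheckSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-check").setMaster("local[*]"))

    // Stand-in for an entityId -> profile RDD, skewed toward one hot key.
    val profiles = sc.parallelize(1 to 100000, 8)
      .map(i => (if (i % 10 == 0) "hot" else s"id-$i", s"profile-$i"))
      .groupByKey()

    // How evenly are records spread across partitions?
    val sizes = profiles.mapPartitions(it => Iterator(it.size)).collect()
    println(s"partitions=${sizes.length} min=${sizes.min} max=${sizes.max}")

    // Spreading the data over more, smaller partitions before an expensive
    // final save is one way to see whether a few oversized tasks are the issue.
    profiles.repartition(64).saveAsTextFile("/tmp/profiles-out")

    sc.stop()
  }
}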
Re: Comparative study
"Not sure exactly what is happening, but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads." +1 @Reynold. Spark can handle big big data. There are known issues with informing the user about what went wrong and how to fix it that we're actively working on, but the first impulse when a job fails should be "what did I do wrong?" rather than "Spark can't handle this workload." Messaging is a huge part in making this clear -- things like a job hanging or an out-of-memory error can be very difficult to debug, and improving this is one of our highest priorities.

On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com wrote:
Re: Comparative study
Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. That is what has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be "what did I do wrong?" I believe that is what every person facing this issue has done, in reaching out to the user group repeatedly over the few months that I've been active here. I also know other companies (all experienced with large production datasets on other platforms) facing the same types of issues - flows that run on subsets of data but not the whole production set. So I think, as you are saying, it points to the need for further diagnostics. And maybe also some type of guidance on typical issues with different types of datasets (wide rows, narrow rows, etc.), flow topologies, etc.? Hard to tell where we are going wrong right now. We've tried many things over the course of 6 weeks or so. I tried to look for the professional services link on databricks.com but didn't find it. ;-) (jk). -Suren

On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson ilike...@gmail.com wrote:
Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala Spark vs MapREduce . Is it worth migrating from existing MR implementation to
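The persist(DISK_ONLY) experiment Suren mentions corresponds to only a small amount of code. The snippet below is a minimal sketch of that pattern against the Spark 1.0-era Scala API; the input path, the tab-separated parsing, the 400-partition count, and the 14g executor setting are illustrative assumptions, not details taken from the thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._            // pair-RDD operations in the 1.x API
import org.apache.spark.storage.StorageLevel

object DiskOnlyPersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("disk-only-persist-sketch")
      .set("spark.executor.memory", "14g")        // assumed; mirrors the 14 GB per executor described above

    val sc = new SparkContext(conf)

    // Hypothetical input; assumes at least two tab-separated fields per line.
    val records = sc.textFile("hdfs:///data/events")
      .map(_.split('\t'))
      .map(fields => (fields(0), fields.drop(1)))

    // Keep the intermediate result on local disk rather than in executor memory,
    // trading speed for a much smaller heap footprint.
    val profiles = records
      .groupByKey(400)                            // illustrative partition count
      .persist(StorageLevel.DISK_ONLY)

    println("distinct keys: " + profiles.count())
    profiles.saveAsTextFile("hdfs:///output/profiles")
    sc.stop()
  }
}

The storage level is the only part that matters here; everything else is scaffolding so the example stands on its own.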
Re: Comparative study
As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in-between, all using more or less straight Scala, is unarguable. But the experience of deploying Spark has been quite painful, mainly because of gaps between compile time and run time on the JVM: dependency conflicts, having to use uber jars, Spark's own uber jar which includes some very old libs, etc. What's more, there are very few resources available to help. Sometimes I've been able to get help via public sources, but, more often than not, it's been trial and error. Enough so that, despite Spark's unmistakable appeal, we are seriously considering dropping it entirely and just doing classical Hadoop. On 7/8/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be "what did I do wrong". I believe that is what every person facing this issue has done, in reaching out to the user group repeatedly over the course of the few months that I've been active here. I also know other companies (all experienced with large production datasets on other platforms) facing the same types of issues - flows that run on subsets of data but not the whole production set. So I think, as you are saying, it points to the need for further diagnostics. And maybe also some type of guidance on typical issues with different types of datasets (wide rows, narrow rows, etc.), flow topologies, etc.? Hard to tell where we are going wrong right now. We've tried many things over the course of 6 weeks or so. I tried to look for the professional services link on databricks.com but didn't find it. ;-) (jk). -Suren On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson ilike...@gmail.com wrote: Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. +1 @Reynold Spark can handle big big data. There are known issues with informing the user about what went wrong and how to fix it that we're actively working on, but the first impulse when a job fails should be "what did I do wrong" rather than "Spark can't handle this workload". Messaging is a huge part of making this clear -- getting things like a job hanging or an out-of-memory error can be very difficult to debug, and improving this is one of our highest priorities. On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com wrote: Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5 TB/node and stuffed my disks so full that the file system was close to breaking. We can definitely do a better job in Spark to make it output more meaningful diagnostics and be more robust with partitions of data that don't fit in memory, though. A lot of the work in the next few releases will be on that. On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I'll respond for Dan. Our test dataset was a total of 10 GB of input data (the full production dataset for this particular dataflow would be roughly 60 GB).
I'm not sure what the size of the final output data was but I think it was on the order of 20 GBs for the given 10 GB of input data. Also, I can say that when we were experimenting with persist(DISK_ONLY), the size of all RDDs on disk was around 200 GB, which gives a sense of overall transient memory usage with no persistence. In terms of our test cluster, we had 15 nodes. Each node had 24 cores and 2 workers each. Each executor got 14 GB of memory. -Suren On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote: When you say large data sets, how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe
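The deployment pain Robert describes (dependency conflicts, uber jars) is commonly worked around by marking Spark itself as a provided dependency and building only the application's own code and libraries into the assembly jar. The build.sbt below is a hedged sketch of that approach, not something prescribed in this thread; the versions, the joda-time example dependency, and the sbt-assembly plugin coordinates are assumptions to be adapted to your environment.

// build.sbt -- illustrative sketch only
name := "my-spark-flow"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided": compile against Spark but let the cluster supply it at run time,
  // so the assembly does not drag Spark's (older) transitive dependencies along.
  "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
  // application-only dependencies are bundled into the assembly as usual
  "joda-time" % "joda-time" % "2.3"
)

// project/plugins.sbt (assumed plugin version):
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Running sbt assembly and submitting the resulting jar with spark-submit keeps the compile-time and run-time classpaths closer together, though it does not eliminate every conflict.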
Re: Comparative study
Santosh, To add a bit more to what Nabeel said, Spark and Impala are very different tools. Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop ecosystem. Impala really shines when your entire dataset fits into memory and your processing can be expressed in terms of SQL. Paired with the column-oriented Parquet format, it can really scream with the right dataset. Spark also has a SQL layer (formerly Shark, now more tightly integrated with Spark), but at least for our dataset, Impala was faster. However, Spark has a fantastic and far more flexible programming model. As has been mentioned a few times in this thread, it has a better batch processing model than map/reduce, it can do stream processing, and in the newest release, it looks like it can even mix and match SQL queries. You do need to be more aware of memory issues than with map/reduce, since using more memory is one of the primary sources of Spark's speed, but with that caveat, it's a great technology. In our particular workflow, we're replacing map/reduce with Spark for our batch layer and using Impala for our query layer. Daniel, For what it's worth, we've had a bunch of hanging issues because the garbage collector seems to get out of control. The most effective technique has been to dramatically increase the numPartitions argument in our various groupBy and cogroup calls, which reduces the per-task memory requirements. We also reduced the memory used by the shuffler (spark.shuffle.memoryFraction) and turned off in-memory RDD caching (since we don't have any iterative algorithms). Finally, using Kryo delivered a huge performance and memory boost (even without registering any custom serializers). Keith On Tue, Jul 8, 2014 at 2:58 PM, Robert James srobertja...@gmail.com wrote: As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in-between, all using more or less straight Scala, is unarguable. But the experience of deploying Spark has been quite painful, mainly because of gaps between compile time and run time on the JVM: dependency conflicts, having to use uber jars, Spark's own uber jar which includes some very old libs, etc. What's more, there are very few resources available to help. Sometimes I've been able to get help via public sources, but, more often than not, it's been trial and error. Enough so that, despite Spark's unmistakable appeal, we are seriously considering dropping it entirely and just doing classical Hadoop. On 7/8/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be "what did I do wrong". I believe that is what every person facing this issue has done, in reaching out to the user group repeatedly over the course of the few months that I've been active here. I also know other companies (all experienced with large production datasets on other platforms) facing the same types of issues - flows that run on subsets of data but not the whole production set. So I think, as you are saying, it points to the need for further diagnostics.
And maybe also some type of guidance on typical issues with different types of datasets (wide rows, narrow rows, etc.), flow topologies, etc.? Hard to tell where we are going wrong right now. We've tried many things over the course of 6 weeks or so. I tried to look for the professional services link on databricks.com but didn't find it. ;-) (jk). -Suren On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson ilike...@gmail.com wrote: Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. +1 @Reynold Spark can handle big big data. There are known issues with informing the user about what went wrong and how to fix it that we're actively working on, but the first impulse when a job fails should be "what did I do wrong" rather than "Spark can't handle this workload". Messaging is a huge part of making this clear -- getting things like a job hanging or an out-of-memory error can be very difficult to debug, and improving this is one of our highest priorities. On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com wrote: Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of
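Keith's tuning advice above maps onto a handful of concrete knobs. The fragment below is a rough sketch of how they might be wired together against the Spark 1.x API; the partition count and memory fractions are placeholders, not the values Keith's team used, and the memory-fraction settings shown here were superseded by unified memory management in later Spark releases.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._            // pair-RDD operations in the 1.x API

object ShuffleTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      // Kryo serialization: smaller, faster shuffles even without custom registrations.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Shrink the shuffle buffers so tasks spill to disk earlier instead of blowing the heap...
      .set("spark.shuffle.memoryFraction", "0.1") // placeholder value
      // ...and give almost nothing to the RDD cache, since nothing is cached in this flow.
      .set("spark.storage.memoryFraction", "0.1") // placeholder value

    val sc = new SparkContext(conf)

    // Hypothetical tab-separated inputs standing in for the real datasets.
    val clicks = sc.textFile("hdfs:///data/clicks").map { l => val f = l.split('\t'); (f(0), f(1)) }
    val views  = sc.textFile("hdfs:///data/views").map { l => val f = l.split('\t'); (f(0), f(1)) }

    // Many small tasks instead of a few huge ones: an explicit partition count on
    // groupByKey/cogroup caps how much any single task must hold at once.
    val numPartitions = 2000                      // placeholder; tune to data volume
    val grouped = clicks.groupByKey(numPartitions)
    val joined  = clicks.cogroup(views, numPartitions)

    println("groups: " + grouped.count() + ", cogrouped keys: " + joined.count())
    sc.stop()
  }
}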
Re: Comparative study
From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala Spark vs MapREduce . Is it worth migrating from existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
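To make the local-context testing Daniel describes concrete, here is a minimal sketch. It uses plain assertions rather than any particular test framework, and the small word-count flow it exercises is invented for the example; the point is only that the same RDD code runs unchanged in a local context, in the shell, and on a cluster.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._            // pair-RDD operations in the 1.x API
import org.apache.spark.rdd.RDD

object LocalContextTestSketch {
  // The "production" flow under test, written purely against RDDs.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // local[2]: run inside this JVM with two worker threads -- no cluster required.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
    try {
      val input  = sc.parallelize(Seq("spark spark hadoop", "hadoop spark"))
      val counts = wordCount(input).collectAsMap()
      assert(counts("spark") == 3 && counts("hadoop") == 2)
      println("local-context test passed")
    } finally {
      sc.stop()
    }
  }
}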
RE: Comparative study
Thanks Daniel for sharing this info. Regards, Santosh Karthikeyan From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Tuesday, July 08, 2014 1:10 AM To: user@spark.apache.org Subject: Re: Comparative study From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala Spark vs MapREduce . Is it worth migrating from existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
Re: Comparative study
For Scala API on map/reduce (hadoop engine) there's a library called Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try Scalding. However, Spark vs Impala doesn't make sense to me. It should've really been Shark vs Impala. Both are SQL querying engines built on top of Spark and Hadoop (map/reduce engine) respectively. On Mon, Jul 7, 2014 at 4:06 PM, santosh.viswanat...@accenture.com wrote: Thanks Daniel for sharing this info. Regards, Santosh Karthikeyan *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io] *Sent:* Tuesday, July 08, 2014 1:10 AM *To:* user@spark.apache.org *Subject:* Re: Comparative study From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala Spark vs MapREduce . Is it worth migrating from existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
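For readers who have not seen Scalding, the canonical word-count job below (adapted from Scalding's own tutorial) shows what the Scala-on-map/reduce style looks like; the input and output paths are illustrative.

import com.twitter.scalding._

// A Cascading/Hadoop map-reduce job expressed through Scalding's fields-based API.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                               // one tuple per line, in field 'line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }                          // the map/reduce shuffle happens here
    .write(Tsv(args("output")))
}

// Run on a Hadoop cluster with something like:
//   hadoop jar my-assembly.jar com.twitter.scalding.Tool WordCountJob --hdfs \
//     --input hdfs:///data/in --output hdfs:///data/wordcounts

The shape is recognizably close to Spark's RDD transformations, even though execution happens as Hadoop map/reduce jobs.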
Re: Comparative study
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote: For Scala API on map/reduce (hadoop engine) there's a library called Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try Scalding. PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs on Spark too, not just M/R.
Re: Comparative study
Daniel, Do you mind sharing the size of your cluster and the production data volumes ? Thanks Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala Spark vs MapREduce . Is it worth migrating from existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io