Re: Comparative study

2014-07-09 Thread Sean Owen
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:

  Impala is *not* built on map/reduce, though it was built to replace Hive,
 which is map/reduce based.  It has its own distributed query engine, though
 it does load data from HDFS, and is part of the hadoop ecosystem.  Impala
 really shines when your


(It was not built to replace Hive. It's purpose-built to make interactive
use with a BI tool feasible -- single-digit second queries on huge data
sets. It's very memory hungry. Hive's architecture choices and legacy code
have been throughput-oriented, and can't really get below minutes at scale,
but it remains a right choice when you are in fact doing ETL!)


Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point.  Shows how personal use cases color how we interpret products.


On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote:

 On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:

  Impala is *not* built on map/reduce, though it was built to replace
 Hive, which is map/reduce based.  It has its own distributed query engine,
 though it does load data from HDFS, and is part of the hadoop ecosystem.
  Impala really shines when your


 (It was not built to replace Hive. It's purpose-built to make interactive
 use with a BI tool feasible -- single-digit second queries on huge data
 sets. It's very memory hungry. Hive's architecture choices and legacy code
 have been throughput-oriented, and can't really get below minutes at scale,
 but, remains a right choice when you are in fact doing ETL!)



Re: Comparative study

2014-07-08 Thread Daniel Siegmann
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it
is only Scala (it doesn't wrap a Java framework). All three have fairly
similar APIs and aren't too different from Spark. For example, instead of
RDD you have DList (distributed list) or PCollection (parallel collection)
- or in Scalding's case, Pipe, because Cascading had to get cute with its
names.
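
For a concrete sense of that similarity, here is a minimal word-count sketch in Spark's Scala API (the paths and names are invented for illustration); the Scoobi DList, Crunch/Scrunch PCollection, and Scalding versions of the same flow read almost identically, method for method:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        // Local master for illustration only; a real job would point at a cluster.
        val sc = new SparkContext(new SparkConf().setAppName("word-count-sketch").setMaster("local[2]"))
        val counts = sc.textFile("input.txt")          // hypothetical input file
          .flatMap(_.split("\\s+"))                    // split each line into words
          .map(word => (word, 1))
          .reduceByKey(_ + _)                          // sum the counts per word
        counts.saveAsTextFile("word-counts")           // hypothetical output directory
        sc.stop()
      }
    }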


On Mon, Jul 7, 2014 at 8:12 PM, Sean Owen so...@cloudera.com wrote:

 On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote:

 For Scala API on map/reduce (hadoop engine) there's a library called
 Scalding. It's built on top of Cascading. If you have a huge dataset or
 if you consider using map/reduce engine for your job, for any reason, you
 can try Scalding.


 PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs
 on Spark too, not just M/R.





-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I don't have those numbers off-hand. Though the shuffle spill to disk was
coming to several gigabytes per node, if I recall correctly.

The MapReduce pipeline takes about 2-3 hours I think for the full 60 day
data set. Spark chugs along fine for a while and then hangs. We restructured
the flow a few times, but in the last iteration it was hanging when trying
to save the feature profiles with just a couple of tasks remaining (those
tasks ran for 10+ hours before we killed it). In a previous iteration we
did get it to run through. We broke our flow into two parts though - first
saving the raw profiles out to disk, then reading them back in for scoring.

That was on just 10 days of data, by the way - one sixth of what the
MapReduce flow normally runs through on the same cluster.

I haven't tracked down the cause. YMMV.
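
As a purely illustrative sketch of that save-then-reload split (the paths, the (entityId, profile) pairing, and the scoring step below are all invented, not the actual flow described here), the pattern looks something like this:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object TwoPhaseFlowSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "two-phase-flow-sketch")

        // Phase 1: build the raw profiles and write them out, ending the first lineage.
        val profiles = sc.textFile("events.txt")                    // hypothetical input
          .map(line => (line.split("\t")(0), line))                 // hypothetical (entityId, profile) pairing
          .reduceByKey(_ + " | " + _)
        profiles.saveAsObjectFile("raw-profiles")                   // hypothetical output path

        // Phase 2: read the saved profiles back and score them as a separate pass.
        val reloaded = sc.objectFile[(String, String)]("raw-profiles")
        val scores = reloaded.mapValues(profile => profile.length)  // stand-in scoring step
        scores.saveAsTextFile("profile-scores")
        sc.stop()
      }
    }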


On Mon, Jul 7, 2014 at 8:14 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:



 Daniel,

 Do you mind sharing the size of your cluster and the production data
 volumes ?

 Thanks
 Soumya

 On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io
 wrote:

 From a development perspective, I vastly prefer Spark to MapReduce. The
 MapReduce API is very constrained; Spark's API feels much more natural to
 me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.
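
As a minimal sketch of that testing pattern (the class, test name, and stand-in "flow" below are invented), something like the following works with ScalaTest's FunSuite:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.scalatest.FunSuite

    class FlowSpec extends FunSuite {
      test("flow sums values per key on a small in-memory dataset") {
        // A local context runs inside the test JVM; no cluster is needed.
        val sc = new SparkContext("local", "flow-test")
        try {
          val input = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 5)))
          val totals = input.reduceByKey(_ + _).collectAsMap()      // stand-in for the real flow
          assert(totals("a") == 3)
          assert(totals("b") == 5)
        } finally {
          sc.stop()                                                 // stop the context even if an assertion fails
        }
      }
    }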

 Unfortunately, the picture isn't so rosy when it gets to production. In my
 experience, Spark simply doesn't scale to the volumes that MapReduce will
 handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be
 better, but I haven't had the opportunity to try them. I find jobs tend to
 just hang forever for no apparent reason on large data sets (but smaller
 than what I push through MapReduce).

 I am hopeful the situation will improve - Spark is developing quickly -
 but if you have large amounts of data you should proceed with caution.

 Keep in mind there are some frameworks for Hadoop which can hide the ugly
 MapReduce with something very similar in form to Spark's API; e.g. Apache
 Crunch. So you might consider those as well.

 (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Comparative study

2014-07-08 Thread Kevin Markey

  
  
When you say "large data sets", how large?
Thanks

On 07/07/2014 01:39 PM, Daniel Siegmann
  wrote:


  

  From a development perspective, I vastly prefer Spark to
MapReduce. The MapReduce API is very constrained; Spark's
API feels much more natural to me. Testing and local
development is also very easy - creating a local Spark
context is trivial and it reads local files. For your unit
tests you can just have them create a local context and
execute your flow with some test data. Even better, you can
do ad-hoc work in the Spark shell and if you want that in
your production code it will look exactly the same.

  
  Unfortunately, the picture isn't so rosy when it gets to
production. In my experience, Spark simply doesn't scale to
the volumes that MapReduce will handle. Not with a
Standalone cluster anyway - maybe Mesos or YARN would be
better, but I haven't had the opportunity to try them. I
find jobs tend to just hang forever for no apparent reason
on large data sets (but smaller than what I push through
MapReduce).

  
  I am hopeful the situation will improve - Spark is
developing quickly - but if you have large amounts of data
you should proceed with caution.

  
  Keep in mind there are some frameworks for Hadoop which
can hide the ugly MapReduce with something very similar in
form to Spark's API; e.g. Apache Crunch. So you might
consider those as well.

  
  (Note: the above is with Spark 1.0.0.)
  
  
  

  
  

On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
  wrote:
  

  
Hello Experts,
 
I am doing some comparative study
  on the below:
 
Spark vs Impala
Spark vs MapREduce . Is it worth
  migrating from existing MR implementation to Spark?
 
 
Please share your thoughts and
  expertise.
 
 
Thanks,
  Santosh
  
  
  
  
This message is for the designated recipient only and
may contain privileged, proprietary, or otherwise
confidential information. If you have received it in
error, please notify the sender immediately and delete
the original. Any other use of the e-mail by you is
prohibited. Where allowed by local law, electronic
communications with Accenture and its affiliates,
including e-mail and instant messaging (including
content), may be scanned by our systems for the purposes
of information security and assessment of internal
compliance with Accenture policy. 
__

www.accenture.com
  

  




-- 

  Daniel
  Siegmann, Software Developer
Velos

  Accelerating
Machine Learning
  
  
440 NINTH AVENUE, 11TH FLOOR,
NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io

  


  



Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
I'll respond for Dan.

Our test dataset was a total of 10 GB of input data (full production
dataset for this particular dataflow would be 60 GB roughly).

I'm not sure what the size of the final output data was but I think it was
on the order of 20 GBs for the given 10 GB of input data. Also, I can say
that when we were experimenting with persist(DISK_ONLY), the size of all
RDDs on disk was around 200 GB, which gives a sense of overall transient
memory usage with no persistence.
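
For reference, persist(DISK_ONLY) refers to Spark's StorageLevel.DISK_ONLY. A minimal sketch of the general pattern (the input path and keying step are hypothetical, not the actual flow described here):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object DiskOnlyPersistSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "disk-only-sketch")
        val keyed = sc.textFile("raw-input.txt")                    // hypothetical input
          .map(line => (line.split("\t")(0), line))                 // hypothetical keying step
          .persist(StorageLevel.DISK_ONLY)                          // materialize partitions to local disk, not memory
        // The RDD is written to disk the first time an action touches it,
        // and later stages re-read it from disk instead of recomputing it.
        println(keyed.count())
        sc.stop()
      }
    }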

In terms of our test cluster, we had 15 nodes. Each node had 24 cores and 2
workers each. Each executor got 14 GB of memory.

-Suren



On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce. The
 MapReduce API is very constrained; Spark's API feels much more natural to
 me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production. In
 my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly -
 but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --
  Daniel Siegmann, Software Developer
 Velos
  Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io





-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io


Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I believe our full 60 days of data contains over ten million unique
entities. Across 10 days I'm not sure, but it should be in the millions. I
haven't verified that myself though. So that's the scale of the RDD we're
writing to disk (each entry is entityId -> profile).

I think it's hard to know how Spark will hold up without trying yourself,
on your own flow. Also, keep in mind this was with a Spark Standalone
cluster - perhaps Mesos or YARN would hold up better.


On Tue, Jul 8, 2014 at 1:04 PM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it was
 on the order of 20 GBs for the given 10 GB of input data. Also, I can say
 that when we were experimenting with persist(DISK_ONLY), the size of all
 RDDs on disk was around 200 GB, which gives a sense of overall transient
 memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
 2 workers each. Each executor got 14 GB of memory.

 -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce. The
 MapReduce API is very constrained; Spark's API feels much more natural to
 me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production. In
 my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly -
 but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --
  Daniel Siegmann, Software Developer
 Velos
  Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io





 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Comparative study

2014-07-08 Thread Kevin Markey

  
  
It seems to me that you're not taking full advantage of the lazy
evaluation, especially persisting to disk only.  While it might be true
that the cumulative size of the RDDs looks like it's 300GB, only a small
portion of that should be resident at any one time.  We've evaluated data
sets much greater than 10GB in Spark using the Spark master and Spark with
Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
is that it reports the actual memory *demand*, not just the memory
requested for driver and workers.  Processing a 60GB data set through
thousands of stages in a rather complex set of analytics and
transformations consumed a total cluster resource (divided among all
workers and driver) of only 9GB.  We were somewhat startled at first by
this result, thinking that it would be much greater, but realized that it
is a consequence of Spark's lazy evaluation model.  This is even with
several intermediate computations being cached as input to multiple
evaluation paths.
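
As a minimal sketch of the lazy evaluation being described (purely illustrative; the numbers and operations are made up), nothing below is computed until the final count():

    import org.apache.spark.SparkContext

    object LazyEvalSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "lazy-eval-sketch")
        val nums = sc.parallelize(1 to 1000000)
        // Transformations only record lineage; nothing is materialized yet.
        val squares = nums.map(n => n.toLong * n)
        val bigOnes = squares.filter(_ > 1000L)
        // Work happens only when an action runs, and the chained narrow transformations
        // are pipelined within a single stage, so intermediate results are streamed
        // through rather than all held resident at once.
        println(bigOnes.count())
        sc.stop()
      }
    }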

Good luck.

Kevin


On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
 2 workers each. Each executor got 14 GB of memory.

 -Suren


 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say "large data sets", how large?
  Thanks


  On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce. The
  MapReduce API is very constrained; Spark's API feels much more natural to
  me. Testing and local development is also very easy - creating a local
  Spark context is trivial and it reads local files. For your unit tests you
  can just have them create a local context and execute your flow with some
  test data. Even better, you can do ad-hoc work in the Spark shell and if
  you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production. In
  my experience, Spark simply doesn't scale to the volumes that MapReduce
  will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
  would be better, but I haven't had the opportunity to try them. I find jobs
  tend to just hang forever for no apparent reason on large data sets (but
  smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly -
  but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
  ugly MapReduce with something very similar in form to Spark's API; e.g.
  Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)


  On Mon, Jul 7, 2014 at

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
To clarify, we are not persisting to disk. That was just one of the
experiments we did because of some issues we had along the way.

At this time, we are NOT using persist but cannot get the flow to complete
in Standalone Cluster mode. We do not have a YARN-capable cluster at this
time.

We agree with what you're saying. Your results are what we were hoping for
and expecting. :-)  Unfortunately we still haven't gotten the flow to run
end to end on this relatively small dataset.

It must be something related to our cluster, standalone mode or our flow
but as far as we can tell, we are not doing anything unusual.

Did you do any custom configuration? Any advice would be appreciated.

-Suren




On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com
wrote:

  It seems to me that you're not taking full advantage of the lazy
 evaluation, especially persisting to disk only.  While it might be true
 that the cumulative size of the RDDs looks like it's 300GB, only a small
 portion of that should be resident at any one time.  We've evaluated data
 sets much greater than 10GB in Spark using the Spark master and Spark with
 Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
 is that it reports the actual memory *demand*, not just the memory
 requested for driver and workers.  Processing a 60GB data set through
 thousands of stages in a rather complex set of analytics and
 transformations consumed a total cluster resource (divided among all
 workers and driver) of only 9GB.  We were somewhat startled at first by
 this result, thinking that it would be much greater, but realized that it
 is a consequence of Spark's lazy evaluation model.  This is even with
 several intermediate computations being cached as input to multiple
 evaluation paths.

 Good luck.

 Kevin



 On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

 I'll respond for Dan.

  Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

  I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

  In terms of our test cluster, we had 15 nodes. Each node had 24 cores
 and 2 workers each. Each executor got 14 GB of memory.

  -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce. The
 MapReduce API is very constrained; Spark's API feels much more natural to
 me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production. In
 my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly -
 but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 

Re: Comparative study

2014-07-08 Thread Kevin Markey

  
  
Nothing particularly custom.  We've tested with small (4 node)
development clusters, single-node pseudoclusters, and bigger, using
plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark
master, Spark local, Spark Yarn (client and cluster) modes, with
total memory resources ranging from 4GB to 256GB+.

K


On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:

 To clarify, we are not persisting to disk. That was just one of the
 experiments we did because of some issues we had along the way.

 At this time, we are NOT using persist but cannot get the flow to complete
 in Standalone Cluster mode. We do not have a YARN-capable cluster at this
 time.

 We agree with what you're saying. Your results are what we were hoping for
 and expecting. :-)  Unfortunately we still haven't gotten the flow to run
 end to end on this relatively small dataset.

 It must be something related to our cluster, standalone mode or our flow
 but as far as we can tell, we are not doing anything unusual.

 Did you do any custom configuration? Any advice would be appreciated.

 -Suren


 On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  It seems to me that you're not taking full advantage of the lazy
  evaluation, especially persisting to disk only.  While it might be true
  that the cumulative size of the RDDs looks like it's 300GB, only a small
  portion of that should be resident at any one time.  We've evaluated data
  sets much greater than 10GB in Spark using the Spark master and Spark with
  Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
  is that it reports the actual memory *demand*, not just the memory
  requested for driver and workers.  Processing a 60GB data set through
  thousands of stages in a rather complex set of analytics and
  transformations consumed a total cluster resource (divided among all
  workers and driver) of only 9GB.  We were somewhat startled at first by
  this result, thinking that it would be much greater, but realized that it
  is a consequence of Spark's lazy evaluation model.  This is even with
  several intermediate computations being cached as input to multiple
  evaluation paths.

  Good luck.

  Kevin


  On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

   I'll respond for Dan.

   Our test dataset was a total of 10 GB of input data (full production
   dataset for this particular dataflow would be 60 GB roughly).

   I'm not sure what the size of the final output data was but I think it
   was on the order of 20 GBs for the given 10 GB of input data. Also, I can
   say that when we were experimenting with persist(DISK_ONLY), the size of
   all RDDs on disk was around 200 GB, which gives a sense of overall
   transient memory usage with no persistence.

   In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
   2 workers each. Each executor got 14 GB of memory.

   -Suren


   On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
   wrote:

    When you say "large data sets", how large?
    Thanks


    On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
How wide are the rows of data, either the raw input data or any generated
intermediate data?

We are at a loss as to why our flow doesn't complete. We banged our heads
against it for a few weeks.

-Suren



On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com
wrote:

  Nothing particularly custom.  We've tested with small (4 node)
 development clusters, single-node pseudoclusters, and bigger, using
 plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master,
 Spark local, Spark Yarn (client and cluster) modes, with total memory
 resources ranging from 4GB to 256GB+.

 K



 On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:

 To clarify, we are not persisting to disk. That was just one of the
 experiments we did because of some issues we had along the way.

  At this time, we are NOT using persist but cannot get the flow to
 complete in Standalone Cluster mode. We do not have a YARN-capable cluster
 at this time.

  We agree with what you're saying. Your results are what we were hoping
 for and expecting. :-)  Unfortunately we still haven't gotten the flow to
 run end to end on this relatively small dataset.

  It must be something related to our cluster, standalone mode or our flow
 but as far as we can tell, we are not doing anything unusual.

  Did you do any custom configuration? Any advice would be appreciated.

  -Suren




 On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  It seems to me that you're not taking full advantage of the lazy
 evaluation, especially persisting to disk only.  While it might be true
 that the cumulative size of the RDDs looks like it's 300GB, only a small
 portion of that should be resident at any one time.  We've evaluated data
 sets much greater than 10GB in Spark using the Spark master and Spark with
 Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
 is that it reports the actual memory *demand*, not just the memory
 requested for driver and workers.  Processing a 60GB data set through
 thousands of stages in a rather complex set of analytics and
 transformations consumed a total cluster resource (divided among all
 workers and driver) of only 9GB.  We were somewhat startled at first by
 this result, thinking that it would be much greater, but realized that it
 is a consequence of Spark's lazy evaluation model.  This is even with
 several intermediate computations being cached as input to multiple
 evaluation paths.

 Good luck.

 Kevin



 On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

 I'll respond for Dan.

  Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

  I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

  In terms of our test cluster, we had 15 nodes. Each node had 24 cores
 and 2 workers each. Each executor got 14 GB of memory.

  -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce.
 The MapReduce API is very constrained; Spark's API feels much more natural
 to me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production.
 In my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly
 - but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and 

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Also, our exact same flow but with 1 GB of input data completed fine.

-Suren


On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 How wide are the rows of data, either the raw input data or any generated
 intermediate data?

 We are at a loss as to why our flow doesn't complete. We banged our heads
 against it for a few weeks.

 -Suren



 On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  Nothing particularly custom.  We've tested with small (4 node)
 development clusters, single-node pseudoclusters, and bigger, using
 plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master,
 Spark local, Spark Yarn (client and cluster) modes, with total memory
 resources ranging from 4GB to 256GB+.

 K



 On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:

 To clarify, we are not persisting to disk. That was just one of the
 experiments we did because of some issues we had along the way.

  At this time, we are NOT using persist but cannot get the flow to
 complete in Standalone Cluster mode. We do not have a YARN-capable cluster
 at this time.

  We agree with what you're saying. Your results are what we were hoping
 for and expecting. :-)  Unfortunately we still haven't gotten the flow to
 run end to end on this relatively small dataset.

  It must be something related to our cluster, standalone mode or our
 flow but as far as we can tell, we are not doing anything unusual.

  Did you do any custom configuration? Any advice would be appreciated.

  -Suren




 On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  It seems to me that you're not taking full advantage of the lazy
 evaluation, especially persisting to disk only.  While it might be true
 that the cumulative size of the RDDs looks like it's 300GB, only a small
 portion of that should be resident at any one time.  We've evaluated data
 sets much greater than 10GB in Spark using the Spark master and Spark with
 Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
 is that it reports the actual memory *demand*, not just the memory
 requested for driver and workers.  Processing a 60GB data set through
 thousands of stages in a rather complex set of analytics and
 transformations consumed a total cluster resource (divided among all
 workers and driver) of only 9GB.  We were somewhat startled at first by
 this result, thinking that it would be much greater, but realized that it
 is a consequence of Spark's lazy evaluation model.  This is even with
 several intermediate computations being cached as input to multiple
 evaluation paths.

 Good luck.

 Kevin



 On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

 I'll respond for Dan.

  Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

  I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

  In terms of our test cluster, we had 15 nodes. Each node had 24 cores
 and 2 workers each. Each executor got 14 GB of memory.

  -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce.
 The MapReduce API is very constrained; Spark's API feels much more natural
 to me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production.
 In my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly
 - but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some 

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
We kind of hijacked Santosh's original thread, so apologies for that and let
me try to get back to Santosh's original question on Map/Reduce versus Spark.

I would say it's worth migrating from M/R, with the following thoughts.

Just my opinion, but I would summarize the latest emails in this thread as
"Spark can scale to datasets in the 10s and 100s of GBs." I've seen some
companies talk about TBs of data but I'm unclear if that is for a single
flow.

At the same time, some folks (like my team) that I've seen on the user
group have a lot of difficulty with the same-sized datasets, which points
to either environmental issues (machines, cluster mode, etc.), the nature of
the data, or the nature of the transforms/flow complexity (though Kevin's
experience runs counter to the latter, which is very positive) - or we are
just doing something subtly wrong.

My overall opinion right now is that Map/Reduce is easier to get working in
general on very large, heterogeneous datasets, but the programming model for
Spark is the right way to go and worth the effort.

Libraries like Scoobi, Scrunch and Scalding (and their associated Java
versions) provide a Spark-like wrapper around Map/Reduce but my guess is
that, since they are limited to Map/Reduce under the covers, they cannot do
some of the optimizations that Spark can, such as collapsing several
transforms into a single stage.
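
As a rough, hedged illustration of that kind of collapsing (the input path below is made up), Spark pipelines chains of narrow transformations into one stage and only a shuffle starts a new one, which is visible in an RDD's debug string:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object StagePipelineSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "stage-pipeline-sketch")
        val words = sc.textFile("input.txt")            // hypothetical input
          .flatMap(_.split("\\s+"))
          .map(w => (w, 1))
        // flatMap and map are narrow dependencies, so they run pipelined in one stage;
        // the reduceByKey below introduces the only shuffle boundary (a new stage).
        val counts = words.reduceByKey(_ + _)
        println(counts.toDebugString)                   // prints the lineage; shuffle boundaries are visible
        sc.stop()
      }
    }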

In addition, my company believes that having batch, streaming and SQL (ad
hoc querying) on a single platform has worthwhile benefits.

We're still relatively new with Spark (a few months), so would also love to
hear more from others in the community.

-Suren



On Tue, Jul 8, 2014 at 2:17 PM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Also, our exact same flow but with 1 GB of input data completed fine.

 -Suren


 On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 How wide are the rows of data, either the raw input data or any generated
 intermediate data?

 We are at a loss as to why our flow doesn't complete. We banged our heads
 against it for a few weeks.

 -Suren



 On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  Nothing particularly custom.  We've tested with small (4 node)
 development clusters, single-node pseudoclusters, and bigger, using
 plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master,
 Spark local, Spark Yarn (client and cluster) modes, with total memory
 resources ranging from 4GB to 256GB+.

 K



 On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:

 To clarify, we are not persisting to disk. That was just one of the
 experiments we did because of some issues we had along the way.

  At this time, we are NOT using persist but cannot get the flow to
 complete in Standalone Cluster mode. We do not have a YARN-capable cluster
 at this time.

  We agree with what you're saying. Your results are what we were hoping
 for and expecting. :-)  Unfortunately we still haven't gotten the flow to
 run end to end on this relatively small dataset.

  It must be something related to our cluster, standalone mode or our
 flow but as far as we can tell, we are not doing anything unusual.

  Did you do any custom configuration? Any advice would be appreciated.

  -Suren




 On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  It seems to me that you're not taking full advantage of the lazy
 evaluation, especially persisting to disk only.  While it might be true
 that the cumulative size of the RDDs looks like it's 300GB, only a small
 portion of that should be resident at any one time.  We've evaluated data
 sets much greater than 10GB in Spark using the Spark master and Spark with
 Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
 is that it reports the actual memory *demand*, not just the memory
 requested for driver and workers.  Processing a 60GB data set through
 thousands of stages in a rather complex set of analytics and
 transformations consumed a total cluster resource (divided among all
 workers and driver) of only 9GB.  We were somewhat startled at first by
 this result, thinking that it would be much greater, but realized that it
 is a consequence of Spark's lazy evaluation model.  This is even with
 several intermediate computations being cached as input to multiple
 evaluation paths.

 Good luck.

 Kevin



 On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

 I'll respond for Dan.

  Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

  I'm not sure what the size of the final output data was but I think
 it was on the order of 20 GBs for the given 10 GB of input data. Also, I
 can say that when we were experimenting with persist(DISK_ONLY), the size
 of all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

  In terms of our test cluster, we had 15 nodes. 

Re: Comparative study

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Libraries like Scoobi, Scrunch and Scalding (and their associated Java
 versions) provide a Spark-like wrapper around Map/Reduce but my guess is
 that, since they are limited to Map/Reduce under the covers, they cannot do
 some of the optimizations that Spark can, such as collapsing several
 transforms into a single stage.


Just wanted to reiterate that this is not true. For example (S)Crunch does
optimizations of this sort too, and can execute on Spark.


Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening but perhaps there are ways to
restructure your program for it to work better. Spark is definitely able to
handle much, much larger workloads.

I've personally run a workload that shuffled 300 TB of data. I've also run
something that shuffled 5TB/node and stuffed my disks fairly full, to the
point that the file system was close to breaking.

We can definitely do a better job in Spark of producing more meaningful
diagnostics and being more robust with partitions of data that don't fit
in memory, though. A lot of the work in the next few releases will be on
that.



On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it was
 on the order of 20 GBs for the given 10 GB of input data. Also, I can say
 that when we were experimenting with persist(DISK_ONLY), the size of all
 RDDs on disk was around 200 GB, which gives a sense of overall transient
 memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
 2 workers each. Each executor got 14 GB of memory.

 -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce. The
 MapReduce API is very constrained; Spark's API feels much more natural to
 me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production. In
 my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly -
 but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --
  Daniel Siegmann, Software Developer
 Velos
  Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io





 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io




Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I think we're missing the point a bit. Everything was actually flowing
through smoothly and in a reasonable time. Until it reached the last two
tasks (out of over a thousand in the final stage alone), at which point it
just fell into a coma. Not so much as a cranky message in the logs.

I don't know *why* that happened. Maybe it isn't the overall amount of
data, but something I'm doing wrong with my flow. In any case, improvements
to diagnostic info would probably be helpful.

I look forward to the next release. :-)


On Tue, Jul 8, 2014 at 3:47 PM, Reynold Xin r...@databricks.com wrote:

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able to
 handle much, much larger workloads.

 I've personally run a workload that shuffled 300 TB of data. I've also ran
 something that shuffled 5TB/node and stuffed my disks fairly full that the
 file system is close to breaking.

 We can definitely do a better job in Spark to make it output more
 meaningful diagnosis and more robust with partitions of data that don't fit
 in memory though. A lot of the work in the next few releases will be on
 that.



 On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
 2 workers each. Each executor got 14 GB of memory.

 -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce.
 The MapReduce API is very constrained; Spark's API feels much more natural
 to me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production.
 In my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly
 - but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapREduce . Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --
  Daniel Siegmann, Software Developer
 Velos
  Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io





 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io





-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW 

Re: Comparative study

2014-07-08 Thread Aaron Davidson

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able to
 handle much, much larger workloads.


+1 @Reynold

Spark can handle big big data. There are known issues with informing the
user about what went wrong and how to fix it that we're actively working
on, but the first impulse when a job fails should be "what did I do wrong?"
rather than "Spark can't handle this workload." Messaging is a huge part in
making this clear -- getting things like a job hanging or an out-of-memory
error can be very difficult to debug, and improving this is one of our
highest priorities.


On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com wrote:

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able to
 handle much, much larger workloads.

 I've personally run a workload that shuffled 300 TB of data. I've also ran
 something that shuffled 5TB/node and stuffed my disks fairly full that the
 file system is close to breaking.

 We can definitely do a better job in Spark to make it output more
 meaningful diagnosis and more robust with partitions of data that don't fit
 in memory though. A lot of the work in the next few releases will be on
 that.



 On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
 2 workers each. Each executor got 14 GB of memory.

 -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce.
 The MapReduce API is very constrained; Spark's API feels much more natural
 to me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production.
 In my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly
 - but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapReduce. Is it worth migrating from existing MR
 implementation to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --
  Daniel Siegmann, Software Developer
 Velos
  Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io





 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Aaron,

I don't think anyone was saying Spark can't handle this data size, given
testimony from the Spark team, Bizo, etc., on large datasets. This has kept
us trying different things to get our flow to work over the course of
several weeks.

Agreed that the first instinct should be "what did I do wrong?"

I believe that is what every person facing this issue has done, in reaching
out to the user group repeatedly over the course of the few months that
I've been active here. I also know other companies (all experienced with
large production datasets on other platforms) facing the same types of
issues - flows that run on subsets of data but not the whole production set.

So I think, as you are saying, it points to the need for further
diagnostics. And maybe also some type of guidance on typical issues with
different types of datasets (wide rows, narrow rows, etc.), flow
topologies, etc.? Hard to tell where we are going wrong right now. We've
tried many things over the course of 6 weeks or so.

I tried to look for the professional services link on databricks.com but
didn't find it. ;-) (jk).

-Suren



On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson ilike...@gmail.com wrote:

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able to
 handle much, much larger workloads.


 +1 @Reynold

 Spark can handle big big data. There are known issues with informing the
 user about what went wrong and how to fix it that we're actively working
 on, but the first impulse when a job fails should be "what did I do wrong?"
 rather than "Spark can't handle this workload." Messaging is a huge part of
 making this clear -- things like a job hanging or an out-of-memory error
 can be very difficult to debug, and improving this is one of our highest
 priorities.


 On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com wrote:

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able to
 handle much, much larger workloads.

 I've personally run a workload that shuffled 300 TB of data. I've also run
 something that shuffled 5TB/node and stuffed my disks so full that the
 file system was close to breaking.

 We can definitely do a better job in Spark of outputting more meaningful
 diagnostics and of being more robust with partitions of data that don't
 fit in memory, though. A lot of the work in the next few releases will be
 on that.



 On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I can
 say that when we were experimenting with persist(DISK_ONLY), the size of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores
 and ran 2 workers. Each executor got 14 GB of memory.

 -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce.
 The MapReduce API is very constrained; Spark's API feels much more natural
 to me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production.
 In my experience, Spark simply doesn't scale to the volumes that MapReduce
 will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
 would be better, but I haven't had the opportunity to try them. I find jobs
 tend to just hang forever for no apparent reason on large data sets (but
 smaller than what I push through MapReduce).

  I am hopeful the situation will improve - Spark is developing quickly
 - but if you have large amounts of data you should proceed with caution.

  Keep in mind there are some frameworks for Hadoop which can hide the
 ugly MapReduce with something very similar in form to Spark's API; e.g.
 Apache Crunch. So you might consider those as well.

  (Note: the above is with Spark 1.0.0.)



 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapReduce. Is it worth migrating from existing MR
 implementation to 

Re: Comparative study

2014-07-08 Thread Robert James
As a new user, I can definitely say that my experience with Spark has
been rather raw.  The appeal of interactive, batch, and in between all
using more or less straight Scala is unarguable.  But the experience
of deploying Spark has been quite painful, mainly due to gaps between
compile time and run time on the JVM: dependency conflicts, having to use
uber jars, Spark's own uber jar (which includes some very old libs), and
so on.
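
A common mitigation is to mark Spark and Hadoop as "provided" in the build
so the cluster's own jars win at run time and the application uber jar
stays small. Roughly (a minimal sketch; names and versions are placeholders,
not a recommendation):

    // build.sbt -- minimal sketch; artifact versions are illustrative
    name := "my-spark-job"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      // compile against these, but let the cluster supply them at run time
      "org.apache.spark"  %% "spark-core"    % "1.0.0" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
    )

Conflicting transitive libraries still have to be excluded or shaded by
hand, but at least the Spark and Hadoop versions stop fighting each other.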

What's more, there are very few resources available to help.  Sometimes
I've been able to get help via public sources, but, more often than not,
it's been trial and error.  Enough that, despite Spark's unmistakable
appeal, we are seriously considering dropping it entirely and just doing
classical Hadoop.

On 7/8/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
 Aaron,

 I don't think anyone was saying Spark can't handle this data size, given
 testimony from the Spark team, Bizo, etc., on large datasets. This has kept
 us trying different things to get our flow to work over the course of
 several weeks.

 Agreed that the first instinct should be "what did I do wrong?"

 I believe that is what every person facing this issue has done, in reaching
 out to the user group repeatedly over the course of the few months that
 I've been active here. I also know other companies (all experienced with
 large production datasets on other platforms) facing the same types of
 issues - flows that run on subsets of data but not the whole production
 set.

 So I think, as you are saying, it points to the need for further
 diagnostics. And maybe also some type of guidance on typical issues with
 different types of datasets (wide rows, narrow rows, etc.), flow
 topologies, etc.? Hard to tell where we are going wrong right now. We've
 tried many things over the course of 6 weeks or so.

 I tried to look for the professional services link on databricks.com but
 didn't find it. ;-) (jk).

 -Suren



 On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson ilike...@gmail.com wrote:

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able
 to
 handle much, much larger workloads.


 +1 @Reynold

 Spark can handle big big data. There are known issues with informing the
 user about what went wrong and how to fix it that we're actively working
 on, but the first impulse when a job fails should be "what did I do wrong?"
 rather than "Spark can't handle this workload." Messaging is a huge part of
 making this clear -- things like a job hanging or an out-of-memory error
 can be very difficult to debug, and improving this is one of our highest
 priorities.


 On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com wrote:

 Not sure exactly what is happening but perhaps there are ways to
 restructure your program for it to work better. Spark is definitely able
 to
 handle much, much larger workloads.

 I've personally run a workload that shuffled 300 TB of data. I've also run
 something that shuffled 5TB/node and stuffed my disks so full that the
 file system was close to breaking.

 We can definitely do a better job in Spark of outputting more meaningful
 diagnostics and of being more robust with partitions of data that don't
 fit in memory, though. A lot of the work in the next few releases will be
 on that.



 On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 I'll respond for Dan.

 Our test dataset was a total of 10 GB of input data (full production
 dataset for this particular dataflow would be 60 GB roughly).

 I'm not sure what the size of the final output data was but I think it
 was on the order of 20 GBs for the given 10 GB of input data. Also, I
 can
 say that when we were experimenting with persist(DISK_ONLY), the size
 of
 all RDDs on disk was around 200 GB, which gives a sense of overall
 transient memory usage with no persistence.

 In terms of our test cluster, we had 15 nodes. Each node had 24 cores
 and ran 2 workers. Each executor got 14 GB of memory.

 -Suren



 On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  When you say large data sets, how large?
 Thanks


 On 07/07/2014 01:39 PM, Daniel Siegmann wrote:

  From a development perspective, I vastly prefer Spark to MapReduce.
 The MapReduce API is very constrained; Spark's API feels much more
 natural
 to me. Testing and local development is also very easy - creating a
 local
 Spark context is trivial and it reads local files. For your unit tests
 you
 can just have them create a local context and execute your flow with
 some
 test data. Even better, you can do ad-hoc work in the Spark shell and
 if
 you want that in your production code it will look exactly the same.

  Unfortunately, the picture isn't so rosy when it gets to production.
 In my experience, Spark simply doesn't scale to the volumes that
 MapReduce
 will handle. Not with a Standalone cluster anyway - maybe 

Re: Comparative study

2014-07-08 Thread Keith Simmons
Santosh,

To add a bit more to what Nabeel said, Spark and Impala are very different
tools.  Impala is *not* built on map/reduce, though it was built to replace
Hive, which is map/reduce based.  It has its own distributed query engine,
though it does load data from HDFS, and is part of the hadoop ecosystem.
Impala really shines when your entire dataset fits into memory and your
processing can be expressed in terms of SQL.  Paired with the
column-oriented Parquet format, it can really scream with the right dataset.

Spark also has a SQL layer (formerly Shark, now more tightly integrated with
Spark), but at least for our dataset, Impala was faster.  However, Spark
has a fantastic and far more flexible programming model.  As has been
mentioned a few times in this thread, it has a better batch processing
model than map/reduce, it can do stream processing, and in the newest
release, it looks like it can even mix and match SQL queries.  You do need
to be more aware of memory issues than with map/reduce, since using more
memory is one of the primary sources of Spark's speed, but with that
caveat, it's a great technology.  In our particular workflow, we're
replacing map/reduce with Spark for our batch layer and using Impala for
our query layer.
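
To make the mixing concrete, here is a rough sketch against the Spark
1.0-era SQL API (SQLContext, createSchemaRDD, registerAsTable); the
Purchase type and the numbers are made up, so treat it as an illustration
rather than exact syntax:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Purchase(user: String, amount: Double)

    object SqlMixSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext(
          new SparkConf().setAppName("sql-mix").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD of case classes -> SchemaRDD

        // an ordinary RDD built with the regular programming model
        val purchases = sc.parallelize(Seq(
          Purchase("a", 1.0), Purchase("b", 2.5), Purchase("a", 4.0)))
        purchases.registerAsTable("purchases")

        // the SQL result is itself an RDD, so it can feed further transformations
        val totals = sqlContext
          .sql("SELECT user, SUM(amount) FROM purchases GROUP BY user")
          .map(row => (row(0), row(1)))
          .collect()

        totals.foreach(println)
        sc.stop()
      }
    }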

Daniel,

For what it's worth, we've had a bunch of hanging issues because the
garbage collector seems to get out of control.  The most effective
technique has been to dramatically increase the numPartitions argument in
our various groupBy and cogroup calls, which reduces the per-task memory
requirements.  We also reduced the memory used by the shuffler
(spark.shuffle.memoryFraction) and turned off RDD cache memory (since we
don't have any iterative algorithms).  Finally, using Kryo delivered a huge
performance and memory boost (even without registering any custom
serializers).
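
In code, those knobs look roughly like this (a sketch only -- the RDDs,
partition counts, and fraction values are made up and need tuning per job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on 1.0)

    object TuningSketch {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("tuning-sketch")
          .setMaster("local[2]")
          // shrink the shuffle buffers so aggregations spill earlier instead of OOMing
          .set("spark.shuffle.memoryFraction", "0.2")
          // give almost nothing to the RDD cache when nothing is persisted
          .set("spark.storage.memoryFraction", "0.1")
          // Kryo helps even without registering custom serializers
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(conf)

        val left  = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
        val right = sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toString))

        // pass an explicit (large) partition count so each task holds less state
        val co      = left.cogroup(right, 200)
        val grouped = left.groupByKey(200)

        println(co.count() + " / " + grouped.count())
        sc.stop()
      }
    }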

Keith




On Tue, Jul 8, 2014 at 2:58 PM, Robert James srobertja...@gmail.com wrote:

 As a new user, I can definitely say that my experience with Spark has
 been rather raw.  The appeal of interactive, batch, and in between all
 using more or less straight Scala is unarguable.  But the experience
 of deploying Spark has been quite painful, mainly due to gaps between
 compile time and run time on the JVM: dependency conflicts, having to use
 uber jars, Spark's own uber jar (which includes some very old libs), and
 so on.

 What's more, there are very few resources available to help.  Sometimes
 I've been able to get help via public sources, but, more often than not,
 it's been trial and error.  Enough that, despite Spark's unmistakable
 appeal, we are seriously considering dropping it entirely and just doing
 classical Hadoop.

 On 7/8/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Aaron,
 
  I don't think anyone was saying Spark can't handle this data size, given
  testimony from the Spark team, Bizo, etc., on large datasets. This has
 kept
  us trying different things to get our flow to work over the course of
  several weeks.
 
  Agreed that the first instinct should be "what did I do wrong?"
 
  I believe that is what every person facing this issue has done, in
  reaching out to the user group repeatedly over the course of the few
  months that I've been active here. I also know other companies (all
  experienced with large production datasets on other platforms) facing the
  same types of issues - flows that run on subsets of data but not the
  whole production set.
 
  So I think, as you are saying, it points to the need for further
  diagnostics. And maybe also some type of guidance on typical issues with
  different types of datasets (wide rows, narrow rows, etc.), flow
  topologies, etc.? Hard to tell where we are going wrong right now. We've
  tried many things over the course of 6 weeks or so.
 
  I tried to look for the professional services link on databricks.com but
  didn't find it. ;-) (jk).
 
  -Suren
 
 
 
  On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson ilike...@gmail.com
 wrote:
 
  Not sure exactly what is happening but perhaps there are ways to
  restructure your program for it to work better. Spark is definitely
 able
  to
  handle much, much larger workloads.
 
 
  +1 @Reynold
 
  Spark can handle big big data. There are known issues with informing the
  user about what went wrong and how to fix it that we're actively working
  on, but the first impulse when a job fails should be "what did I do
  wrong?" rather than "Spark can't handle this workload." Messaging is a
  huge part of making this clear -- things like a job hanging or an
  out-of-memory error can be very difficult to debug, and improving this is
  one of our highest priorities.
 
 
  On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin r...@databricks.com
 wrote:
 
  Not sure exactly what is happening but perhaps there are ways to
  restructure your program for it to work better. Spark is definitely
 able
  to
  handle much, much larger workloads.
 
  I've personally run a workload that shuffled 300 TB of 

Re: Comparative study

2014-07-07 Thread Daniel Siegmann
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels much more natural to
me. Testing and local development is also very easy - creating a local
Spark context is trivial and it reads local files. For your unit tests you
can just have them create a local context and execute your flow with some
test data. Even better, you can do ad-hoc work in the Spark shell and if
you want that in your production code it will look exactly the same.
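
For example (a minimal sketch -- the flow and file names are placeholders),
a test can boot a throwaway local context and run the real flow on a few
records:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on 1.0)

    object WordCountFlow {
      // the same code runs unchanged on a cluster or inside a test
      def run(sc: SparkContext, path: String) =
        sc.textFile(path)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
    }

    object LocalSmokeTest {
      def main(args: Array[String]) {
        val sc = new SparkContext(
          new SparkConf().setAppName("smoke-test").setMaster("local[2]"))
        try {
          // any small local file stands in for the production input
          WordCountFlow.run(sc, "src/test/resources/sample.txt")
            .collect()
            .foreach(println)
        } finally {
          sc.stop()
        }
      }
    }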

Unfortunately, the picture isn't so rosy when it gets to production. In my
experience, Spark simply doesn't scale to the volumes that MapReduce will
handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be
better, but I haven't had the opportunity to try them. I find jobs tend to
just hang forever for no apparent reason on large data sets (but smaller
than what I push through MapReduce).

I am hopeful the situation will improve - Spark is developing quickly - but
if you have large amounts of data you should proceed with caution.

Keep in mind there are some frameworks for Hadoop which can hide the ugly
MapReduce with something very similar in form to Spark's API; e.g. Apache
Crunch. So you might consider those as well.

(Note: the above is with Spark 1.0.0.)



On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote:

  Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapReduce. Is it worth migrating from existing MR implementation
 to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh

 --

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


RE: Comparative study

2014-07-07 Thread santosh.viswanathan
Thanks Daniel for sharing this info.

Regards,
Santosh Karthikeyan

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Tuesday, July 08, 2014 1:10 AM
To: user@spark.apache.org
Subject: Re: Comparative study

From a development perspective, I vastly prefer Spark to MapReduce. The 
MapReduce API is very constrained; Spark's API feels much more natural to me. 
Testing and local development is also very easy - creating a local Spark 
context is trivial and it reads local files. For your unit tests you can just 
have them create a local context and execute your flow with some test data. 
Even better, you can do ad-hoc work in the Spark shell and if you want that in 
your production code it will look exactly the same.
Unfortunately, the picture isn't so rosy when it gets to production. In my 
experience, Spark simply doesn't scale to the volumes that MapReduce will 
handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be 
better, but I haven't had the opportunity to try them. I find jobs tend to just 
hang forever for no apparent reason on large data sets (but smaller than what I 
push through MapReduce).
I am hopeful the situation will improve - Spark is developing quickly - but if 
you have large amounts of data you should proceed with caution.
Keep in mind there are some frameworks for Hadoop which can hide the ugly 
MapReduce with something very similar in form to Spark's API; e.g. Apache 
Crunch. So you might consider those as well.
(Note: the above is with Spark 1.0.0.)


On Mon, Jul 7, 2014 at 11:07 AM, 
santosh.viswanat...@accenture.com
wrote:
Hello Experts,

I am doing some comparative study on the below:

Spark vs Impala
Spark vs MapReduce. Is it worth migrating from existing MR implementation to 
Spark?


Please share your thoughts and expertise.


Thanks,
Santosh



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com



--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Comparative study

2014-07-07 Thread Nabeel Memon
For a Scala API on map/reduce (the Hadoop engine) there's a library called
Scalding. It's built on top of Cascading. If you have a huge dataset, or
if you're considering the map/reduce engine for your job for any reason,
you can try Scalding.
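
The canonical Scalding job looks something like this (a sketch of the
fields-based API; the input/output arguments and the invocation are
illustrative):

    import com.twitter.scalding._

    // typically run via: hadoop jar your-assembly.jar com.twitter.scalding.Tool \
    //   WordCountJob --hdfs --input in.txt --output out.tsv  (details vary by setup)
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }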

However, Spark vs Impala doesn't make sense to me. It should've really been
Shark vs Impala. Both are SQL querying engines built on top of Spark and
Hadoop (map/reduce engine) respectively.


On Mon, Jul 7, 2014 at 4:06 PM, santosh.viswanat...@accenture.com wrote:

  Thanks Daniel for sharing this info.



 Regards,
 Santosh Karthikeyan



 *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
 *Sent:* Tuesday, July 08, 2014 1:10 AM
 *To:* user@spark.apache.org
 *Subject:* Re: Comparative study



 From a development perspective, I vastly prefer Spark to MapReduce. The
 MapReduce API is very constrained; Spark's API feels much more natural to
 me. Testing and local development is also very easy - creating a local
 Spark context is trivial and it reads local files. For your unit tests you
 can just have them create a local context and execute your flow with some
 test data. Even better, you can do ad-hoc work in the Spark shell and if
 you want that in your production code it will look exactly the same.

 Unfortunately, the picture isn't so rosy when it gets to production. In my
 experience, Spark simply doesn't scale to the volumes that MapReduce will
 handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be
 better, but I haven't had the opportunity to try them. I find jobs tend to
 just hang forever for no apparent reason on large data sets (but smaller
 than what I push through MapReduce).

 I am hopeful the situation will improve - Spark is developing quickly -
 but if you have large amounts of data you should proceed with caution.

 Keep in mind there are some frameworks for Hadoop which can hide the ugly
 MapReduce with something very similar in form to Spark's API; e.g. Apache
 Crunch. So you might consider those as well.

 (Note: the above is with Spark 1.0.0.)





 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com
 wrote:

 Hello Experts,



 I am doing some comparative study on the below:



 Spark vs Impala

 Spark vs MapReduce. Is it worth migrating from existing MR implementation
 to Spark?





 Please share your thoughts and expertise.





 Thanks,
 Santosh


  --


 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited. Where allowed
 by local law, electronic communications with Accenture and its affiliates,
 including e-mail and instant messaging (including content), may be scanned
 by our systems for the purposes of information security and assessment of
 internal compliance with Accenture policy.

 __

 www.accenture.com




 --

 Daniel Siegmann, Software Developer
 Velos

 Accelerating Machine Learning


 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io



Re: Comparative study

2014-07-07 Thread Sean Owen
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote:

 For a Scala API on map/reduce (the Hadoop engine) there's a library called
 Scalding. It's built on top of Cascading. If you have a huge dataset, or
 if you're considering the map/reduce engine for your job for any reason,
 you can try Scalding.


PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs
on Spark too, not just M/R.


Re: Comparative study

2014-07-07 Thread Soumya Simanta


Daniel, 

Do you mind sharing the size of your cluster and the production data volumes?

Thanks
Soumya 

 On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
 
 From a development perspective, I vastly prefer Spark to MapReduce. The 
 MapReduce API is very constrained; Spark's API feels much more natural to me. 
 Testing and local development is also very easy - creating a local Spark 
 context is trivial and it reads local files. For your unit tests you can just 
 have them create a local context and execute your flow with some test data. 
 Even better, you can do ad-hoc work in the Spark shell and if you want that 
 in your production code it will look exactly the same.
 
 Unfortunately, the picture isn't so rosy when it gets to production. In my 
 experience, Spark simply doesn't scale to the volumes that MapReduce will 
 handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be 
 better, but I haven't had the opportunity to try them. I find jobs tend to 
 just hang forever for no apparent reason on large data sets (but smaller than 
 what I push through MapReduce).
 
 I am hopeful the situation will improve - Spark is developing quickly - but 
 if you have large amounts of data you should proceed with caution.
 
 Keep in mind there are some frameworks for Hadoop which can hide the ugly 
 MapReduce with something very similar in form to Spark's API; e.g. Apache 
 Crunch. So you might consider those as well.
 
 (Note: the above is with Spark 1.0.0.)
 
 
 
 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote:
 Hello Experts,
 
  
 
 I am doing some comparative study on the below:
 
  
 
 Spark vs Impala
 
 Spark vs MapReduce. Is it worth migrating from existing MR implementation
 to Spark?
 
  
 
  
 
 Please share your thoughts and expertise.
 
  
 
  
 
 Thanks,
 Santosh
 
 
 
 This message is for the designated recipient only and may contain 
 privileged, proprietary, or otherwise confidential information. If you have 
 received it in error, please notify the sender immediately and delete the 
 original. Any other use of the e-mail by you is prohibited. Where allowed by 
 local law, electronic communications with Accenture and its affiliates, 
 including e-mail and instant messaging (including content), may be scanned 
 by our systems for the purposes of information security and assessment of 
 internal compliance with Accenture policy. 
 __
 
 www.accenture.com
 
 
 
 -- 
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning
 
 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io