CoHadoop Papers

2014-08-26 Thread Gary Malouf
One of my colleagues has been asking me why Spark/HDFS makes no attempt to
co-locate related data blocks.  He pointed to this 2011 paper on the CoHadoop
research and the performance improvements it yielded for MapReduce jobs:
http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf

Would leveraging these ideas when writing data from Spark make sense and be
worthwhile?


Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
It appears that support for this type of control over block placement is
landing in the next version of HDFS:
https://issues.apache.org/jira/browse/HDFS-2576


Re: CoHadoop Papers

2014-08-26 Thread Christopher Nguyen
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?

If the former, Spark does support co-partitioning; see the sketch at the end
of this message.

If the latter, that's an HDFS concern outside Spark's scope. On that note,
Hadoop does make some attempt to co-locate data, e.g., via rack awareness. I'm
sure the paper makes useful contributions for its set of use cases.
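
On the co-partitioning point, here is a minimal sketch with the RDD API
(names, data, and the partition count are illustrative only, not from the
paper):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._  // pair RDD functions in Spark 1.x

val sc = new SparkContext(
  new SparkConf().setAppName("copartition-sketch").setMaster("local[*]"))

// Hash-partition both key-value RDDs with the same partitioner instance.
val part   = new HashPartitioner(8)
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part).cache()
val orders = sc.parallelize(Seq((1, 9.99), (2, 5.00))).partitionBy(part).cache()

// Because both sides share a partitioner, matching keys live in the same
// partition, and this join runs as a narrow dependency with no shuffle.
users.join(orders).collect().foreach(println)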

Sent while mobile. Pls excuse typos etc.


Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
Christopher, can you expand on the co-partitioning support?

We have a number of Spark SQL tables (saved in Parquet format) that can all be
considered to share a common hash key.  Our analytics team wants to do
frequent joins across these different datasets based on this key.  It stands
to reason that if the data for each key across 'tables' were co-located on the
same server, shuffles could be minimized and performance could ultimately be
much better.
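
For concreteness, the joins look roughly like this (hypothetical paths, table
names, and columns; Spark SQL 1.0.x API):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Two Parquet 'tables' that were both written hashed on user_id.
sqlContext.parquetFile("hdfs:///warehouse/profiles").registerAsTable("profiles")
sqlContext.parquetFile("hdfs:///warehouse/events").registerAsTable("events")

// Today this join shuffles both sides on user_id, even though each
// side was already written out pre-hashed on that key.
val joined = sqlContext.sql(
  "SELECT p.user_id, p.segment, e.event_type " +
  "FROM profiles p JOIN events e ON p.user_id = e.user_id")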

From reading the HDFS issue I posted earlier, it looks like the way is being
paved for implementing this type of behavior, though I believe there are a lot
of complications in making it work.


Re: CoHadoop Papers

2014-08-26 Thread Michael Armbrust
It seems like there are two things here:
 - Co-locating blocks with the same keys to avoid network transfer.
 - Leveraging partitioning information to avoid a shuffle when data is
already partitioned correctly (even if those partitions aren't yet on the
same machine).

The former seems more complicated and probably requires the HDFS support you
linked to.  However, the latter might be easier, since the Spark SQL planner
already has a framework for reasoning about partitioning and the need to
shuffle.
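
As a rough illustration of the second item, the decision amounts to a check
like this (a sketch of the idea, not the actual planner code):

import org.apache.spark.rdd.RDD

// A key-based join only needs a shuffle when the two sides do not
// already share the same partitioning.
def needsShuffle[K, V, W](left: RDD[(K, V)], right: RDD[(K, W)]): Boolean =
  (left.partitioner, right.partitioner) match {
    case (Some(p1), Some(p2)) => p1 != p2 // shuffle only if layouts differ
    case _                    => true     // no known layout on at least one side
  }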


Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
Hi Michael,

I think once that work lands in HDFS, it will be great to expose this
functionality via Spark.  It is worth pursuing because it could yield
order-of-magnitude performance improvements in cases where people need to
join data.

The second item would also be very interesting and could yield significant
performance boosts.

Best,

Gary

