Re: Sourcing data from Redshift

2014-11-18 Thread Gary Malouf
Hi guys, we ultimately needed to add 8 EC2 xlarge instances to get better performance. As was suspected, we could not fit all the data into RAM. This worked great with files around 100-350 MB in size, as our initial export task produced. Unfortunately, for the partition settings that we were able to
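For readers tuning a similar setup, a minimal sketch of controlling partition counts when the export produces files of uneven size (the path and all numbers are illustrative assumptions, not values from this thread):

    // Hint a minimum partition count at load time; Spark may still split
    // further along S3/HDFS block boundaries.
    val raw = sc.textFile("s3n://my-bucket/unload-prefix/", 400)

    // Rebalance explicitly if downstream stages need different parallelism,
    // e.g. a multiple of (nodes * cores); 32 here is an illustrative guess.
    val balanced = raw.repartition(32)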

Sourcing data from Redshift

2014-11-14 Thread Gary Malouf
We have a bunch of data in Redshift tables that we'd like to pull into Spark during job runs. What is the path/URL format one uses to pull data from there? (This is in reference to using https://github.com/mengxr/redshift-input-format.)
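For reference, that project reads Redshift UNLOAD output through a Hadoop input format rather than a special URL scheme; a minimal sketch follows (the package, class, and key/value types are taken from the repo README of that era and should be verified against the version in use; bucket and prefix are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    // Package/class name as in the redshift-input-format README; verify for your version.
    import com.databricks.examples.redshift.input.RedshiftInputFormat

    val sc = new SparkContext(new SparkConf().setAppName("read-redshift-unload"))

    // Point at the S3 prefix that UNLOAD wrote to (placeholder names).
    val path = "s3n://my-bucket/unload-prefix/"

    // Yields (byte offset, parsed fields) pairs, with UNLOAD's escaping handled
    // by the input format.
    val records = sc.newAPIHadoopFile(
      path,
      classOf[RedshiftInputFormat],
      classOf[java.lang.Long],
      classOf[Array[String]])

    println(records.count())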

Re: Sourcing data from Redshift

2014-11-14 Thread Michael Armbrust
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong, though. On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com wrote: We have a bunch of data in Redshift tables that we'd like to pull in
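For context, the UNLOAD itself runs inside Redshift (e.g. over its PostgreSQL-protocol JDBC endpoint), and the redshift-input-format README recommends the ESCAPE option so delimiters embedded in field values stay parseable; a minimal sketch with placeholder cluster, table, bucket, and credential values:

    import java.sql.DriverManager

    // Placeholder endpoint and credentials; Redshift speaks the PostgreSQL wire protocol.
    val url = "jdbc:postgresql://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb"
    val conn = DriverManager.getConnection(url, "user", "password")
    try {
      // ESCAPE makes UNLOAD escape delimiter characters inside field values,
      // which the input format above expects.
      conn.createStatement().execute(
        """UNLOAD ('SELECT * FROM my_table')
          |TO 's3://my-bucket/unload-prefix/'
          |CREDENTIALS 'aws_access_key_id=ACCESS_KEY;aws_secret_access_key=SECRET_KEY'
          |ESCAPE""".stripMargin)
    } finally {
      conn.close()
    }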

Re: Sourcing data from Redshift

2014-11-14 Thread Gary Malouf
Hmm, we actually read the CSV data from S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100 GB of text data for a job directly from S3 - our hope had been that connecting directly to Redshift would provide some boost. We had been using 12 m3.xlarges,

Re: Sourcing data from Redshift

2014-11-14 Thread Gary Malouf
I'll try this out and follow up with what I find. On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com wrote: For each node, if the CSV reader is implemented efficiently, you should be able to hit at least half of the theoretical network bandwidth, which is about