I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng <m...@databricks.com> wrote:
> For each node, if the CSV reader is implemented efficiently, you should
> be able to hit at least half of the theoretical network bandwidth, which
> is about 60MB/second/node. So if you just do counting, the expected time
> should be within 3 minutes.
>
> Note that your cluster has 15GB * 12 = 180GB of RAM in total. If you use
> the default spark.storage.memoryFraction, it can barely cache 100GB of
> data, not considering the overhead. So if your operation needs to cache
> the data to be efficient, you may need a larger cluster or change the
> storage level to MEMORY_AND_DISK.
>
> -Xiangrui
>
> On Nov 14, 2014, at 5:32 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
> Hmm, we actually read the CSV data in S3 now and were looking to avoid
> that. Unfortunately, we've experienced dreadful performance reading 100GB
> of text data for a job directly from S3 - our hope had been that
> connecting directly to Redshift would provide some boost.
>
> We had been using 12 m3.xlarges, but increasing default parallelism (to
> 2x the number of CPUs across the cluster) and increasing partitions
> during reading did not seem to help.
>
> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <m...@databricks.com>
> wrote:
>
>> Michael is correct. Using a direct connection to dump data would be
>> slow because there is only a single connection. Please use UNLOAD with
>> the ESCAPE option to dump the table to S3. See the instructions at
>>
>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>>
>> Then load the files back using the Redshift input format we wrote:
>> https://github.com/databricks/spark-redshift (we moved the
>> implementation to github/databricks). Right now all columns are loaded
>> as string columns, and you need to do the type casting manually. We
>> plan to add a parser that can translate a Redshift table schema
>> directly to a Spark SQL schema, but no ETA yet.
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>> command used to produce the data. Xiangrui can correct me if I'm wrong,
>> though.
>>
>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <malouf.g...@gmail.com>
>> wrote:
>>
>>> We have a bunch of data in Redshift tables that we'd like to pull into
>>> Spark during job runs. What is the path/URL format one uses to pull
>>> data from there? (This is in reference to using
>>> https://github.com/mengxr/redshift-input-format.)
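
For anyone following the thread, here is a minimal sketch of the UNLOAD
step Xiangrui describes, issued over a plain JDBC connection. The cluster
endpoint, database, user, password, table, and bucket names are all
placeholders; the statement syntax follows the AWS documentation linked
above.

  import java.sql.DriverManager

  // Placeholder endpoint, database, and credentials -- substitute your own.
  val url = "jdbc:postgresql://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb"
  val conn = DriverManager.getConnection(url, "master_user", "password")
  try {
    val stmt = conn.createStatement()
    // ESCAPE backslash-escapes delimiters, quotes, and newlines embedded
    // in fields, so the unloaded files can be split and parsed reliably.
    stmt.execute(
      """UNLOAD ('SELECT * FROM my_table')
        |TO 's3://my-bucket/unload/my_table_'
        |CREDENTIALS 'aws_access_key_id=<your-id>;aws_secret_access_key=<your-key>'
        |ESCAPE""".stripMargin)
  } finally {
    conn.close()
  }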
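
Loading the unloaded files back then goes through the input format from
the spark-redshift repository. This sketch follows the project's README of
the time, assuming an existing SparkContext named sc; the exact class
names and type parameters may have changed since, and the S3 path is
illustrative.

  import com.databricks.spark.redshift.RedshiftInputFormat

  val records = sc.newAPIHadoopFile(
    "s3n://my-bucket/unload/",      // illustrative path to the UNLOAD output
    classOf[RedshiftInputFormat],
    classOf[java.lang.Long],        // key: byte offset of the record
    classOf[Array[String]])         // value: one Array[String] per row

  // Every column comes back as a string, so cast manually for now, e.g.:
  val parsed = records.map { case (_, cols) => (cols(0).toLong, cols(1)) }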
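
Finally, a small illustration of the storage-level change suggested at the
top of the thread. The persist call is standard Spark API; only the path
is a placeholder.

  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("s3n://my-bucket/unload/")  // placeholder path
  // MEMORY_AND_DISK: partitions that don't fit in memory spill to local
  // disk instead of being dropped and recomputed -- relevant here, where
  // ~100GB of data approaches the cluster's 180GB of total RAM.
  lines.persist(StorageLevel.MEMORY_AND_DISK)
  println(lines.count())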