Hi guys,
We ultimately needed to add 8 EC2 xlarge instances to get better performance. As was suspected, we could not fit all the data into RAM.
This worked great with the 100-350 MB files our initial export task produced. Unfortunately, for the partition settings that we were able to
We have a bunch of data in Redshift tables that we'd like to pull into Spark during job runs. What is the path/URL format one uses to pull data from there? (This is in reference to using
https://github.com/mengxr/redshift-input-format)
I'd guess that it's an s3n://key:secret_key@bucket/path URL from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong, though.
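Something like this is what I had in mind - untested, and the package/class names are my best reading of that repo's README, with the bucket, prefix, and keys as placeholders:

    // Sketch only: verify the exact package against the repo README.
    import org.apache.spark.{SparkConf, SparkContext}
    import com.databricks.examples.redshift.input.RedshiftInputFormat

    // The data must have been produced with UNLOAD ... ESCAPE, e.g.:
    //   UNLOAD ('SELECT * FROM my_table')
    //   TO 's3://my-bucket/my-prefix/'
    //   CREDENTIALS 'aws_access_key_id=KEY;aws_secret_access_key=SECRET'
    //   ESCAPE;

    val sc = new SparkContext(new SparkConf().setAppName("redshift-read"))

    // s3n URL with embedded credentials, per the guess above.
    val path = "s3n://KEY:SECRET_KEY@my-bucket/my-prefix"

    // Key = byte offset of the record, value = the record's parsed fields.
    val records = sc.newAPIHadoopFile(
      path,
      classOf[RedshiftInputFormat],
      classOf[java.lang.Long],
      classOf[Array[String]])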
On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com wrote:
Hmm, we actually read the CSV data in S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100 GB of text data for a job directly from S3 - our hope had been that connecting directly to Redshift would provide some boost.
We had been using 12 m3.xlarges,
I'll try this out and follow up with what I find.
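One thing that might be worth trying in the meantime: ask for more input partitions up front, so every core across the 12 nodes pulls from S3 concurrently. A rough sketch (the path and partition count are placeholders; sc is the usual spark-shell context):

    // Force a higher minimum partition count for the ~100 GB text read so
    // the S3 fetches fan out across all cores, not just the default splits.
    val text = sc.textFile("s3n://my-bucket/my-prefix", minPartitions = 400)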
On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com wrote:
For each node, if the CSV reader is implemented efficiently, you should be
able to hit at least half of the theoretical network bandwidth, which is
about
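(The figure is cut off above, but as a purely illustrative back-of-envelope: if each node had a 1 Gbps NIC, half of that is ~62.5 MB/s per node, or ~750 MB/s aggregate across 12 nodes - so scanning 100 GB of text should take on the order of 100,000 MB / 750 MB/s ≈ 2-3 minutes, not hours.)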