I'm considering whether it is worth introducing Spark at my new
company. The data is nowhere near Hadoop size at this point (it sits in
an RDS Postgres cluster).
I'm wondering at which point it is worth the overhead of adding the Spark
infrastructure (deployment scripts, monitoring,
at 8:40 AM, Gary Malouf malouf.g...@gmail.com
wrote:
> I'm considering whether it is worth introducing Spark at my new
> company. The data is nowhere near Hadoop size at this point (it sits in
> an RDS Postgres cluster).
Will it ever become Hadoop size? Looking at the overhead of running
So when deciding whether to take on installing/configuring Spark, the size
of the data alone does not automatically make that decision in your mind?
Thanks,
Gary
On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi
On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g
We keep running into https://issues.apache.org/jira/browse/SPARK-2823 when
trying to use GraphX. The cost of repartitioning the data is really high
for us (lots of network traffic) which is killing the job performance.
I understand the change was reverted to stabilize unit tests, but frankly the
cost of repartitioning right now is not reasonable for us, and this bug
prevents solving the issue.
Has anyone else run into this type of error? We are not sure what the
issue is nor how to correct it to get our job to complete...
On Fri, Nov 14, 2014 at 9:29 PM, Gary Malouf malouf.g...@gmail.com wrote:
I'll try this out and follow up with what I find.
On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com
wrote:
For each node
We have a bunch of data in RedShift tables that we'd like to pull in during
job runs to Spark. What is the path/url format one uses to pull data from
there? (This is in reference to using the
https://github.com/mengxr/redshift-input-format)
Michael Armbrust mich...@databricks.com
wrote:
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
command used to produce the data. Xiangrui can correct me if I'm wrong
though.
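If that guess is right, the UNLOAD output path could be assembled like this (the bucket and keys below are placeholders, not real values; note a known s3n gotcha is that a secret key containing "/" needs URL-escaping):

```python
from urllib.parse import quote

# Placeholder credentials -- for illustration only.
access_key = "AKIAEXAMPLE"
secret_key = "abc/def+EXAMPLE"
bucket = "my-unload-bucket"
prefix = "unload/part-"

# Escape the secret so '/' does not break the URL structure.
path = "s3n://{}:{}@{}/{}".format(
    access_key, quote(secret_key, safe=""), bucket, prefix)
print(path)  # s3n://AKIAEXAMPLE:abc%2Fdef%2BEXAMPLE@my-unload-bucket/unload/part-
```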
On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com
wrote:
We have a bunch of data
the data
to be efficient, you may need a larger cluster or change the storage level
to MEMORY_AND_DISK.
-Xiangrui
On Nov 14, 2014, at 5:32 PM, Gary Malouf malouf.g...@gmail.com wrote:
Hmm, we actually read the CSV data in S3 now and were looking to avoid
that. Unfortunately, we've experienced
Cloudera had a blog post about this in August 2013:
http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
Has anyone been using this in production? Curious as to whether it made a
significant difference from a Spark perspective.
I have a use case for our data in HDFS that involves sorting chunks of data
into time series format by a specific characteristic and doing computations
from that. At large scale, what is the most efficient way to do this?
Obviously, having the data sharded by that characteristic would make the
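The core of that pattern can be sketched outside Spark (pure Python, with a made-up record layout of characteristic/timestamp/value): sort once by the sharding characteristic and then by time, so each key's chunk comes out as an ordered time series ready for per-series computation.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (characteristic, timestamp, value)
records = [
    ("deviceB", 3, 30.0),
    ("deviceA", 2, 2.0),
    ("deviceA", 1, 1.0),
    ("deviceB", 1, 10.0),
]

# Sort by the characteristic, then by timestamp, so each group
# is already in time-series order when we iterate it.
records.sort(key=itemgetter(0, 1))

series = {
    key: [value for _, _, value in chunk]
    for key, chunk in groupby(records, key=itemgetter(0))
}
print(series)  # {'deviceA': [1.0, 2.0], 'deviceB': [10.0, 30.0]}
```

In Spark terms the analogous move is a single repartition-and-sort by that key, so the expensive shuffle happens once up front.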
We have our quantitative team using Spark as part of their daily work. One
of the more common problems we run into is that people unintentionally
leave their shells open throughout the day. This eats up memory in the
cluster and causes others to have limited resources to run their jobs.
With
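One hedged mitigation in standalone mode (property names are from the Spark configuration docs; the values here are only examples) is to cap what any single application, including an idle shell, can hold:

```
# In each user's spark-defaults.conf or SparkConf:
spark.cores.max        8     # example: max cores any one app can take
spark.executor.memory  4g    # example: per-executor memory limit
```

This doesn't free memory from a forgotten shell, but it bounds how much of the cluster one shell can starve.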
We are probably still the minority, but our analytics platform based on
Spark + HDFS does not have map/reduce installed. I'm wondering if there is
a distcp equivalent that leverages Spark to do the work.
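One sketch of the idea (pure Python, hypothetical function name) is the scheduling half of a poor-man's distcp: balance the file list across workers by size, then a Spark job over the per-worker batches would do the actual copies.

```python
import heapq

def assign_copies(files, n_workers):
    """Greedy balance: give each (path, size) to the least-loaded worker."""
    heap = [(0, i) for i in range(n_workers)]  # (bytes assigned, worker id)
    heapq.heapify(heap)
    plan = {i: [] for i in range(n_workers)}
    # Largest files first makes the greedy split much more even.
    for path, size in sorted(files, key=lambda f: -f[1]):
        load, worker = heapq.heappop(heap)
        plan[worker].append(path)
        heapq.heappush(heap, (load + size, worker))
    return plan

plan = assign_copies([("/a", 100), ("/b", 60), ("/c", 50)], 2)
print(plan)  # {0: ['/a'], 1: ['/b', '/c']}
```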
Our team is trying to find the best way to do cross-datacenter replication
of our HDFS data
My company is leaning towards moving much of their analytics work from our
own Spark/Mesos/HDFS/Cassandra set up to RedShift. To date, I have been
the internal advocate for using Spark for analytics, but a number of good
points have been brought up to me. The reasons being pushed are:
-
changes this (
https://github.com/apache/spark/commit/09f7e4587bbdf74207d2629e8c1314f93d865999)
in that you can now manually configure all ports and only open up the ones
you configured. This will be available in Spark 1.1.
-Andrew
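For reference, the port properties that became configurable around that change (names as listed in the Spark 1.1 configuration docs; the port values below are just examples) look like:

```
spark.driver.port            51000
spark.fileserver.port        51001
spark.broadcast.port         51002
spark.replClassServer.port   51003
spark.blockManager.port      51004
spark.executor.port          51005
```

With these pinned, a firewall only needs that fixed range open instead of all ephemeral ports.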
2014-08-06 8:29 GMT-07:00 Gary Malouf malouf.g...@gmail.com
I have a few questions about managing Spark memory:
1) In a standalone setup, is there any CPU prioritization across users
running jobs? If so, what is the behavior here?
2) With Spark 1.1, users will more easily be able to run drivers/shells
from remote locations that do not cause firewall
regards,
Ron Daniel, Jr.
Director, Elsevier Labs
r.dan...@elsevier.com
mobile: +1 619 208 3064
*From:* Gary Malouf [mailto:malouf.g...@gmail.com]
*Sent:* Wednesday, August 06, 2014 12:35 PM
*To:* Daniel, Ronald (ELS-SDG)
*Subject:* Re: Regarding tooling/performance vs RedShift
Also, regarding something like redshift not having MLlib built in, much of
that could be done on the derived results.
On Aug 6, 2014 4:07 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Mostly I was
PM, Gary Malouf malouf.g...@gmail.com wrote:
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
well. Looking at the Mesos slave logs, I noticed:
ERROR KryoSerializer: Failed to run spark.kryo.registrator
java.lang.ClassNotFoundException:
com/mediacrossing/verrazano/kryo
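A ClassNotFoundException on the registrator usually means the jar containing it never reached the executors. A hedged sketch of the relevant settings (the class and jar names below are placeholders, not the actual ones from this job):

```
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  com.example.MyKryoRegistrator   # must be on the executor classpath
spark.jars              /path/to/assembly-with-registrator.jar
```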
I am aware that today PySpark can not load sequence files directly. Are
there work-arounds people are using (short of duplicating all the data to
text files) for accessing this data?
Has anyone reported issues using SparkSQL with sequence files (all of our
data is in this format within HDFS)? We are considering whether to burn
the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
point for us.
Go to expedia/orbitz and look for hotels in the union square neighborhood.
In my humble opinion, having visited San Francisco, it is worth any extra
cost to be as close as possible to the conference versus having to travel
from other parts of the city.
On Tue, May 27, 2014 at 9:36 AM, Gerard Maas
For what it is worth, our team here at
MediaCrossing (http://mediacrossing.com) has
been using the Spark/Mesos combination since last summer with much success
(low operations overhead, high developer productivity).
IMO, Hadoop is overcomplicated from both a development and operations
perspective so I