Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
So when deciding whether to take on installing/configuring Spark, the size
of the data alone does not automatically make that decision in your mind.

Thanks,

Gary

On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer  wrote:

> Hi
>
> On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf 
> wrote:
>
>> The honest answer is that it is unclear to me at this point.  I guess
>> what I am really wondering is if there are cases where one would find it
>> beneficial to use Spark against one or more RDBs?
>>
>
> Well, RDBs are all about *storage*, while Spark is about *computation*. If
> you have a very expensive computation (that can be parallelized in some
> way), then you might want to use Spark, even though your data lives in an
> ordinary RDB. Think raytracing, where you do something "for every pixel in
> the output image" and you could get your scene description from a database,
> write the result to a database, but use Spark to do two minutes of
> calculation for every pixel in parallel (or so).
>
> Tobias
>
>
>
>
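
A minimal sketch of that pattern, driving an expensive per-row computation from an
ordinary RDB via Spark's JdbcRDD; the table, column names and the per-row function
below are hypothetical, only the JdbcRDD call itself is real API:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext("local[4]", "rdb-compute")

// Stand-in for the "two minutes of calculation per pixel" step.
def expensiveCompute(scene: String): Int = scene.hashCode

// JdbcRDD partitions the query by filling the two '?' placeholders with
// bounds on a numeric column, so each partition pulls its own slice.
val rows = new JdbcRDD(
  sc,
  () => {
    Class.forName("org.postgresql.Driver")
    DriverManager.getConnection("jdbc:postgresql://dbhost/scenes", "user", "pass")
  },
  "SELECT id, scene FROM pixels WHERE id >= ? AND id <= ?",
  1L, 1000000L, 20,                                   // lower, upper, numPartitions
  (r: ResultSet) => (r.getLong(1), r.getString(2)))

// The heavy work runs in parallel on the cluster; results can then be
// written back to the database or anywhere else.
val results = rows.mapValues(expensiveCompute).collect()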


Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
The honest answer is that it is unclear to me at this point.  I guess what
I am really wondering is if there are cases where one would find it
beneficial to use Spark against one or more RDBs?

On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer  wrote:

> Gary,
>
> On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf 
> wrote:
>
>> I'm considering whether or not it is worth introducing Spark at my new
company.  The data is nowhere near Hadoop size at this point (it sits in
>> an RDS Postgres cluster).
>>
>
> Will it ever become "Hadoop size"? Looking at the overhead of running even
> a simple Hadoop setup (securely and with good performance, given about 1e6
> configuration parameters), I think it makes sense to stay in non-Hadoop
> mode as long as possible. People may disagree ;-)
>
> Tobias
>
> PS. You may also want to have a look at
> http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
>
>


Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
I'm considering whether or not it is worth introducing Spark at my new
company.  The data is nowhere near Hadoop size at this point (it sits in
an RDS Postgres cluster).

I'm wondering at which point it is worth the overhead of adding the Spark
infrastructure (deployment scripts, monitoring, etc).


Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Gary Malouf
Has anyone else received this type of error?  We are not sure what the
issue is nor how to correct it to get our job to complete...


GraphX bug re-opened

2014-11-19 Thread Gary Malouf
We keep running into https://issues.apache.org/jira/browse/SPARK-2823 when
trying to use GraphX.  The cost of repartitioning the data is really high
for us (lots of network traffic) which is killing the job performance.

I understand the fix was reverted to stabilize unit tests, but frankly it
makes it very hard to tune Spark applications with the limits this puts on
someone.  What is the process to get fixing this prioritized if we do not
have the cycles to do it ourselves?


Re: Sourcing data from RedShift

2014-11-18 Thread Gary Malouf
Hi guys,

We ultimately needed to add 8 ec2 xl's to get better performance.  As was
suspected, we could not fit all the data into RAM.

This worked great with files around 100-350MB in size, as our initial
export task produced.  Unfortunately, with the partition settings that we
were able to get to work with GraphX (we are unable to change parallelism due
to the bug), we cannot keep writing files at this size - our output ends up
being closer to 1GB per file.

As a result, our job seems to struggle working with 100GB worth of these
files.  We are in a rough spot because upgrading Spark right now is not
reasonable for us but this bug prevents solving the issue.

On Fri, Nov 14, 2014 at 9:29 PM, Gary Malouf  wrote:

> I'll try this out and follow up with what I find.
>
> On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng 
> wrote:
>
>> For each node, if the CSV reader is implemented efficiently, you should
>> be able to hit at least half of the theoretical network bandwidth, which is
>> about 60MB/second/node. So if you just do counting, the expected time should
>> be within 3 minutes.
>>
>> Note that your cluster has 15GB * 12 = 180GB RAM in total. If you use
>> the default spark.storage.memoryFraction, it can barely cache 100GB of
>> data, not considering the overhead. So if your operation needs to cache the
>> data to be efficient, you may need a larger cluster or change the storage
>> level to MEMORY_AND_DISK.
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 5:32 PM, Gary Malouf  wrote:
>>
>> Hmm, we actually read the CSV data in S3 now and were looking to avoid
>> that.  Unfortunately, we've experienced dreadful performance reading 100GB
>> of text data for a job directly from S3 - our hope had been that connecting
>> directly to Redshift would provide some boost.
>>
>> We had been using 12 m3.xlarges, but increasing default parallelism (to
>> 2x # of cpus across cluster) and increasing partitions during reading did
>> not seem to help.
>>
>> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng 
>> wrote:
>>
>>> Michael is correct. Using a direct connection to dump data would be slow
>>> because there is only a single connection. Please use UNLOAD with ESCAPE
>>> option to dump the table to S3. See instructions at
>>>
>>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>>>
>>> And then load them back using the redshift input format we wrote:
>>> https://github.com/databricks/spark-redshift (we moved the
>>> implementation to github/databricks). Right now all columns are loaded as
>>> string columns, and you need to do type casting manually. We plan to add a
>>> parser that can translate Redshift table schema directly to Spark SQL
>>> schema, but no ETA yet.
>>>
>>> -Xiangrui
>>>
>>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust 
>>> wrote:
>>>
>>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>>> command used to produce the data.  Xiangrui can correct me if I'm wrong
>>> though.
>>>
>>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf 
>>> wrote:
>>>
>>>> We have a bunch of data in RedShift tables that we'd like to pull in
>>>> during job runs to Spark.  What is the path/url format one uses to pull
>>>> data from there?  (This is in reference to using the
>>>> https://github.com/mengxr/redshift-input-format)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
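
Since the input format hands every column back as a string (as Xiangrui notes in
the quoted message), the manual casting step looks roughly like this; the case
class and field order are hypothetical, and obtaining the RDD[Array[String]] is
per the spark-redshift README:

import org.apache.spark.rdd.RDD

case class Edge(srcId: Long, dstId: Long, weight: Double)

// rows: the RDD the Redshift input format produces, every column a String.
def toEdges(rows: RDD[Array[String]]): RDD[Edge] =
  rows.map { cols =>
    Edge(cols(0).trim.toLong, cols(1).trim.toLong, cols(2).trim.toDouble)
  }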


Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng  wrote:

> For each node, if the CSV reader is implemented efficiently, you should be
> able to hit at least half of the theoretical network bandwidth, which is
> about 60MB/second/node. So if you just do counting, the expected time should
> be within 3 minutes.
>
> Note that your cluster has 15GB * 12 = 180GB RAM in total. If you use the
> default spark.storage.memoryFraction, it can barely cache 100GB of data,
> not considering the overhead. So if your operation needs to cache the data
> to be efficient, you may need a larger cluster or change the storage level
> to MEMORY_AND_DISK.
>
> -Xiangrui
>
> On Nov 14, 2014, at 5:32 PM, Gary Malouf  wrote:
>
> Hmm, we actually read the CSV data in S3 now and were looking to avoid
> that.  Unfortunately, we've experienced dreadful performance reading 100GB
> of text data for a job directly from S3 - our hope had been that connecting
> directly to Redshift would provide some boost.
>
> We had been using 12 m3.xlarges, but increasing default parallelism (to 2x
> # of cpus across cluster) and increasing partitions during reading did not
> seem to help.
>
> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng 
> wrote:
>
>> Michael is correct. Using a direct connection to dump data would be slow
>> because there is only a single connection. Please use UNLOAD with ESCAPE
>> option to dump the table to S3. See instructions at
>>
>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>>
>> And then load them back using the redshift input format we wrote:
>> https://github.com/databricks/spark-redshift (we moved the
>> implementation to github/databricks). Right now all columns are loaded as
>> string columns, and you need to do type casting manually. We plan to add a
>> parser that can translate Redshift table schema directly to Spark SQL
>> schema, but no ETA yet.
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust 
>> wrote:
>>
>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>> command used to produce the data.  Xiangrui can correct me if I'm wrong
>> though.
>>
>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf 
>> wrote:
>>
>>> We have a bunch of data in RedShift tables that we'd like to pull in
>>> during job runs to Spark.  What is the path/url format one uses to pull
>>> data from there?  (This is in reference to using the
>>> https://github.com/mengxr/redshift-input-format)
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
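
For reference, the storage-level change suggested above is a one-liner; with it,
partitions that do not fit in memory spill to local disk instead of being dropped
and recomputed. The path here is made up, and sc is the shell's SparkContext:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("s3n://my-bucket/unloaded-table/")

// Keep what fits in RAM, spill the rest to disk rather than recomputing it.
lines.persist(StorageLevel.MEMORY_AND_DISK)

lines.count()   // the first action materializes the cache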


Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
Hmm, we actually read the CSV data in S3 now and were looking to avoid
that.  Unfortunately, we've experienced dreadful performance reading 100GB
of text data for a job directly from S3 - our hope had been that connecting
directly to Redshift would provide some boost.

We had been using 12 m3.xlarges, but increasing default parallelism (to 2x
# of cpus across cluster) and increasing partitions during reading did not
seem to help.

On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng  wrote:

> Michael is correct. Using a direct connection to dump data would be slow
> because there is only a single connection. Please use UNLOAD with ESCAPE
> option to dump the table to S3. See instructions at
>
> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>
> And then load them back using the redshift input format we wrote:
> https://github.com/databricks/spark-redshift (we moved the implementation
> to github/databricks). Right now all columns are loaded as string columns,
> and you need to do type casting manually. We plan to add a parser that can
> translate Redshift table schema directly to Spark SQL schema, but no ETA
> yet.
>
> -Xiangrui
>
> On Nov 14, 2014, at 3:46 PM, Michael Armbrust 
> wrote:
>
> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
> command used to produce the data.  Xiangrui can correct me if I'm wrong
> though.
>
> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf 
> wrote:
>
>> We have a bunch of data in RedShift tables that we'd like to pull in
>> during job runs to Spark.  What is the path/url format one uses to pull
>> data from there?  (This is in reference to using the
>> https://github.com/mengxr/redshift-input-format)
>>
>>
>>
>>
>>
>
>


Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
We have a bunch of data in RedShift tables that we'd like to pull in during
job runs to Spark.  What is the path/url format one uses to pull data from
there?  (This is in reference to using the
https://github.com/mengxr/redshift-input-format)


Short Circuit Local Reads

2014-09-17 Thread Gary Malouf
Cloudera had a blog post about this in August 2013:
http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/

Has anyone been using this in production?  Curious as to whether it made a
significant difference from a Spark perspective.


Dealing with Time Series Data

2014-09-15 Thread Gary Malouf
I have a use case for our data in HDFS that involves sorting chunks of data
into time series format by a specific characteristic and doing computations
from that.  At large scale, what is the most efficient way to do this?
Obviously, having the data sharded by that characteristic would make the
performance significantly better, but are there good tools within Spark that
can help us?
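
One rough sketch of the shape of the computation (keying by the characteristic and
sorting each group locally); the record layout is hypothetical, and groupByKey
assumes each group fits in an executor's memory:

import org.apache.spark.SparkContext._

case class Event(deviceId: String, timestamp: Long, value: Double)

// In practice this comes from HDFS; a tiny in-memory stand-in for the sketch.
val events = sc.parallelize(Seq(
  Event("a", 3L, 1.0), Event("a", 1L, 2.0), Event("b", 2L, 3.0)))

val series = events
  .map(e => e.deviceId -> e)
  .groupByKey()                               // one group per characteristic
  .mapValues(_.toSeq.sortBy(_.timestamp))     // time-ordered series per key

// series: RDD[(String, Seq[Event])], ready for per-device windowed computations.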


Re: ReduceByKey performance optimisation

2014-09-13 Thread Gary Malouf
You need something like:

val x: RDD[MyAwesomeObject]

x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l }

Does that make sense?


On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme 
wrote:

> I need to remove objects with a duplicate key, but I need the whole object.
> Objects which have the same key are not necessarily equal, though (but I can
> drop any of the ones that have an identical key).
>
> 2014-09-13 12:50 GMT+02:00 Sean Owen :
>
>> If you are just looking for distinct keys, .keys.distinct() should be
>> much better.
>>
>> On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme 
>> wrote:
>> > Hello,
>> >
>> > I am facing performance issues with reduceByKey. I know that this topic has
>> > already been covered but I did not really find answers to my question.
>> >
>> > I am using reduceByKey to remove entries with identical keys, using, as
>> > reduce function, (a,b) => a. It seems to be a relatively straightforward use
>> > of reduceByKey, but performance on moderately big RDDs (some tens of
>> > millions of lines) is very low, far from what you can reach with mono-server
>> > computing packages like R for example.
>> >
>> > I have read on other threads on the topic that reduceByKey always entirely
>> > shuffles the whole data. Is that true? So it means that a custom
>> > partitioning could not help, right? In my case, I could relatively easily
>> > guarantee that two identical keys would always be on the same partition,
>> > therefore an option could be to use mapPartitions and reimplement reduce
>> > locally, but I would like to know if there are simpler / more elegant
>> > alternatives.
>> >
>> > Thanks for your help,
>>
>
>
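
For the mapPartitions route mentioned in the quoted message (only a correct global
dedup if identical keys are guaranteed to land in the same partition), a rough
sketch reusing the x and fieldtobekey names from above, with the key type assumed
to be String:

import scala.collection.mutable

val deduped = x.mapPartitions { iter =>
  // Keep the first object seen for each key within this partition.
  val seen = mutable.HashSet.empty[String]
  iter.filter(obj => seen.add(obj.fieldtobekey))
}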


Dealing with Idle shells

2014-08-14 Thread Gary Malouf
We have our quantitative team using Spark as part of their daily work.  One
of the more common problems we run into is that people unintentionally
leave their shells open throughout the day.  This eats up memory in the
cluster and causes others to have limited resources to run their jobs.

With something like Hive or many client applications for SQL databases,
this is not really an issue but with Spark it's a significant inconvenience
to non-technical users.  Someone ends up having to post throughout the day
in chats to check whether people are still using their shells or to tell them
to 'get off the cluster'.

Just wondering if anyone else has experienced this type of issue and how
they are managing it.  One idea we've had is to implement an 'idle timeout'
monitor for the shell, though on the surface this appears quite
challenging.


DistCP - Spark-based

2014-08-12 Thread Gary Malouf
We are probably still the minority, but our analytics platform based on
Spark + HDFS does not have map/reduce installed.  I'm wondering if there is
a distcp equivalent that leverages Spark to do the work.

Our team is trying to find the best way to do cross-datacenter replication
of our HDFS data to minimize the impact of outages/dc failure.
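
Not aware of an off-the-shelf tool, but a bare-bones sketch of what a Spark-driven
copy could look like, calling the Hadoop FileSystem API from each task; the
namenode URIs are made up, and unlike real distcp there is no throttling, retry or
checksum logic:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val srcRoot = "hdfs://dc1-nn:8020/data"
val dstRoot = "hdfs://dc2-nn:8020/data"

// List the source files on the driver.
val driverConf = new Configuration()
val srcFs = FileSystem.get(URI.create(srcRoot), driverConf)
val files = srcFs.listStatus(new Path(srcRoot)).filter(_.isFile).map(_.getPath.toString)

// One file per task; each task copies its file between the two clusters.
sc.parallelize(files, files.length max 1).foreach { f =>
  val conf = new Configuration()
  val src = new Path(f)
  val dst = new Path(f.replace(srcRoot, dstRoot))
  FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst,
    false /* deleteSource */, conf)
}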


Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Also, regarding something like redshift not having MLlib built in, much of
that could be done on the derived results.
On Aug 6, 2014 4:07 PM, "Nicholas Chammas" 
wrote:

> On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)<
> r.dan...@elsevier.com> wrote:
>
>> Mostly I was just objecting to " Redshift does very well, but Shark is
>> on par or better than it in most of the tests " when that was not how I
>> read the results, and Redshift was on HDDs.
>
>
> My bad. You are correct; the only test Shark (mem) does better on is test
> #1 "Scan Query".
>
> And indeed, it would be good to see an updated benchmark with Redshift
> running on SSDs.
>
> Nick
>


Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Forgot to cc the mailing list :)


On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

>  Agreed. Being able to use SQL to make a table, pass it to a graph
> algorithm, pass that output to a machine learning algorithm, being able to
> invoke user defined python functions, … are capabilities that far exceed
> what we can do with Redshift. The total performance will be much better,
> and the programmer productivity will be much better, even if the SQL
> portion is not quite as fast.  Mostly I was just objecting to " Redshift
> does very well, but Shark is on par or better than it in most of the tests
> " when that was not how I read the results, and Redshift was on HDDs.
>
>
>
> BTW – What are you doing w/ Spark? We have a lot of text and other content
> that we want to mine, and are shifting onto Spark so we have the greater
> capabilities mentioned above.
>
>
>
>
>
> Best regards,
>
>
>
> Ron Daniel, Jr.
>
> Director, Elsevier Labs
>
> r.dan...@elsevier.com
>
> mobile: +1 619 208 3064
>
>
>
>
>
>
>
> *From:* Gary Malouf [mailto:malouf.g...@gmail.com]
> *Sent:* Wednesday, August 06, 2014 12:35 PM
> *To:* Daniel, Ronald (ELS-SDG)
>
> *Subject:* Re: Regarding tooling/performance vs RedShift
>
>
>
> Hi Ronald,
>
>
>
> In my opinion, the performance just has to be 'close' to make that piece
> irrelevant.  I think the real issue comes down to tooling and the ease of
> connecting their various python tools from the office to results coming out
> of Spark/other solution in 'the cloud'.
>
>
>
>
>
> On Wed, Aug 6, 2014 at 1:43 PM, Daniel, Ronald (ELS-SDG) <
> r.dan...@elsevier.com> wrote:
>
> Just to point out that the benchmark you point to has Redshift running on
> HDD machines instead of SSD, and it is still faster than Shark in all but
> one case.
>
>
>
> Like Gary, I'm also interested in replacing something we have on Redshift
> with Spark SQL, as it will give me much greater capability to process
> things. I'm willing to sacrifice some performance for the greater
> capability. But it would be nice to see the benchmark updated with Spark
> SQL, and with a more competitive configuration of Redshift.
>
>
>
> Best regards, and keep up the great work!
>
>
>
> Ron
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Wednesday, August 06, 2014 9:30 AM
> *To:* Gary Malouf
> *Cc:* user
>
>
> *Subject:* Re: Regarding tooling/performance vs RedShift
>
>
>
> 1) We get tooling out of the box from RedShift (specifically, stable JDBC
> access) - Spark we often are waiting for devops to get the right combo of
> tools working or for libraries to support sequence files.
>
>
>
> The arguments about JDBC access and simpler setup definitely make sense.
> My first non-trivial Spark application was actually an ETL process that
> sliced and diced JSON + tabular data and then loaded it into Redshift. From
> there on you got all the benefits of your average C-store database, plus
> the added benefit of Amazon managing many annoying setup and admin details
> for your Redshift cluster.
>
>
>
> One area I'm looking forward to seeing Spark SQL excel at is offering fast
> JDBC access to "raw" data--i.e. directly against S3 / HDFS; no ETL
> required. For easy and flexible data exploration, I don't think you can
> beat that with a C-store that you have to ETL stuff into.
>
>
>
> 2) There is a belief that for many of our queries (assumed to often be
> joins) a columnar database will perform orders of magnitude better.
>
>
>
> This is definitely a "it depends" statement, but there is a detailed
> benchmark here <https://amplab.cs.berkeley.edu/benchmark/> comparing
> Shark, Redshift, and other systems. Have you seen it? Redshift does very
> well, but Shark is on par or better than it in most of the tests. Of
> course, going forward we'll want to see Spark SQL match this kind of
> performance, and that remains to be seen.
>
>
>
> Nick
>
>
>
>
>
> On Wed, Aug 6, 2014 at 12:06 PM, Gary Malouf 
> wrote:
>
> My company is leaning towards moving much of their analytics work from our
> own Spark/Mesos/HDFS/Cassandra set up to RedShift.  To date, I have been
> the internal advocate for using Spark for analytics, but a number of good
> points have been brought up to me.  The reasons being pushed are:
>
>
>
> - RedShift exposes a jdbc interface out of the box (no devops work there)
> and data looks and feels like it is in a normal sql database.  They want
> this out of the box from Spark.

Spark memory management

2014-08-06 Thread Gary Malouf
I have a few questions about managing Spark memory:

1) In a standalone setup, is there any CPU prioritization across users
running jobs?  If so, what is the behavior here?

2) With Spark 1.1, users will more easily be able to run drivers/shells
from remote locations that do not cause firewall headaches.  Is there a way
to kill an individual user's job from the console without killing workers?
We are on Mesos and are not aware of an easy way to handle this, but I
imagine standalone mode may handle this.


Re: Runnning a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
This will be awesome - it's been one of the major issues for our analytics
team as they hope to use their own python libraries.


On Wed, Aug 6, 2014 at 2:40 PM, Andrew Or  wrote:

> Hi Gary,
>
> This has indeed been a limitation of Spark, in that drivers and executors
> use random ephemeral ports to talk to each other. If you are submitting a
> Spark job from your local machine in client mode (meaning, the driver runs
> on your machine), you will need to open up all TCP ports from your worker
> machines, a requirement that is not super secure. However, a very recent
> commit changes this (
> https://github.com/apache/spark/commit/09f7e4587bbdf74207d2629e8c1314f93d865999)
> in that you can now manually configure all ports and only open up the ones
> you configured. This will be available in Spark 1.1.
>
> -Andrew
>
>
> 2014-08-06 8:29 GMT-07:00 Gary Malouf :
>
> We have Spark 1.0.1 on Mesos deployed as a cluster in EC2.  Our Devops
>> lead tells me that Spark jobs can not be submitted from local machines due
>> to the complexity of opening the right ports to the world etc.
>>
>> Are other people running the shell locally in a production environment?
>>
>
>
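
Once that change is in (Spark 1.1), the driver-side ports can be pinned so the EC2
security group only needs a handful of inbound rules. A sketch of what that might
look like; the port numbers are arbitrary and the exact property names should be
double-checked against the 1.1 configuration docs:

import org.apache.spark.{SparkConf, SparkContext}

// Pin the ports executors connect back to, instead of random ephemeral ones.
val conf = new SparkConf()
  .setAppName("remote-shell-job")
  .set("spark.driver.port", "7001")
  .set("spark.fileserver.port", "7002")
  .set("spark.broadcast.port", "7003")
  .set("spark.blockManager.port", "7004")

val sc = new SparkContext(conf)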


Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
My company is leaning towards moving much of their analytics work from our
own Spark/Mesos/HDFS/Cassandra set up to RedShift.  To date, I have been
the internal advocate for using Spark for analytics, but a number of good
points have been brought up to me.  The reasons being pushed are:

- RedShift exposes a jdbc interface out of the box (no devops work there)
and data looks and feels like it is in a normal sql database.  They want
this out of the box from Spark, no trying to figure out which version
matches this version of Hive/Shark/SparkSQL etc.  Yes, the next release
theoretically supports this but there have been release issues our team has
battled to date that erode the trust.

- Complaints around challenges we have faced running a spark shell locally
against a cluster in EC2.  It is partly a devops issue of deploying the
correct configurations to local machines, being able to kick a user off
hogging RAM, etc.

- "I want to be able to run queries from my python shell against your
sequence file data, roll it up and in the same shell leverage python graph
tools."  - I'm not very familiar with the Python setup, but I believe by
being able to run locally AND somehow add custom libraries to be accessed
from PySpark this could be done.

- "Joins will perform much better (in RedShift) because it says it sorts
its keys.  We cannot pre-compute all joins away."


Basically, their argument is two-fold:

1) We get tooling out of the box from RedShift (specifically, stable JDBC
access) - Spark we often are waiting for devops to get the right combo of
tools working or for libraries to support sequence files.

2) There is a belief that for many of our queries (assumed to often be
joins) a columnar database will perform orders of magnitude better.



Anyway, a test is being set up to compare the two on the performance side
but from a tools perspective it's hard to counter the issues that are
brought up.


Runnning a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
We have Spark 1.0.1 on Mesos deployed as a cluster in EC2.  Our Devops lead
tells me that Spark jobs can not be submitted from local machines due to
the complexity of opening the right ports to the world etc.

Are other people running the shell locally in a production environment?


Re: Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
Maybe this is me misunderstanding the Spark system property behavior, but
I'm not clear why the class being loaded ends up having '/' rather than '.'
in its fully qualified name.  When I tested this out locally, the '/' were
preventing the class from being loaded.


On Fri, Jul 25, 2014 at 2:27 PM, Gary Malouf  wrote:

> After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
> well.  Looking at the Mesos slave logs, I noticed:
>
> ERROR KryoSerializer: Failed to run spark.kryo.registrator
> java.lang.ClassNotFoundException:
> com/mediacrossing/verrazano/kryo/MxDataRegistrator
>
> My spark-env.sh has the following when I run the Spark Shell:
>
> export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
>
> export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters
>
> export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar
>
>
> # -XX:+UseCompressedOops must be disabled to use more than 32GB RAM
>
> SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops
> -Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g
>  -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
> -Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator
> -Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"
>
>
> I was able to verify that our custom jar was being copied to each worker,
> but for some reason it is not finding my registrator class.  Is anyone else
> struggling with Kryo on 1.0.x branch?
>
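
For anyone hitting the same thing, the registrator itself is just a KryoRegistrator
implementation, and the value of spark.kryo.registrator must be the dotted class
name; a sketch, with placeholder classes registered:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MxDataRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register your protobuf/domain classes here.
    kryo.register(classOf[Array[Byte]])
  }
}

// And in spark-env.sh / SparkConf, a dotted (not slash-separated) name:
// -Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator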


Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
well.  Looking at the Mesos slave logs, I noticed:

ERROR KryoSerializer: Failed to run spark.kryo.registrator
java.lang.ClassNotFoundException:
com/mediacrossing/verrazano/kryo/MxDataRegistrator

My spark-env.sh has the following when I run the Spark Shell:

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters

export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar


# -XX:+UseCompressedOops must be disabled to use more than 32GB RAM

SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops
-Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g
 -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
-Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator
-Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"


I was able to verify that our custom jar was being copied to each worker,
but for some reason it is not finding my registrator class.  Is anyone else
struggling with Kryo on 1.0.x branch?


Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Gary Malouf
I am aware that today PySpark can not load sequence files directly.  Are
there work-arounds people are using (short of duplicating all the data to
text files) for accessing this data?
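
One workaround is a small Scala pass that rewrites the sequence files into something
PySpark can read with sc.textFile; a rough sketch assuming Text keys and values
(protobuf-in-BytesWritable payloads would need a decode step first), with made-up
paths:

import org.apache.hadoop.io.Text

// Adjust the classes and the decode step to match the real payload.
val seq = sc.sequenceFile("hdfs:///data/events", classOf[Text], classOf[Text])

seq.map { case (k, v) => k.toString + "\t" + v.toString }
   .saveAsTextFile("hdfs:///data/events_tsv")

// From PySpark: sc.textFile("hdfs:///data/events_tsv").map(lambda l: l.split("\t"))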


SparkSQL with sequence file RDDs

2014-07-07 Thread Gary Malouf
Has anyone reported issues using SparkSQL with sequence files (all of our
data is in this format within HDFS)?  We are considering whether to burn
the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
point for us.
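
In case it helps the decision: Spark SQL in 1.0 works over any RDD of case classes,
so a sequence-file RDD just needs a map into a case class before registering it as
a table. A sketch with a made-up schema, assuming delimited Text values:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SQLContext

case class Impression(userId: String, ts: Long, cost: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion (Spark 1.0)

val impressions = sc
  .sequenceFile("hdfs:///data/impressions", classOf[LongWritable], classOf[Text])
  .map(_._2.toString.split('\t'))
  .map(a => Impression(a(0), a(1).toLong, a(2).toDouble))

impressions.registerAsTable("impressions")
sqlContext.sql("SELECT userId, SUM(cost) FROM impressions GROUP BY userId").collect()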


Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Gary Malouf
Go to Expedia/Orbitz and look for hotels in the Union Square neighborhood.
In my humble opinion, having visited San Francisco, it is worth any extra
cost to be as close as possible to the conference vs. having to travel from
other parts of the city.


On Tue, May 27, 2014 at 9:36 AM, Gerard Maas  wrote:

> +1
>
>
> On Tue, May 27, 2014 at 3:22 PM, Pierre B <
> pierre.borckm...@realimpactanalytics.com> wrote:
>
>> Hi everyone!
>>
>> Any recommendation anyone?
>>
>>
>> Pierre
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Summit-2014-Hotel-suggestions-tp5457p6424.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: is Mesos falling out of favor?

2014-05-11 Thread Gary Malouf
For what it is worth, our team here at MediaCrossing has been using the
Spark/Mesos combination since last summer with much success
(low operations overhead, high developer performance).

IMO, Hadoop is overcomplicated from both a development and operations
perspective so I am looking to lower our dependencies on it, not increase
them.  Our stack currently includes:


   - Spark 0.9.1
   - Mesos 0.17
   - Chronos
   - HDFS (CDH 5.0-mr1)
   - Flume 1.4.0
   - ZooKeeper
   - Cassandra 2.0 (key-value store alternative to HBase)
   - Storm 0.9 (we prefer today to Spark Streaming)

We've used Shark in the past as well, but since most of us prefer the Spark
Shell we have not been maintaining it.

Using Mesos to run Spark allows for us to optimize our available resources
(CPU + RAM currently) between Spark, Chronos and a number of other
services.  I see YARN as being heavily focused on MR2, but the reality is
we are using Spark in large part because writing MapReduce jobs is verbose,
hard to maintain and not performant (compared to Spark).  We have the advantage
of not having any real legacy Map/Reduce jobs to maintain, so that
consideration does not come into play.

Finally, I am a believer that for the long term direction of our company,
the Berkeley stack will serve us
best.  Leveraging Mesos and Spark from the onset paves the way for this.


On Sun, May 11, 2014 at 1:28 PM, Paco Nathan  wrote:

> That's FUD. Tracking the Mesos and Spark use cases, there are very large
> production deployments of these together. Some are rather private but
> others are being surfaced. IMHO, one of the most amazing case studies is
> from Christina Delimitrou http://youtu.be/YpmElyi94AA
>
> For a tutorial, use the following but upgrade it to latest production for
> Spark. There was a related O'Reilly webcast and Strata tutorial as well:
> http://mesosphere.io/learn/run-spark-on-mesos/
>
> FWIW, I teach "Intro to Spark" with sections on CM4, YARN, Mesos, etc.
> Based on lots of student experiences, Mesos is clearly the shortest path to
> deploying a Spark cluster if you want to leverage the robustness,
> multi-tenancy for mixed workloads, less ops overhead, etc., that show up
> repeatedly in the use case analyses.
>
> My opinion only and not that of any of my clients: "Don't believe the FUD
> from YHOO unless you really want to be stuck in 2009."
>
>
> On Wed, May 7, 2014 at 8:30 AM, deric  wrote:
>
>> I'm also using right now SPARK_EXECUTOR_URI, though I would prefer
>> distributing Spark as a binary package.
>>
>> For running examples with `./bin/run-example ...` it works fine, however
>> tasks from spark-shell are getting lost.
>>
>> Error: Could not find or load main class
>> org.apache.spark.executor.MesosExecutorBackend
>>
>> which looks more like a problem with sbin/spark-executor and missing paths
>> to the jar. Has anyone encountered this error before?
>>
>> I guess Yahoo invested quite a lot of effort into YARN and Spark integration
>> (moreover, with Mahout migrating to Spark there's much more interest in
>> Hadoop and Spark integration). If there were some "Mesos company"
>> working on Spark-Mesos integration it could be at least on the same
>> level.
>>
>> I don't see any other reason why YARN would be better than Mesos;
>> personally I like the latter, however I haven't checked YARN for a while;
>> maybe they've made significant progress. I think Mesos is more universal
>> and flexible than YARN.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/is-Mesos-falling-out-of-favor-tp5444p5481.html
>>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


SparkR with Sequence Files

2014-04-10 Thread Gary Malouf
Has anyone been using SparkR to work with data from sequence files?  We use
protobuf throughout our system and are considering whether to try out
SparkR.


Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-25 Thread Gary Malouf
Today, our cluster setup is as follows:

Mesos 0.15,
CDH 4.2.1-MRV1,
Spark 0.9-pre-scala-2.10 off master build targeted at appropriate CDH4
version


We are looking to upgrade all of these in order to get protobuf 2.5 working
properly.  The question is, which 'Hadoop version build' of Spark 0.9 is
compatible with the HDFS from Hadoop 2.2 and Cloudera's CDH5 MRV1
installation?  Is there one?


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Gary Malouf
Can anyone verify the claims from Aureliano regarding the Akka dependency
protobuf collision?  Our team has a major need to upgrade to protobuf 2.5.0
up the pipe and Spark seems to be the blocker here.


On Fri, Mar 21, 2014 at 6:49 PM, Aureliano Buendia wrote:

>
>
>
> On Tue, Mar 18, 2014 at 12:56 PM, Ognen Duzlevski <
> og...@plainvanillagames.com> wrote:
>
>>
>> On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote:
>>
>>> On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia  wrote:
>>>
 Is there a reason for spark using the older akka?




 On Sun, Mar 2, 2014 at 1:53 PM, 1esha  wrote:

 The problem is in akka remote. It contains files compiled with 2.4.*. When
 you run it with 2.5.* in the classpath it fails like above.



 Looks like moving to akka 2.3 will solve this issue. Check this issue -

 https://www.assembla.com/spaces/akka/tickets/3154-use-protobuf-version-2-5-0#/activity/ticket:


 Is the solution to exclude the 2.4.* dependency on protobuf, or will
 this produce more complications?

>>> I am not sure I remember what the context was around this but I run
>> 0.9.0 with hadoop 2.2.0 just fine.
>>
>
> The problem is that spark depends on an older version of akka, which
> depends on an older version of protobuf (2.4).
>
> This means people cannot use protobuf 2.5 with spark.
>
>
>> Ognen
>>
>
>
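
On the exclusion question raised up-thread: in sbt it would look like the snippet
below, but since akka-remote itself is compiled against protobuf 2.4, excluding the
transitive dependency tends to move the breakage rather than fix it (versions shown
are illustrative):

// build.sbt
libraryDependencies += ("org.apache.spark" %% "spark-core" % "0.9.0-incubating")
  .exclude("com.google.protobuf", "protobuf-java")

libraryDependencies += "com.google.protobuf" % "protobuf-java" % "2.5.0"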