Re: more uniform exception handling?

2016-04-18 Thread Evan Chan
+1000.

Especially if the UI can help correlate exceptions, and if we can reduce
the number of exceptions in the first place.

Some exceptions, such as the nasty ClassNotFoundException, are very
common in practice, and most folks end up spending tons of time
debugging them.


On Mon, Apr 18, 2016 at 12:16 PM, Reynold Xin  wrote:
> Josh's pull request on rpc exception handling got me to think ...
>
> In my experience, a few things related to exceptions have created a lot of
> trouble for us in production debugging:
>
> 1. Some exception is thrown, but is caught by some try/catch that does not
> do any logging nor rethrow.
> 2. Some exception is thrown, but is caught by some try/catch that does not
> do any logging, but does rethrow. However, the original exception is now
> masked.
> 3. Multiple exceptions are logged at different places close to each other,
> but we don't know whether they are caused by the same problem or not.
>
>
> To mitigate some of the above, here's an idea ...
>
> (1) Create a common root class (e.g. call it SparkException) for all the
> exceptions used in Spark. We should make sure that every time we catch an
> exception from a 3rd-party library, we rethrow it as a SparkException (a lot
> of places already do that). In SparkException's constructor, log the
> exception and the stack trace.
>
> (2) SparkException has a monotonically increasing ID, and this ID appears in
> the exception error message (say at the end).
>
>
> I think (1) will eliminate most of the cases in which an exception gets
> swallowed. The main downside I can think of is that we might log an
> exception multiple times. However, I'd argue exceptions should be rare, and
> it is not that big of a deal to log them two or three times. The unique ID
> (2) can help us correlate exceptions when they appear multiple times.
>
> Thoughts?
>
>
>
>
>
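
A minimal sketch of what (1) and (2) could look like, assuming a Scala
exception class and an AtomicLong counter (the names and logging setup here
are illustrative, not Spark's actual implementation):

import java.util.concurrent.atomic.AtomicLong
import org.slf4j.LoggerFactory

object SparkExceptionId {
  // Monotonically increasing per JVM; gives every exception a unique ID.
  private val counter = new AtomicLong(0L)
  def next(): Long = counter.incrementAndGet()
}

class SparkException(message: String, cause: Throwable = null)
    extends Exception(message, cause) {
  val id: Long = SparkExceptionId.next()

  // Log the exception (with stack trace) at construction time, so it is
  // recorded even if a caller later swallows it.
  LoggerFactory.getLogger(classOf[SparkException])
    .error(s"SparkException id=$id: $message", this)

  // Surface the ID in the error message so repeated log entries can be
  // correlated back to the same failure.
  override def getMessage: String = s"${super.getMessage} (exception id=$id)"
}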

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using local-cluster mode for testing Spark-related projects

2016-04-17 Thread Evan Chan
Jon, thanks. I think I've figured it out, actually. It's really
simple: one needs to set spark.executor.extraClassPath to the current
value of the Java class path (the java.class.path system property).
One also needs to avoid HiveContext, which gives errors about
initializing a Derby database multiple times.
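
For reference, here is a minimal sketch of that setup in a test (assuming
SPARK_HOME and SPARK_SCALA_VERSION are set as discussed in the original
thread; the master string and memory size are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Point the executors at the same classpath as the test JVM, so the
// executor processes can find the project's classes and dependencies.
val conf = new SparkConf()
  .setMaster("local-cluster[2,1,1024]")  // 2 workers, 1 core, 1024 MB each
  .setAppName("local-cluster-test")
  .set("spark.executor.extraClassPath", sys.props("java.class.path"))

val sc = new SparkContext(conf)  // plain SQLContext is fine; avoid HiveContext
try {
  assert(sc.parallelize(1 to 100, 4).sum() == 5050)
} finally {
  sc.stop()
}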

On Sun, Apr 17, 2016 at 9:51 AM, Jon Maurer  wrote:
> Take a look at spark testing base.
> https://github.com/holdenk/spark-testing-base/blob/master/README.md
>
> On Apr 17, 2016 10:28 AM, "Evan Chan"  wrote:
>>
>> What I want to find out is how to run tests like Spark's with
>> local-cluster, just like that suite, but in your own projects.   Has
>> anyone done this?
>>
>> On Sun, Apr 17, 2016 at 5:37 AM, Takeshi Yamamuro 
>> wrote:
>> > Hi,
>> > Is this a bad idea to create `SparkContext` with a `local-cluster` mode
>> > by
>> > yourself like
>> >
>> > 'https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/ShuffleSuite.scala#L55'?
>> >
>> > // maropu
>> >
>> > On Sun, Apr 17, 2016 at 9:47 AM, Evan Chan 
>> > wrote:
>> >>
>> >> Hey folks,
>> >>
>> >> I'd like to use local-cluster mode in my Spark-related projects to
>> >> test Spark functionality in an automated way in a simulated local
>> >> cluster.The idea is to test multi-process things in a much easier
>> >> fashion than setting up a real cluster.   However, getting this up and
>> >> running in a separate project (I'm using Scala 2.10 and ScalaTest) is
>> >> nontrivial.   Does anyone have any suggestions to get up and running?
>> >>
>> >> This is what I've observed so far (I'm testing against 1.5.1, but
>> >> suspect this would apply equally to 1.6.x):
>> >>
>> >> - One needs to have a real Spark distro and point to it using
>> >> SPARK_HOME
>> >> - SPARK_SCALA_VERSION needs to be set
>> >> - One needs to manually inject jar paths, otherwise dependencies are
>> >> missing.  For example, build an assembly jar of all your deps.  Java
>> >> class directory hierarchies don't seem to work with the setJars(...).
>> >>
>> >> How does Spark's internal scripts make it possible to run
>> >> local-cluster mode and set up all the class paths correctly?   And, is
>> >> it possible to mimic this setup for external Spark projects?
>> >>
>> >> thanks,
>> >> Evan
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > ---
>> > Takeshi Yamamuro
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using local-cluster mode for testing Spark-related projects

2016-04-17 Thread Evan Chan
What I want to find out is how to run tests like Spark's with
local-cluster, just like that suite, but in your own projects.   Has
anyone done this?

On Sun, Apr 17, 2016 at 5:37 AM, Takeshi Yamamuro  wrote:
> Hi,
> Is this a bad idea to create `SparkContext` with a `local-cluster` mode by
> yourself like
> 'https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/ShuffleSuite.scala#L55'?
>
> // maropu
>
> On Sun, Apr 17, 2016 at 9:47 AM, Evan Chan  wrote:
>>
>> Hey folks,
>>
>> I'd like to use local-cluster mode in my Spark-related projects to
>> test Spark functionality in an automated way in a simulated local
>> cluster.The idea is to test multi-process things in a much easier
>> fashion than setting up a real cluster.   However, getting this up and
>> running in a separate project (I'm using Scala 2.10 and ScalaTest) is
>> nontrivial.   Does anyone have any suggestions to get up and running?
>>
>> This is what I've observed so far (I'm testing against 1.5.1, but
>> suspect this would apply equally to 1.6.x):
>>
>> - One needs to have a real Spark distro and point to it using SPARK_HOME
>> - SPARK_SCALA_VERSION needs to be set
>> - One needs to manually inject jar paths, otherwise dependencies are
>> missing.  For example, build an assembly jar of all your deps.  Java
>> class directory hierarchies don't seem to work with the setJars(...).
>>
>> How does Spark's internal scripts make it possible to run
>> local-cluster mode and set up all the class paths correctly?   And, is
>> it possible to mimic this setup for external Spark projects?
>>
>> thanks,
>> Evan
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
>
> --
> ---
> Takeshi Yamamuro

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Using local-cluster mode for testing Spark-related projects

2016-04-16 Thread Evan Chan
Hey folks,

I'd like to use local-cluster mode in my Spark-related projects to
test Spark functionality in an automated way in a simulated local
cluster.  The idea is to test multi-process behavior in a much easier
fashion than setting up a real cluster.  However, getting this up and
running in a separate project (I'm using Scala 2.10 and ScalaTest) is
nontrivial.  Does anyone have any suggestions for getting up and running?

This is what I've observed so far (I'm testing against 1.5.1, but
suspect this would apply equally to 1.6.x):

- One needs to have a real Spark distro and point to it using SPARK_HOME
- SPARK_SCALA_VERSION needs to be set
- One needs to manually inject jar paths, otherwise dependencies are
missing.  For example, build an assembly jar of all your deps.  Java
class directory hierarchies don't seem to work with setJars(...); see
the sketch below.
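
A rough sketch of that last point, assuming you have built an assembly jar
of your test dependencies (the jar path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Ship a single assembly jar to the executors; directories of .class files
// do not appear to propagate in local-cluster mode.
val conf = new SparkConf()
  .setMaster("local-cluster[2,1,1024]")
  .setAppName("local-cluster-assembly-test")
  .setJars(Seq("target/scala-2.10/my-project-assembly.jar"))  // illustrative path
val sc = new SparkContext(conf)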

How do Spark's internal scripts make it possible to run
local-cluster mode and set up all the class paths correctly?  And is
it possible to mimic this setup for external Spark projects?

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-16 Thread Evan Chan
Hi folks,

Sorry to join the discussion late.  I had a look at the design doc
earlier in this thread, and it does not mention what types of
projects are the targets of this new "Spark Extras" ASF umbrella.

Is the desire to have a maintained set of Spark-related projects that
keep pace with the main Spark development schedule?  Is it just for
streaming connectors?  What about data sources and other important
projects in the Spark ecosystem?

I'm worried that this would relegate spark-packages to third-tier
status, and that promoting a select set of committers, and the
project itself, to top-level ASF status (a la Arrow) would create a
further split in the community.

-Evan

On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran  wrote:
>
>
>
>
>
> On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" 
>  wrote:
>
>>Yeah in support of this statement I think that my primary interest in
>>this Spark Extras and the good work by Luciano here is that anytime we
>>take bits out of a code base and “move it to GitHub” I see a bad precedent
>>being set.
>>
>>Creating this project at the ASF creates a synergy between *Apache Spark*
>>which is *at the ASF*.
>>
>>We welcome comments and as Luciano said, this is meant to invite and be
>>open to those in the Apache Spark PMC to join and help.
>>
>>Cheers,
>>Chris
>
> As one of the people named, here's my rationale:
>
> Throwing stuff into github creates that world of branches, and its no longer 
> something that could be managed through the ASF, where managed is: 
> governance, participation and a release process that includes auditing 
> dependencies, code-signoff, etc,
>
>
> As an example, there's a mutant hive JAR which spark uses, that's something 
> which currently evolved between my repo and Patrick Wendell's; now that Josh 
> Rosen has taken on the bold task of "trying to move spark and twill to Kryo 
> 3", he's going to own that code, and now the reference branch will move 
> somewhere else.
>
> In contrast, if there was an ASF location for this, then it'd be something 
> anyone with commit rights could maintain and publish
>
> (actually, I've just realised life is hard here as the hive is a fork of ASF 
> hive —really the spark branch should be a separate branch in Hive's own repo 
> ... But the concept is the same: those bits of the codebase which are core 
> parts of the spark project should really live in or near it)
>
>
> If everyone on the spark commit list gets write access to this extras repo, 
> moving things is straightforward. Release wise, things could/should be in 
> sync.
>
> If there's a risk, its the eternal problem of the contrib/ dir  Stuff 
> ends up there that never gets maintained. I don't see that being any worse 
> than if things were thrown to the wind of a thousand github repos: at least 
> now there'd be a central issue tracking location.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark Summit CFP - Tracks guidelines

2015-02-04 Thread Evan Chan
Hey guys,

Is there any guidance on what the different tracks for Spark Summit
West mean?  There are some new ones, like "Third Party Apps", which
seems like it would be similar to "Use Cases".  Any further
guidance would be great.

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkSubmit.scala and stderr

2015-02-03 Thread Evan Chan
Why not just use SLF4J?

On Tue, Feb 3, 2015 at 2:22 PM, Reynold Xin  wrote:
> We can use ScalaTest's privateMethodTester also instead of exposing that.
>
> On Tue, Feb 3, 2015 at 2:18 PM, Marcelo Vanzin  wrote:
>
>> Hi Jay,
>>
>> On Tue, Feb 3, 2015 at 6:28 AM, jayhutfles  wrote:
>> > // Exposed for testing
>> > private[spark] var printStream: PrintStream = System.err
>>
>> > But as the comment states that it's for testing, maybe I'm
>> > misunderstanding its intent...
>>
>> The comment is there to tell someone reading the code that this field
>> is a `var` and not private just because test code (SparkSubmitSuite in
>> this case) needs to modify it, otherwise it wouldn't exist or would be
>> private. Similar in spirit to this annotation:
>>
>>
>> http://guava-libraries.googlecode.com/svn/tags/release09/javadoc/com/google/common/annotations/VisibleForTesting.html
>>
>> (Which I'd probably have used in this case, but is not really common
>> in Spark code.)
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
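
To make the pattern concrete, here is a minimal sketch (not the actual
SparkSubmit/SparkSubmitSuite code) of a field exposed for testing and a
test that swaps it out:

import java.io.{ByteArrayOutputStream, PrintStream}

object Submitter {
  // Exposed for testing; in Spark this is scoped private[spark], and
  // production code leaves it pointing at stderr.
  var printStream: PrintStream = System.err

  def fail(msg: String): Unit = printStream.println(s"Error: $msg")
}

// In a test, capture the output instead of writing to stderr:
val buffer = new ByteArrayOutputStream()
Submitter.printStream = new PrintStream(buffer)
Submitter.fail("missing --class")
assert(buffer.toString.contains("missing --class"))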

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Welcoming three new committers

2015-02-03 Thread Evan Chan
Congrats everyone!!!

On Tue, Feb 3, 2015 at 3:17 PM, Timothy Chen  wrote:
> Congrats all!
>
> Tim
>
>
>> On Feb 4, 2015, at 7:10 AM, Pritish Nawlakhe 
>>  wrote:
>>
>> Congrats and welcome back!!
>>
>>
>>
>> Thank you!!
>>
>> Regards
>> Pritish
>> Nirvana International Inc.
>>
>> Big Data, Hadoop, Oracle EBS and IT Solutions
>> VA - SWaM, MD - MBE Certified Company
>> prit...@nirvana-international.com
>> http://www.nirvana-international.com
>> Twitter: @nirvanainternat
>>
>> -Original Message-
>> From: Hari Shreedharan [mailto:hshreedha...@cloudera.com]
>> Sent: Tuesday, February 3, 2015 6:02 PM
>> To: Ted Yu
>> Cc: Nicholas Chammas; dev; Joseph Bradley; Cheng Lian; Matei Zaharia; Sean 
>> Owen
>> Subject: Re: Welcoming three new committers
>>
>> Congrats Cheng, Joseph and Owen! Well done!
>>
>>
>>
>>
>> Thanks, Hari
>>
>>> On Tue, Feb 3, 2015 at 2:55 PM, Ted Yu  wrote:
>>>
>>> Congratulations, Cheng, Joseph and Sean.
>>> On Tue, Feb 3, 2015 at 2:53 PM, Nicholas Chammas
>>> >>> wrote:
 Congratulations guys!

 On Tue Feb 03 2015 at 2:36:12 PM Matei Zaharia
 
 wrote:

> Hi all,
>
> The PMC recently voted to add three new committers: Cheng Lian,
> Joseph Bradley and Sean Owen. All three have been major
> contributors to Spark in the past year: Cheng on Spark SQL, Joseph
> on MLlib, and Sean on ML and
 many
> pieces throughout Spark Core. Join me in welcoming them as committers!
>
> Matei
> ---
> -- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For
> additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: renaming SchemaRDD -> DataFrame

2015-02-01 Thread Evan Chan
It is true that you can persist SchemaRDDs / DataFrames to disk via
Parquet, but a lot of time is lost and inefficiency introduced.  The
in-memory columnar cached representation is completely different from
the Parquet file format, and I believe there has to be a translation
into a Row (because ultimately Spark SQL traverses Rows -- even
InMemoryColumnarTableScan has to convert the columns back into Rows
for row-based processing).  On the other hand, traditional data
frames process in a columnar fashion.  Columnar storage is good, but
nowhere near as good as columnar processing.

Another issue, which I don't know has been solved yet, is that it is
difficult for Tachyon to efficiently cache Parquet files without
understanding the file format itself.

I gave a talk at last year's Spark Summit on this topic.

I'm working on efforts to change this, however.  Shoot me an email at
velvia at gmail if you're interested in joining forces.
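
To illustrate the two code paths being contrasted above, a minimal sketch
using the Spark 1.x SQLContext API (the table and file names are
illustrative, and `sc` is assumed to be an existing SparkContext):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val events = sqlContext.jsonFile("events.json")  // illustrative input
events.registerTempTable("events")

// Path 1: the in-memory columnar cache, i.e. compressed column batches
// that are converted back into Rows whenever a query scans them.
sqlContext.cacheTable("events")

// Path 2: columnar storage on disk via Parquet, which is a different
// format, so the cache and the persisted data do not share a representation.
events.saveAsParquetFile("/tmp/events.parquet")
val reloaded = sqlContext.parquetFile("/tmp/events.parquet")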

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian  wrote:
> Yes, when a DataFrame is cached in memory, it's stored in an efficient
> columnar format. And you can also easily persist it on disk using Parquet,
> which is also columnar.
>
> Cheng
>
>
> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>>
>> to me the word DataFrame does come with certain expectations. one of them
>> is that the data is stored columnar. in R data.frame internally uses a
>> list
>> of sequences i think, but since lists can have labels its more like a
>> SortedMap[String, Array[_]]. this makes certain operations very cheap
>> (such
>> as adding a column).
>>
>> in Spark the closest thing would be a data structure where per Partition
>> the data is also stored columnar. does spark SQL already use something
>> like
>> that? Evan mentioned "Spark SQL columnar compression", which sounds like
>> it. where can i find that?
>>
>> thanks
>>
>> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan 
>> wrote:
>>
>>> +1 having proper NA support is much cleaner than using null, at
>>> least the Java null.
>>>
>>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks 
>>> wrote:
>>>>
>>>> You've got to be a little bit careful here. "NA" in systems like R or
>>>
>>> pandas
>>>>
>>>> may have special meaning that is distinct from "null".
>>>>
>>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>>>>
>>>>
>>>>
>>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin 
>>>
>>> wrote:
>>>>>
>>>>> Isn't that just "null" in SQL?
>>>>>
>>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
>>>>> wrote:
>>>>>
>>>>>> I believe that most DataFrame implementations out there, like Pandas,
>>>>>> supports the idea of missing values / NA, and some support the idea of
>>>>>> Not Meaningful as well.
>>>>>>
>>>>>> Does Row support anything like that?  That is important for certain
>>>>>> applications.  I thought that Row worked by being a mutable object,
>>>>>> but haven't looked into the details in a while.
>>>>>>
>>>>>> -Evan
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
>>>>>> wrote:
>>>>>>>
>>>>>>> It shouldn't change the data source api at all because data sources
>>>>>>
>>>>>> create
>>>>>>>
>>>>>>> RDD[Row], and that gets converted into a DataFrame automatically
>>>>>>
>>>>>> (previously
>>>>>>>
>>>>>>> to SchemaRDD).
>>>>>>>
>>>>>>>
>>>>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>>>>>
>>>>>>> One thing that will break the data source API in 1.3 is the location
>>>>>>> of
>>>>>>> types. Types were previously defined in sql.catalyst.types, and now
>>>>>>
>>>>>> moved to
>>>>>>>
>>>>>>> sql.types. After 1.3, sql.catalyst is hidden from users, and all
>>>>>>> public
>>>>>>
>>>>>> APIs
>>>>>>>
>>>>>>> h

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Evan Chan
+1 having proper NA support is much cleaner than using null, at
least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks  wrote:
> You've got to be a little bit careful here. "NA" in systems like R or pandas
> may have special meaning that is distinct from "null".
>
> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>
>
>
> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:
>>
>> Isn't that just "null" in SQL?
>>
>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
>> wrote:
>>
>> > I believe that most DataFrame implementations out there, like Pandas,
>> > supports the idea of missing values / NA, and some support the idea of
>> > Not Meaningful as well.
>> >
>> > Does Row support anything like that?  That is important for certain
>> > applications.  I thought that Row worked by being a mutable object,
>> > but haven't looked into the details in a while.
>> >
>> > -Evan
>> >
>> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
>> > wrote:
>> > > It shouldn't change the data source api at all because data sources
>> > create
>> > > RDD[Row], and that gets converted into a DataFrame automatically
>> > (previously
>> > > to SchemaRDD).
>> > >
>> > >
>> >
>> > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>> > >
>> > > One thing that will break the data source API in 1.3 is the location
>> > > of
>> > > types. Types were previously defined in sql.catalyst.types, and now
>> > moved to
>> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all
>> > > public
>> > APIs
>> > > have first class classes/objects defined in sql directly.
>> > >
>> > >
>> > >
>> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
>> > wrote:
>> > >>
>> > >> Hey guys,
>> > >>
>> > >> How does this impact the data sources API?  I was planning on using
>> > >> this for a project.
>> > >>
>> > >> +1 that many things from spark-sql / DataFrame is universally
>> > >> desirable and useful.
>> > >>
>> > >> By the way, one thing that prevents the columnar compression stuff in
>> > >> Spark SQL from being more useful is, at least from previous talks
>> > >> with
>> > >> Reynold and Michael et al., that the format was not designed for
>> > >> persistence.
>> > >>
>> > >> I have a new project that aims to change that.  It is a
>> > >> zero-serialisation, high performance binary vector library, designed
>> > >> from the outset to be a persistent storage friendly.  May be one day
>> > >> it can replace the Spark SQL columnar compression.
>> > >>
>> > >> Michael told me this would be a lot of work, and recreates parts of
>> > >> Parquet, but I think it's worth it.  LMK if you'd like more details.
>> > >>
>> > >> -Evan
>> > >>
>> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
>> > wrote:
>> > >> > Alright I have merged the patch (
>> > >> > https://github.com/apache/spark/pull/4173
>> > >> > ) since I don't see any strong opinions against it (as a matter of
>> > fact
>> > >> > most were for it). We can still change it if somebody lays out a
>> > strong
>> > >> > argument.
>> > >> >
>> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > >> > 
>> > >> > wrote:
>> > >> >
>> > >> >> The type alias means your methods can specify either type and they
>> > will
>> > >> >> work. It's just another name for the same type. But Scaladocs and
>> > such
>> > >> >> will
>> > >> >> show DataFrame as the type.
>> > >> >>
>> > >> >> Matei
>> > >> >>
>> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> > >> >> dirceu.semigh...@gmail.com> wrote:
>> > >> >> >
>> > >> >> > Reynold,
>> > >> >> > But with

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
Yeah, it's "null".  I was worried you couldn't represent it in Row
because of primitive types like Int (unless you box the Int, which
would be a performance hit).  Anyways, I'll take another look at the
Row API  :-p
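
For what it's worth, a small sketch of how missing values show up in the
Row API (this assumes the public Row interface around 1.3; the data is
illustrative):

import org.apache.spark.sql.Row

// A row whose second field is missing (SQL NULL).
val row = Row(1, null, "foo")

if (row.isNullAt(1)) {
  // Check isNullAt before calling a primitive getter such as getInt,
  // which cannot return a null Int.
  println("field 1 is NULL")
} else {
  println(s"field 1 = ${row.getInt(1)}")
}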

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:
> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan  wrote:
>>
>> I believe that most DataFrame implementations out there, like Pandas,
>> supports the idea of missing values / NA, and some support the idea of
>> Not Meaningful as well.
>>
>> Does Row support anything like that?  That is important for certain
>> applications.  I thought that Row worked by being a mutable object,
>> but haven't looked into the details in a while.
>>
>> -Evan
>>
>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin  wrote:
>> > It shouldn't change the data source api at all because data sources
>> > create
>> > RDD[Row], and that gets converted into a DataFrame automatically
>> > (previously
>> > to SchemaRDD).
>> >
>> >
>> > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>> >
>> > One thing that will break the data source API in 1.3 is the location of
>> > types. Types were previously defined in sql.catalyst.types, and now
>> > moved to
>> > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
>> > APIs
>> > have first class classes/objects defined in sql directly.
>> >
>> >
>> >
>> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
>> > wrote:
>> >>
>> >> Hey guys,
>> >>
>> >> How does this impact the data sources API?  I was planning on using
>> >> this for a project.
>> >>
>> >> +1 that many things from spark-sql / DataFrame is universally
>> >> desirable and useful.
>> >>
>> >> By the way, one thing that prevents the columnar compression stuff in
>> >> Spark SQL from being more useful is, at least from previous talks with
>> >> Reynold and Michael et al., that the format was not designed for
>> >> persistence.
>> >>
>> >> I have a new project that aims to change that.  It is a
>> >> zero-serialisation, high performance binary vector library, designed
>> >> from the outset to be a persistent storage friendly.  May be one day
>> >> it can replace the Spark SQL columnar compression.
>> >>
>> >> Michael told me this would be a lot of work, and recreates parts of
>> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
>> >>
>> >> -Evan
>> >>
>> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
>> >> wrote:
>> >> > Alright I have merged the patch (
>> >> > https://github.com/apache/spark/pull/4173
>> >> > ) since I don't see any strong opinions against it (as a matter of
>> >> > fact
>> >> > most were for it). We can still change it if somebody lays out a
>> >> > strong
>> >> > argument.
>> >> >
>> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> >> > 
>> >> > wrote:
>> >> >
>> >> >> The type alias means your methods can specify either type and they
>> >> >> will
>> >> >> work. It's just another name for the same type. But Scaladocs and
>> >> >> such
>> >> >> will
>> >> >> show DataFrame as the type.
>> >> >>
>> >> >> Matei
>> >> >>
>> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> >> dirceu.semigh...@gmail.com> wrote:
>> >> >> >
>> >> >> > Reynold,
>> >> >> > But with type alias we will have the same problem, right?
>> >> >> > If the methods doesn't receive schemardd anymore, we will have to
>> >> >> > change
>> >> >> > our code to migrade from schema to dataframe. Unless we have an
>> >> >> > implicit
>> >> >> > conversion between DataFrame and SchemaRDD
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xi

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
I believe that most DataFrame implementations out there, like Pandas,
support the idea of missing values / NA, and some support the idea of
Not Meaningful as well.

Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin  wrote:
> It shouldn't change the data source api at all because data sources create
> RDD[Row], and that gets converted into a DataFrame automatically (previously
> to SchemaRDD).
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
> One thing that will break the data source API in 1.3 is the location of
> types. Types were previously defined in sql.catalyst.types, and now moved to
> sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs
> have first class classes/objects defined in sql directly.
>
>
>
> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan  wrote:
>>
>> Hey guys,
>>
>> How does this impact the data sources API?  I was planning on using
>> this for a project.
>>
>> +1 that many things from spark-sql / DataFrame is universally
>> desirable and useful.
>>
>> By the way, one thing that prevents the columnar compression stuff in
>> Spark SQL from being more useful is, at least from previous talks with
>> Reynold and Michael et al., that the format was not designed for
>> persistence.
>>
>> I have a new project that aims to change that.  It is a
>> zero-serialisation, high performance binary vector library, designed
>> from the outset to be a persistent storage friendly.  May be one day
>> it can replace the Spark SQL columnar compression.
>>
>> Michael told me this would be a lot of work, and recreates parts of
>> Parquet, but I think it's worth it.  LMK if you'd like more details.
>>
>> -Evan
>>
>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin  wrote:
>> > Alright I have merged the patch (
>> > https://github.com/apache/spark/pull/4173
>> > ) since I don't see any strong opinions against it (as a matter of fact
>> > most were for it). We can still change it if somebody lays out a strong
>> > argument.
>> >
>> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > 
>> > wrote:
>> >
>> >> The type alias means your methods can specify either type and they will
>> >> work. It's just another name for the same type. But Scaladocs and such
>> >> will
>> >> show DataFrame as the type.
>> >>
>> >> Matei
>> >>
>> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> dirceu.semigh...@gmail.com> wrote:
>> >> >
>> >> > Reynold,
>> >> > But with type alias we will have the same problem, right?
>> >> > If the methods doesn't receive schemardd anymore, we will have to
>> >> > change
>> >> > our code to migrade from schema to dataframe. Unless we have an
>> >> > implicit
>> >> > conversion between DataFrame and SchemaRDD
>> >> >
>> >> >
>> >> >
>> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
>> >> >
>> >> >> Dirceu,
>> >> >>
>> >> >> That is not possible because one cannot overload return types.
>> >> >>
>> >> >> SQLContext.parquetFile (and many other methods) needs to return some
>> >> type,
>> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> >> >>
>> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
>> >> >> to
>> >> not
>> >> >> break source compatibility for Scala.
>> >> >>
>> >> >>
>> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> >> dirceu.semigh...@gmail.com> wrote:
>> >> >>
>> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed
>> >> >>> in
>> >> the
>> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> >> DataFrame?
>> >> >>> With this, we don't impact in existing code for the next few
>> >> >>> releases.
>> >> >>>
>> >> >>

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
Hey guys,

How does this impact the data sources API?  I was planning on using
this for a project.

+1 that many things from spark-sql / DataFrame are universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff in
Spark SQL from being more useful is, at least from previous talks with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that.  It is a
zero-serialisation, high-performance binary vector library, designed
from the outset to be persistent-storage friendly.  Maybe one day
it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of
Parquet, but I think it's worth it.  LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin  wrote:
> Alright I have merged the patch ( https://github.com/apache/spark/pull/4173
> ) since I don't see any strong opinions against it (as a matter of fact
> most were for it). We can still change it if somebody lays out a strong
> argument.
>
> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia 
> wrote:
>
>> The type alias means your methods can specify either type and they will
>> work. It's just another name for the same type. But Scaladocs and such will
>> show DataFrame as the type.
>>
>> Matei
>>
>> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>> >
>> > Reynold,
>> > But with type alias we will have the same problem, right?
>> > If the methods doesn't receive schemardd anymore, we will have to change
>> > our code to migrade from schema to dataframe. Unless we have an implicit
>> > conversion between DataFrame and SchemaRDD
>> >
>> >
>> >
>> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
>> >
>> >> Dirceu,
>> >>
>> >> That is not possible because one cannot overload return types.
>> >>
>> >> SQLContext.parquetFile (and many other methods) needs to return some
>> type,
>> >> and that type cannot be both SchemaRDD and DataFrame.
>> >>
>> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
>> not
>> >> break source compatibility for Scala.
>> >>
>> >>
>> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> dirceu.semigh...@gmail.com> wrote:
>> >>
>> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>> the
>> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> DataFrame?
>> >>> With this, we don't impact in existing code for the next few releases.
>> >>>
>> >>>
>> >>>
>> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
>> >>>
>>  I want to address the issue that Matei raised about the heavy lifting
>>  required for a full SQL support. It is amazing that even after 30
>> years
>> >>> of
>>  research there is not a single good open source columnar database like
>>  Vertica. There is a column store option in MySQL, but it is not nearly
>> >>> as
>>  sophisticated as Vertica or MonetDB. But there's a true need for such
>> a
>>  system. I wonder why so and it's high time to change that.
>>  On Jan 26, 2015 5:47 PM, "Sandy Ryza" 
>> wrote:
>> 
>> > Both SchemaRDD and DataFrame sound fine to me, though I like the
>> >>> former
>> > slightly better because it's more descriptive.
>> >
>> > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
>> >>> would
>> > be more clear from a user-facing perspective to at least choose a
>> >>> package
>> > name for it that omits "sql".
>> >
>> > I would also be in favor of adding a separate Spark Schema module for
>>  Spark
>> > SQL to rely on, but I imagine that might be too large a change at
>> this
>> > point?
>> >
>> > -Sandy
>> >
>> > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >>> matei.zaha...@gmail.com>
>> > wrote:
>> >
>> >> (Actually when we designed Spark SQL we thought of giving it another
>> > name,
>> >> like Spark Schema, but we decided to stick with SQL since that was
>> >>> the
>> > most
>> >> obvious use case to many users.)
>> >>
>> >> Matei
>> >>
>> >>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >>> matei.zaha...@gmail.com>
>> >> wrote:
>> >>>
>> >>> While it might be possible to move this concept to Spark Core
>> > long-term,
>> >> supporting structured data efficiently does require quite a bit of
>> >>> the
>> >> infrastructure in Spark SQL, such as query planning and columnar
>>  storage.
>> >> The intent of Spark SQL though is to be more than a SQL server --
>> >>> it's
>> >> meant to be a library for manipulating structured data. Since this
>> >>> is
>> >> possible to build over the core API, it's pretty natural to
>> >>> organize it
>> >> that way, same as Spark Streaming is a library.
>> >>>
>> >>> Matei
>> >>>
>>  On Jan 26, 2015, at 4:26 PM, Koert Kuipers 
>>  wrote:
>> 

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
Ashwin,

I would say the strategies in general are:

1) Have each user submit a separate Spark app (each with its own
SparkContext) and its own resource settings, and share data through
HDFS or something like Tachyon for speed.

2) Share a single SparkContext amongst multiple users, using the fair
scheduler (see the sketch below).  This is sort of like having a Hadoop
resource pool.  It has some obvious HA/SPOF issues, namely that if the
context dies then every user using it is also dead.  Also, sharing RDDs
in cached memory has the same resiliency problems, namely that if any
executor dies then Spark must recompute / rebuild the RDD (it tries to
rebuild only the missing part, but sometimes it must rebuild everything).
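
A minimal sketch of the scheduling setup for (2), assuming the standard
Spark fair-scheduler configuration (the pool name and the path to
fairscheduler.xml are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Enable fair scheduling within the shared context and point it at a pool
// definition file (an XML file listing pools, weights, and minShare).
val conf = new SparkConf()
  .setAppName("shared-context")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val sc = new SparkContext(conf)

// Each user's thread tags its jobs with a pool so jobs get a fair share.
sc.setLocalProperty("spark.scheduler.pool", "userA")
// ... run userA's jobs on this thread ...
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool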

Job server can help with 1 or 2, 2 in particular.  If you have any
questions about job server, feel free to ask at the spark-jobserver
google group.   I am the maintainer.

-Evan


On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin  wrote:
> You may want to take a look at 
> https://issues.apache.org/jira/browse/SPARK-3174.
>
> On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang  
> wrote:
>> Upvote for the multitanency requirement.
>>
>> I'm also building a data analytic platform and there'll be multiple users
>> running queries and computations simultaneously. One of the paint point is
>> control of resource size. Users don't really know how much nodes they need,
>> they always use as much as possible... The result is lots of wasted resource
>> in our Yarn cluster.
>>
>> A way to 1) allow multiple spark context to share the same resource or 2)
>> add dynamic resource management for Yarn mode is very much wanted.
>>
>> Jianshi
>>
>> On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin  wrote:
>>>
>>> On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
>>>  wrote:
>>> >> That's not something you might want to do usually. In general, a
>>> >> SparkContext maps to a user application
>>> >
>>> > My question was basically this. In this page in the official doc, under
>>> > "Scheduling within an application" section, it talks about multiuser and
>>> > fair sharing within an app. How does multiuser within an application
>>> > work(how users connect to an app,run their stuff) ? When would I want to
>>> > use
>>> > this ?
>>>
>>> I see. The way I read that page is that Spark supports all those
>>> scheduling options; but Spark doesn't give you the means to actually
>>> be able to submit jobs from different users to a running SparkContext
>>> hosted on a different process. For that, you'll need something like
>>> the job server that I referenced before, or write your own framework
>>> for supporting that.
>>>
>>> Personally, I'd use the information on that page when dealing with
>>> concurrent jobs in the same SparkContext, but still restricted to the
>>> same user. I'd avoid trying to create any application where a single
>>> SparkContext is trying to be shared by multiple users in any way.
>>>
>>> >> As far as I understand, this will cause executors to be killed, which
>>> >> means that Spark will start retrying tasks to rebuild the data that
>>> >> was held by those executors when needed.
>>> >
>>> > I basically wanted to find out if there were any "gotchas" related to
>>> > preemption on Spark. Things like say half of an application's executors
>>> > got
>>> > preempted say while doing reduceByKey, will the application progress
>>> > with
>>> > the remaining resources/fair share ?
>>>
>>> Jobs should still make progress as long as at least one executor is
>>> available. The gotcha would be the one I mentioned, where Spark will
>>> fail your job after "x" executors failed, which might be a common
>>> occurrence when preemption is enabled. That being said, it's a
>>> configurable option, so you can set "x" to a very large value and your
>>> job should keep on chugging along.
>>>
>>> The options you'd want to take a look at are: spark.task.maxFailures
>>> and spark.yarn.max.executor.failures
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Evan Chan
James,

Michael at the meetup last night said there was some development
activity around ORCFiles.

I'm curious though, what are the pros and cons of ORCFiles vs Parquet?

On Wed, Oct 8, 2014 at 10:03 AM, James Yu  wrote:
> Didn't see anyone asked the question before, but I was wondering if anyone
> knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
> getting more and more popular in the Hive world.
>
> Thanks,
> James

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Spark SQL] off-heap columnar store

2014-09-02 Thread Evan Chan
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell  wrote:
> I'm not sure what you mean here? Parquet is at its core just a format, you
> could store that data anywhere.
>
> Though it sounds like you saying, correct me if i'm wrong: you basically
> want a columnar abstraction layer where you can provide a different backing
> implementation to keep the columns rather than parquet-mr?
>
> I.e. you want to be able to produce a schema RDD from something like
> vertica, where updates should act as a write through cache back to vertica
> itself?

Something like that.

I'd like:

1)  An API to produce a schema RDD from an RDD of columns, not rows.
However, an RDD[Column] would not make sense, since it would be
spread out across partitions.  Perhaps what is needed is a
Seq[RDD[ColumnSegment]].  The idea is that each RDD would hold the
segments for one column, where each segment represents a range of rows.
This would then read from something like Vertica or Cassandra (see the
rough sketch below).

2)  A variant of 1) where you could read this data from Tachyon.
Tachyon is supposed to support a columnar representation of data; it
did for Shark 0.9.x.

The goal is basically to load columnar data from something like
Cassandra into Tachyon, with the compression ratio of columnar
storage and the speed of InMemoryColumnarTableScan.  If data is
appended to the Tachyon representation, it should be possible to write
it back.  The write-back is not as high a priority, though.

A workaround would be to read data from Cassandra/Vertica/etc. and
write back into Parquet, but this would take a long time and incur
huge I/O overhead.
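
A rough sketch of what the API in 1) might look like; none of this exists in
Spark today, and the names (ColumnSegment, schemaRDDFromColumns) are purely
hypothetical:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Hypothetical type: one segment holds a single column's values for a
// contiguous range of rows (e.g. read from Cassandra or Vertica).
case class ColumnSegment(columnName: String, startRow: Long, values: Array[Any])

object ColumnarIngest {
  // Hypothetical entry point: one RDD per column, each holding that
  // column's segments.  An implementation would align segments by row
  // range across the Seq and feed them to a columnar scan (or zip them
  // back into Rows).  Left unimplemented here on purpose.
  def schemaRDDFromColumns(sqlContext: SQLContext,
                           columns: Seq[RDD[ColumnSegment]]): SchemaRDD = ???
}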

>
> I'm sorry it just sounds like its worth clearly defining what your key
> requirement/goal is.
>
>
> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan  wrote:
>>
>> >
>> >> The reason I'm asking about the columnar compressed format is that
>> >> there are some problems for which Parquet is not practical.
>> >
>> >
>> > Can you elaborate?
>>
>> Sure.
>>
>> - Organization or co has no Hadoop, but significant investment in some
>> other NoSQL store.
>> - Need to efficiently add a new column to existing data
>> - Need to mark some existing rows as deleted or replace small bits of
>> existing data
>>
>> For these use cases, it would be much more efficient and practical if
>> we didn't have to take the origin of the data from the datastore,
>> convert it to Parquet first.  Doing so loses significant latency and
>> causes Ops headaches in having to maintain HDFS. It would be great
>> to be able to load data directly into the columnar format, into the
>> InMemoryColumnarCache.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Spark SQL] off-heap columnar store

2014-08-28 Thread Evan Chan
>
>> The reason I'm asking about the columnar compressed format is that
>> there are some problems for which Parquet is not practical.
>
>
> Can you elaborate?

Sure.

- Organization or co has no Hadoop, but significant investment in some
other NoSQL store.
- Need to efficiently add a new column to existing data
- Need to mark some existing rows as deleted or replace small bits of
existing data

For these use cases, it would be much more efficient and practical if
we didn't have to pull the data out of the original datastore and
convert it to Parquet first.  Doing so adds significant latency and
causes Ops headaches from having to maintain HDFS.  It would be great
to be able to load data directly into the columnar format, into the
InMemoryColumnarCache.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Evan Chan
What would be the timeline for the parquet caching work?

The reason I'm asking about the columnar compressed format is that
there are some problems for which Parquet is not practical.

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
 wrote:
>> What is the plan for getting Tachyon/off-heap support for the columnar
>> compressed store?  It's not in 1.1 is it?
>
>
> It is not in 1.1 and there are not concrete plans for adding it at this
> point.  Currently, there is more engineering investment going into caching
> parquet data in Tachyon instead.  This approach is going to have much better
> support for nested data, leverages other work being done on parquet, and
> alleviates your concerns about wire format compatibility.
>
> That said, if someone really wants to try and implement it, I don't think it
> would be very hard.  The primary issue is going to be designing a clean
> interface that is not too tied to this one implementation.
>
>>
>> Also, how likely is the wire format for the columnar compressed data
>> to change?  That would be a problem for write-through or persistence.
>
>
> We aren't making any guarantees at the moment that it won't change.  Its
> currently only intended for temporary caching of data.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[Spark SQL] off-heap columnar store

2014-08-22 Thread Evan Chan
Hey guys,

What is the plan for getting Tachyon/off-heap support for the columnar
compressed store?  It's not in 1.1, is it?

In particular:
 - being able to set TACHYON as the caching mode
 - loading of hot columns or all columns
 - write-through of columnar store data to HDFS or backing store
 - being able to start a context and query directly from Tachyon's
cached columnar data

I think most of this was in Shark 0.9.1.

Also, how likely is the wire format for the columnar compressed data
to change?  That would be a problem for write-through or persistence.

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Too late to contribute for 1.1.0?

2014-08-21 Thread Evan Chan
I'm hoping to get in some doc enhancements and small bug fixes for Spark SQL.

Also possibly a small new API to list the tables in sqlContext.

Oh, and to get in the doc page I had talked about before: a list of
community Spark projects.

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark-JobServer moving to a new location

2014-08-21 Thread Evan Chan
Dear community,

Wow, I remember when we first open sourced the job server, at the
first Spark Summit in December.  Since then, more and more of you have
started using it and contributing to it.   It is awesome to see!

If you are not familiar with the Spark Job Server, it is a REST API
for managing your Spark jobs, job history, and status.

In order to make sure the project can continue to move forward
independently, with new features developed and contributions merged, we
are moving the project to a new GitHub organization.  The new location is:

https://github.com/spark-jobserver/spark-jobserver


The git commit history is still there, but unfortunately the pull
requests don't migrate over.   I'll be contacting each of you with
open PRs to move them over to the new location.

Happy Hacking!

Evan (@velvia)
Kelvin (@kelvinchu)
Daniel (@dan-null)

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Evan Chan
I doubt Titan would be able to give you traversal of billions of nodes in
real-time either.   In-memory traversal is typically much faster than
Cassandra-based tree traversal, even including in-memory caching.


On Tue, Apr 8, 2014 at 1:23 PM, Nick Pentreath wrote:

> GraphX, like Spark, will not typically be "real-time" (where by "real-time"
> here I assume you mean of the order of a few 10s-100s ms, up to a few
> seconds).
>
> Spark can in some cases approach the upper boundary of this definition (a
> second or two, possibly less) when data is cached in memory and the
> computation is not "too heavy", while Spark Streaming may be able to get
> closer to the mid-to-upper boundary of this under similar conditions,
> especially if aggregating over relatively small windows.
>
> However, for this use case (while I haven't used GraphX yet) I would say
> something like Titan (https://github.com/thinkaurelius/titan/wiki) or a
> similar OLTP graph DB may be what you're after. But this depends on what
> kind of graph traversal you need.
>
>
>
>
> On Tue, Apr 8, 2014 at 10:02 PM, love2dishtech  >wrote:
>
> > Hi,
> >
> > Is Graphx on top of Apache Spark, is able to process the large scale
> > distributed graph traversal and compute, in real time. What is the query
> > execution engine distributing the query on top of graphx and apache
> spark.
> > My typical use case is a large scale distributed graph traversal in real
> > time, with billions of nodes.
> >
> > Thanks,
> > Love.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-and-Graphx-for-Real-Time-Analytics-tp6261.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



Would anyone mind having a quick look at PR#288?

2014-04-02 Thread Evan Chan
https://github.com/apache/spark/pull/288

It's for fixing SPARK-1154, which would help Spark be a better citizen for
most deploys, and should be really small and easy to review.

thanks,
Evan


-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



Re: sbt-package-bin

2014-04-02 Thread Evan Chan
Lee, sorry, I actually meant exactly that, sbt-native-packager.


On Tue, Apr 1, 2014 at 8:14 PM, Lee Mighdoll  wrote:

> > What do folks think of adding the sbt-package-bin plugin to allow easy
> > creation of a Spark .deb file?  I believe it adds all dependency jars
> into
> > a single lib/ folder, so in some ways it's even easier to manage than the
> > assembly.
> >
>
> You might also check out the
> sbt-native-packager<https://github.com/sbt/sbt-native-packager>.
>
>
> Cheers,
> Lee
>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



Re: sbt-package-bin

2014-04-01 Thread Evan Chan
Mark - sorry, would you mind expanding what the "" is?

Something like

mvn -Pdeb package

?

I get:

[ERROR] Plugin org.apache.maven.plugins:maven-compiler-plugin:3.1 or one of
its dependencies could not be resolved: Failed to read artifact descriptor
for org.apache.maven.plugins:maven-compiler-plugin:jar:3.1: Could not find
artifact org.apache:apache:pom:13 -> [Help 1]


On Tue, Apr 1, 2014 at 11:36 AM, Patrick Wendell  wrote:

> And there is a deb target as well - ah didn't see Mark's email.
>
>
> On Tue, Apr 1, 2014 at 11:36 AM, Patrick Wendell 
> wrote:
>
> > Ya there is already some fragmentation here. Maven has some "dist"
> targets
> > and there is also ./make-distribution.sh.
> >
> >
> > On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra  >wrote:
> >
> >> A basic Debian package can already be created from the Maven build: mvn
> >> -Pdeb ...
> >>
> >>
> >> On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:
> >>
> >> > Also, I understand this is the last week / merge window for 1.0, so if
> >> > folks are interested I'd like to get in a PR quickly.
> >> >
> >> > thanks,
> >> > Evan
> >> >
> >> >
> >> >
> >> > On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:
> >> >
> >> > > Hey folks,
> >> > >
> >> > > We are in the middle of creating a Chef recipe for Spark.   As part
> of
> >> > > that we want to create a Debian package for Spark.
> >> > >
> >> > > What do folks think of adding the sbt-package-bin plugin to allow
> easy
> >> > > creation of a Spark .deb file?  I believe it adds all dependency
> jars
> >> > into
> >> > > a single lib/ folder, so in some ways it's even easier to manage
> than
> >> the
> >> > > assembly.
> >> > >
> >> > > Also I'm not sure if there's an equivalent plugin for Maven.
> >> > >
> >> > > thanks,
> >> > > Evan
> >> > >
> >> > >
> >> > > --
> >> > > --
> >> > >  Evan Chan
> >> > > Staff Engineer
> >> > > e...@ooyala.com  |
> >> > >
> >> > > <http://www.ooyala.com/> <http://www.facebook.com/ooyala><
> >> > http://www.linkedin.com/company/ooyala><http://www.twitter.com/ooyala
> >
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > --
> >> > Evan Chan
> >> > Staff Engineer
> >> > e...@ooyala.com  |
> >> >
> >> > <http://www.ooyala.com/>
> >> > <http://www.facebook.com/ooyala><
> http://www.linkedin.com/company/ooyala
> >> ><
> >> > http://www.twitter.com/ooyala>
> >> >
> >>
> >
> >
>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



Re: sbt-package-bin

2014-04-01 Thread Evan Chan
Also, I understand this is the last week / merge window for 1.0, so if
folks are interested I'd like to get in a PR quickly.

thanks,
Evan



On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan  wrote:

> Hey folks,
>
> We are in the middle of creating a Chef recipe for Spark.   As part of
> that we want to create a Debian package for Spark.
>
> What do folks think of adding the sbt-package-bin plugin to allow easy
> creation of a Spark .deb file?  I believe it adds all dependency jars into
> a single lib/ folder, so in some ways it's even easier to manage than the
> assembly.
>
> Also I'm not sure if there's an equivalent plugin for Maven.
>
> thanks,
> Evan
>
>
> --
> --
>  Evan Chan
> Staff Engineer
> e...@ooyala.com  |
>
> <http://www.ooyala.com/> 
> <http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><http://www.twitter.com/ooyala>
>
>


-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



sbt-package-bin

2014-04-01 Thread Evan Chan
Hey folks,

We are in the middle of creating a Chef recipe for Spark.   As part of that
we want to create a Debian package for Spark.

What do folks think of adding the sbt-package-bin plugin to allow easy
creation of a Spark .deb file?  I believe it adds all dependency jars into
a single lib/ folder, so in some ways it's even easier to manage than the
assembly.

Also I'm not sure if there's an equivalent plugin for Maven.

thanks,
Evan


-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



Re: [DISCUSS] Shepherding PRs

2014-03-27 Thread Evan Chan
+1.

I really like this idea.  I know having a shepherd would have been really
helpful for a couple of changes.


On Thu, Mar 27, 2014 at 8:18 AM, Andy Konwinski wrote:

> I thought this email exchange from the Mesos dev list was worth sharing.
> The Mesos project is trying out a process wherein they assign shepherds
> (who are committers) to significant issues.
>
> I'm not proposing that this necessarily makes sense for us, but I thought
> it might be worth discussing.
>
> Andy
> -- Forwarded message --
> From: "Benjamin Mahler" 
> Date: Mar 24, 2014 11:47 PM
> Subject: Re: Shepherding on ExternalContainerizer
> To: "dev" 
> Cc:
>
> Hey Till,
>
> We want to foster a healthy review culture, and so, as you observed, we
> thought we would try out the notion of having a "shepherd" for each review.
>
> In the past we've had some reviews stagnate because there was no clear
> accountability for getting it committed. Meaning, various committers would
> be included in the 'Reviewers' and each would provide feedback
> independently, but there was no single person accountable for "shepherding"
> the change to a shippable state, and ultimately committing it.
>
> We've also had issues with having a lot of lower value reviews crowding out
> higher value reviews. Often these lower value reviews are things like
> cleanup, refactoring, etc, which tend to be easier to review. Shepherding
> doesn't address this as directly, but it is also an effort to ensure we
> balance low value changes (technical debt, refactoring, cleanup, etc) with
> higher value changes (features, bug fixes, etc) via shepherd assignment.
>
> This is why we've been trying out the "shepherd" concept.
>
> Related to this (and *not* related to your changes Till :)), I would
> encourage two behaviors from "reviewees" to ameliorate the situation:
>
> 1. Please be cognizant of the fact that reviewing tends to be a bottleneck
> and that reviewer time is currently at a premium. This means, please be
> very thorough in your work and also look over your patches before sending
> them out. This saves your time (faster reviews) and reviewers' time (fewer
> comments needed). Feel free to reach out for feedback before sending out
> reviews as well (if feasible).
>
> 2. Also, be cognizant of the fact that we need to balance low and high
> priority reviews. Sometimes we don't have time to review low value cleanup
> work when there are a lot of things in flight. For example, I have a bunch
> of old cleanup patches from when we need to get more important things
> committed, and I know Vinod has old cleanup patches like this as well.
>
> This all being said, the external containerizer is high value and should
> definitely be getting reviews. I will take some time to go over your
> changes later this week with Ian, when I'll be free from a deadline ;). We
> can help "pair shepherd" your changes.
>
> Ben
>
>
> On Mon, Mar 24, 2014 at 4:32 PM, Till Toenshoff  wrote:
>
> > Dear Devs/Committers,
> >
> > after having developed the ExternalContainerizer, I am now obviously
> eager
> > to get it committed. After receiving and addressing a couple of comments
> > (thanks @all who commented - that helped a lot), I now am once again in a
> > stage of waiting and keeping fingers crossed that my patch won't need
> > rebasing before someone has a thorough look at it. I do appreciate and
> > fully understand the fact that you committers are under heavy load.
> >
> > By experience and seeing some RR comments, I learned that there appears
> to
> > be a new entity in our review process; a "shepherd". Sounds like a great
> > idea, even though I am not entirely sure what that means in detail for
> > Mesos. I guess that is something that makes sure that final commit
> > decisions  are done by a single voice, preventing contradicting comments
> > etc... Knowing that other projects actually demand the patch-submitter to
> ask
> > for shepherding, I figured why not doing the same.
> >
> > For that ExternalContainerizer baby, I would kindly like to call out for
> a
> > shepherd. Guessing that a shepherd needs to be a committer but also
> knowing
> > that Ian is very deeply involved within containerizing, I would like to
> > "nominate" Niklas as a committer in collaboration with Ian. Hope that
> makes
> > sense and don't hesitate to tell me that this was not the right way to
> > achieve shepherding.
> >
> > cheers!
> > Till
> >
> >
>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |



Re: new Catalyst/SQL component merged into master

2014-03-25 Thread Evan Chan
HI Michael,

It's not publicly available right now, though we can probably chat
about it offline.   It's not a super novel concept or anything; in
fact, I proposed it a long time ago on the mailing lists.

-Evan

On Mon, Mar 24, 2014 at 1:34 PM, Michael Armbrust
 wrote:
> Hi Evan,
>
> Index support is definitely something we would like to add, and it is
> possible that adding support for your custom indexing solution would not be
> too difficult.
>
> We already push predicates into hive table scan operators when the
> predicates are over partition keys.  You can see an example of how we
> collect filters and decide which can pushed into the scan using the
> HiveTableScan query planning strategy.
>
> I'd like to know more about your indexing solution.  Is this something
> publicly available?  One concern here is that the query planning code is not
> considered a public API and so is likely to change quite a bit as we improve
> the optimizer.  Its not currently something that we plan to expose for
> external components to modify.
>
> Michael
>
>
> On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan  wrote:
>>
>> Hi Michael,
>>
>> Congrats, this is really neat!
>>
>> What thoughts do you have regarding adding indexing support and
>> predicate pushdown to this SQL framework?Right now we have custom
>> bitmap indexing to speed up queries, so we're really curious as far as
>> the architectural direction.
>>
>> -Evan
>>
>>
>> On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
>>  wrote:
>> >>
>> >> It will be great if there are any examples or usecases to look at ?
>> >>
>> > There are examples in the Spark documentation.  Patrick posted and
>> > updated
>> > copy here so people can see them before 1.0 is released:
>> >
>> > http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
>> >
>> >> Does this feature has different usecases than shark or more cleaner as
>> >> hive dependency is gone?
>> >>
>> > Depending on how you use this, there is still a dependency on Hive (By
>> > default this is not the case.  See the above documentation for more
>> > details).  However, the dependency is on a stock version of Hive instead
>> > of
>> > one modified by the AMPLab.  Furthermore, Spark SQL has its own
>> > optimizer,
>> > instead of relying on the Hive optimizer.  Long term, this is going to
>> > give
>> > us a lot more flexibility to optimize queries specifically for the Spark
>> > execution engine.  We are actively porting over the best parts of shark
>> > (specifically the in-memory columnar representation).
>> >
>> > Shark still has some features that are missing in Spark SQL, including
>> > SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
>> > status, it'll likely become the new backend for Shark.
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>
>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-25 Thread Evan Chan
Hey guys,

I think SPARK-1138 should be resolved before releasing Spark 0.9.1.
It's affecting multiple users' ability to use Spark 0.9 with various
versions of Hadoop.
I have one fix but not sure if it works for others.

-Evan


On Mon, Mar 24, 2014 at 5:30 PM, Tathagata Das
 wrote:
> Hello Kevin,
>
> A fix for SPARK-782 would definitely simplify building against Spark.
> However, its possible that a fix for this issue in 0.9.1 will break
> the builds (that reference spark) of existing 0.9 users, either due to
> a change in the ASM version, or for being incompatible with their
> current workarounds for this issue. That is not a good idea for a
> maintenance release, especially when 1.0 is not too far away.
>
> Can you (and others) elaborate more on the current workarounds that
> you have for this issue? Its best to understand all the implications
> of this fix.
>
> Note that in branch 0.9, it is not fixed, neither in SBT nor in Maven.
>
> TD
>
> On Mon, Mar 24, 2014 at 4:38 PM, Kevin Markey  wrote:
>> Is there any way that [SPARK-782] (Shade ASM) can be included?  I see that
>> it is not currently backported to 0.9.  But there is no single issue that
>> has caused us more grief as we integrate spark-core with other project
>> dependencies.  There are way too many libraries out there in addition to
>> Spark 0.9 and before that are not well-behaved (ASM FAQ recommends shading),
>> including some Hive and Hadoop libraries and a number of servlet libraries.
>> We can't control those, but if Spark were well behaved in this regard, it
>> would help.  Even for a maintenance release, and even if 1.0 is only 6 weeks
>> away!
>>
>> (For those not following 782, according to Jira comments, the SBT build
>> shades it, but it is the Maven build that ends up in Maven Central.)
>>
>> Thanks
>> Kevin Markey
>>
>>
>>
>>
>> On 03/19/2014 06:07 PM, Tathagata Das wrote:
>>>
>>>   Hello everyone,
>>>
>>> Since the release of Spark 0.9, we have received a number of important bug
>>> fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
>>> going to cut a release candidate soon and we would love it if people test
>>> it out. We have backported several bug fixes into the 0.9 and updated JIRA
>>>
>>> accordingly<https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)>.
>>>
>>> Please let me know if there are fixes that were not backported but you
>>> would like to see them in 0.9.1.
>>>
>>> Thanks!
>>>
>>> TD
>>>
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: spark jobserver

2014-03-24 Thread Evan Chan
Suhas,

You're welcome.  We are planning to speak about the job server at the
Spark Summit by the way.

-Evan


On Mon, Mar 24, 2014 at 9:38 AM, Suhas Satish  wrote:
> Thanks a lot for this update Evan , really appreciate the effort.
>
> On Monday, March 24, 2014, Evan Chan  wrote:
>
>> Suhas, here is the update, which I posted to SPARK-818:
>>
>> An update: we have put up the final job server here:
>> https://github.com/ooyala/spark-jobserver
>>
>> The plan is to have a spark-contrib repo/github account and this would
>> be one of the first projects.
>>
>> See SPARK-1283 for the ticket to track spark-contrib.
>>
>> On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish 
>> >
>> wrote:
>> > Any plans of integrating SPARK-818 into spark trunk ? The pull request is
>> > open.
>> > It offers spark as a service with spark jobserver running as a separate
>> > process.
>> >
>> >
>> > Thanks,
>> > Suhas.
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com   |
>>
>
>
> --
> Cheers,
> Suhas.



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-24 Thread Evan Chan
Patrick, yes, that is indeed a risk.

On Mon, Mar 24, 2014 at 12:30 AM, Tathagata Das
 wrote:
> Patrick, that is a good point.
>
>
> On Mon, Mar 24, 2014 at 12:14 AM, Patrick Wendell wrote:
>
>> > Spark's dependency graph in a maintenance
>> *Modifying* Spark's dependency graph...
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: spark jobserver

2014-03-24 Thread Evan Chan
Suhas, here is the update, which I posted to SPARK-818:

An update: we have put up the final job server here:
https://github.com/ooyala/spark-jobserver

The plan is to have a spark-contrib repo/github account and this would
be one of the first projects.

See SPARK-1283 for the ticket to track spark-contrib.

On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish  wrote:
> Any plans of integrating SPARK-818 into spark trunk ? The pull request is
> open.
> It offers spark as a service with spark jobserver running as a separate
> process.
>
>
> Thanks,
> Suhas.



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-24 Thread Evan Chan
@Tathagata,  the PR is here:
https://github.com/apache/spark/pull/215

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das
 wrote:
> @Shivaram, That is a useful patch but I am bit afraid merge it in.
> Randomizing the executor has performance implications, especially for Spark
> Streaming. The non-randomized ordering of allocating machines to tasks was
> subtly helping to speed up certain window-based shuffle operations.  For
> example, corresponding shuffle partitions in multiple shuffles using the
> same partitioner were likely to be co-located, that is, shuffle partition 0
> were likely to be on the same machine for multiple shuffles. While this is
> not a reliable mechanism to rely on, randomization may lead to
> performance degradation. So I am afraid to merge this one without
> understanding the consequences.
>
> @Evan, I have already cut a release! You can submit the PR and we can merge
> it branch-0.9. If we have to cut another release, then we can include it.
>
>
>
> On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan  wrote:
>
>> I also have a really minor fix for SPARK-1057  (upgrading fastutil),
>> could that also make it in?
>>
>> -Evan
>>
>>
>> On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
>>  wrote:
>> > Sorry this request is coming in a bit late, but would it be possible to
>> > backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing
>> > executor offers and I would like to use this in a release sooner rather
>> > than later.
>> >
>> > Thanks
>> > Shivaram
>> >
>> > [1]
>> >
>> https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd
>> >
>> >
>> > On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta 
>> wrote:
>> >
>> >> Thank You! We plan to test out 0.9.1 on YARN once it is out.
>> >>
>> >> Regards,
>> >> Bhaskar
>> >>
>> >> On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves 
>> wrote:
>> >>
>> >> > I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when
>> running
>> >> > on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as
>> >> > submitting user - JIRA in.  The pyspark one I would consider more of
>> an
>> >> > enhancement so might not be appropriate for a point release.
>> >> >
>> >> >
>> >> >  [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on
>> YA...
>> >> > org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
>> >> >
>> >>
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
>> >> > at org.apache.spark.schedule...
>> >> > View on spark-project.atlassian.net Preview by Yahoo
>> >> >
>> >> >
>> >> >  [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
>> >> > This means that they can't write/read from files that the yarn user
>> >> > doesn't have permissions to but the submitting user does.
>> >> > View on spark-project.atlassian.net Preview by Yahoo
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta > >
>> >> > wrote:
>> >> >
>> >> > It will be great if
>> >> > "SPARK-1101<https://spark-project.atlassian.net/browse/SPARK-1101>:
>> >> > Umbrella
>> >> > for hardening Spark on YARN" can get into 0.9.1.
>> >> >
>> >> > Thanks,
>> >> > Bhaskar
>> >> >
>> >> >
>> >> > On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
>> >> > wrote:
>> >> >
>> >> > >  Hello everyone,
>> >> > >
>> >> > > Since the release of Spark 0.9, we have received a number of
>> important
>> >> > bug
>> >> > > fixes and we would like to make a bug-fix release of Spark 0.9.1. We
>> >> are
>> >> > > going to cut a release candidate soon and we would love it if people
>> >> test
>> >> > > it out. We have backported several bug fixes into the 0.9 and
>> updated
>> >> > JIRA
>> >> > > accordingly<
>> >> > >
>> >> >
>> >>
>> https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
>> >> > > >.
>> >> > > Please let me know if there are fixes that were not backported but
>> you
>> >> > > would like to see them in 0.9.1.
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > > TD
>> >> > >
>> >> >
>> >>
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: new Catalyst/SQL component merged into master

2014-03-23 Thread Evan Chan
Hi Michael,

Congrats, this is really neat!

What thoughts do you have regarding adding indexing support and
predicate pushdown to this SQL framework?Right now we have custom
bitmap indexing to speed up queries, so we're really curious as far as
the architectural direction.

-Evan


On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
 wrote:
>>
>> It will be great if there are any examples or usecases to look at ?
>>
> There are examples in the Spark documentation.  Patrick posted and updated
> copy here so people can see them before 1.0 is released:
> http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
>
>> Does this feature has different usecases than shark or more cleaner as
>> hive dependency is gone?
>>
> Depending on how you use this, there is still a dependency on Hive (By
> default this is not the case.  See the above documentation for more
> details).  However, the dependency is on a stock version of Hive instead of
> one modified by the AMPLab.  Furthermore, Spark SQL has its own optimizer,
> instead of relying on the Hive optimizer.  Long term, this is going to give
> us a lot more flexibility to optimize queries specifically for the Spark
> execution engine.  We are actively porting over the best parts of shark
> (specifically the in-memory columnar representation).
>
> Shark still has some features that are missing in Spark SQL, including
> SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
> status, it'll likely become the new backend for Shark.



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-23 Thread Evan Chan
I also have a really minor fix for SPARK-1057  (upgrading fastutil),
could that also make it in?

-Evan


On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
 wrote:
> Sorry this request is coming in a bit late, but would it be possible to
> backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing
> executor offers and I would like to use this in a release sooner rather
> than later.
>
> Thanks
> Shivaram
>
> [1]
> https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd
>
>
> On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta  wrote:
>
>> Thank You! We plan to test out 0.9.1 on YARN once it is out.
>>
>> Regards,
>> Bhaskar
>>
>> On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves  wrote:
>>
>> > I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running
>> > on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as
>> > submitting user - JIRA in.  The pyspark one I would consider more of an
>> > enhancement so might not be appropriate for a point release.
>> >
>> >
>> >  [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YA...
>> > org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
>> >
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
>> > at org.apache.spark.schedule...
>> > View on spark-project.atlassian.net Preview by Yahoo
>> >
>> >
>> >  [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
>> > This means that they can't write/read from files that the yarn user
>> > doesn't have permissions to but the submitting user does.
>> > View on spark-project.atlassian.net Preview by Yahoo
>> >
>> >
>> >
>> >
>> >
>> > On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta 
>> > wrote:
>> >
>> > It will be great if
>> > "SPARK-1101<https://spark-project.atlassian.net/browse/SPARK-1101>:
>> > Umbrella
>> > for hardening Spark on YARN" can get into 0.9.1.
>> >
>> > Thanks,
>> > Bhaskar
>> >
>> >
>> > On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
>> > wrote:
>> >
>> > >  Hello everyone,
>> > >
>> > > Since the release of Spark 0.9, we have received a number of important
>> > bug
>> > > fixes and we would like to make a bug-fix release of Spark 0.9.1. We
>> are
>> > > going to cut a release candidate soon and we would love it if people
>> test
>> > > it out. We have backported several bug fixes into the 0.9 and updated
>> > JIRA
>> > > accordingly<
>> > >
>> >
>> https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
>> > > >.
>> > > Please let me know if there are fixes that were not backported but you
>> > > would like to see them in 0.9.1.
>> > >
>> > > Thanks!
>> > >
>> > > TD
>> > >
>> >
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Announcing the official Spark Job Server repo

2014-03-23 Thread Evan Chan
Andy, doesn't Marathon handle fault tolerance amongst its apps?  I.e., if
you say that N instances of an app should be running and one shuts off,
it spins up another one, no?

The tricky thing was that I was planning to use Akka Cluster to
coordinate, but Mesos itself can be used to coordinate as well, which
is an overlap. But I didn't want to make the job server's HA
reliant only on Mesos... Anyway, we can discuss offline if needed.

On Thu, Mar 20, 2014 at 1:35 AM, andy petrella  wrote:
> Heya,
> That's cool you've already hacked something for this in the scripts!
>
> I have a related question, how would it work actually. I mean, to have this
> Job Server fault tolerant using Marathon, I would guess that it will need
> to be itself a Mesos framework, and able to publish its resources needs.
> And also, for that, the Job Server has to be aware of the resources needed
> by the Spark drivers that it will run, which is not as easy to guess,
> unless it is provided by the job itself?
>
> I didn't checked the Job Server deep enough so it might be already the case
> (or I'm expressing something completely dumb ^^).
>
> For sure, we'll try to share it when we'll reach this point to deploy using
> marathon (should be planned for April)
>
> greetz and again, Nice Work Evan!
>
> Ndi
>
> On Wed, Mar 19, 2014 at 7:27 AM, Evan Chan  wrote:
>
>> Andy,
>>
>> Yeah, we've thought of deploying this on Marathon ourselves, but we're
>> not sure how much Mesos we're going to use yet.   (Indeed if you look
>> at bin/server_start.sh, I think I set up the PORT environment var
>> specifically for Marathon.)This is also why we have deploy scripts
>> which package into .tar.gz, again for Mesos deployment.
>>
>> If you do try this, please let us know.  :)
>>
>> -Evan
>>
>>
>> On Tue, Mar 18, 2014 at 3:57 PM, andy petrella 
>> wrote:
>> > tad! That's awesome.
>> >
>> > A quick question, does someone has insights regarding having such
>> > JobServers deployed using Marathon on Mesos?
>> >
>> > I'm thinking about an arch where Marathon would deploy and keep the Job
>> > Servers running along with part of the whole set of apps deployed on it
>> > regarding the resources needed (à la Jenkins).
>> >
>> > Any idea is welcome.
>> >
>> > Back to the news, Evan + Ooyala team: Great Job again.
>> >
>> > andy
>> >
>> > On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra > >wrote:
>> >
>> >> W00t!
>> >>
>> >> Thanks for releasing this, Evan.
>> >>
>> >> - Henry
>> >>
>> >> On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan  wrote:
>> >> > Dear Spark developers,
>> >> >
>> >> > Ooyala is happy to announce that we have pushed our official, Spark
>> >> > 0.9.0 / Scala 2.10-compatible, job server as a github repo:
>> >> >
>> >> > https://github.com/ooyala/spark-jobserver
>> >> >
>> >> > Complete with unit tests, deploy scripts, and examples.
>> >> >
>> >> > The original PR (#222) on incubator-spark is now closed.
>> >> >
>> >> > Please have a look; pull requests are very welcome.
>> >> > --
>> >> > --
>> >> > Evan Chan
>> >> > Staff Engineer
>> >> > e...@ooyala.com  |
>> >>
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: repositories for spark jars

2014-03-19 Thread Evan Chan
The alternative is for Spark to not explicitly include hadoop-client,
or to include it only as "provided", and to offer a facility for inserting
the Hadoop client jars of your choice at packaging time.   Unfortunately,
hadoop-client pulls in a ton of other deps, so it's not as simple as
copying one extra jar into dist/jars.
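
(For illustration only: a minimal build.sbt sketch of what "provided"
scoping could look like. The artifact versions below are assumptions, not
Spark's actual build.)

libraryDependencies ++= Seq(
  // Compiled against, but excluded from the packaged artifact, so the
  // deployer can drop in whichever Hadoop client jars they need.
  "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided",
  // Ordinary dependencies keep the default compile scope and get bundled.
  "com.typesafe"      % "config"        % "1.2.0"
)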

On Mon, Mar 17, 2014 at 10:58 AM, Patrick Wendell  wrote:
> Hey Nathan,
>
> I don't think this would be possible because there are at least dozens
> of permutations of Hadoop versions (different vendor distros X
> different versions X YARN vs not YARN, etc) and maybe hundreds. So
> publishing new artifacts for each would be really difficult.
>
> What is the exact problem you ran into? Maybe we need to improve the
> documentation to make it more clear how to correctly link against
> spark/hadoop for user applications. Basically the model we have now is
> users link against Spark and then link against the hadoop-client
> relevant to their version of Hadoop.
>
> - Patrick
>
> On Mon, Mar 17, 2014 at 9:50 AM, Nathan Kronenfeld
>  wrote:
>> After just spending a couple days fighting with a new spark installation,
>> getting spark and hadoop version numbers matching everywhere, I have a
>> suggestion I'd like to put out there.
>>
>> Can we put the hadoop version against which the spark jars were built into
>> the version number?
>>
>> I noticed that the Cloudera maven repo has started to do this (
>> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.10/)
>> - sadly, though, only with the cdh5.x versions, not with the 4.x versions
>> for which they also have spark parcels.  But I see no signs of it in the
>> central maven repo.
>>
>> Is this already done in some other repo about which I don't know, perhaps?
>>
>> I know it would save us a lot of time and grief simply to be able to point
>> a project we build at the right version, and not have to rebuild and deploy
>> spark manually.
>>
>> --
>> Nathan Kronenfeld
>> Senior Visualization Developer
>> Oculus Info Inc
>> 2 Berkeley Street, Suite 600,
>> Toronto, Ontario M5A 4J5
>> Phone:  +1-416-203-3003 x 238
>> Email:  nkronenf...@oculusinfo.com



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Evan Chan
https://spark-project.atlassian.net/browse/SPARK-1283

On Wed, Mar 19, 2014 at 10:59 AM, Gerard Maas  wrote:
> this is cool +1
>
>
> On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell  wrote:
>
>> Evan - yep definitely open a JIRA. It would be nice to have a contrib
>> repo set-up for the 1.0 release.
>>
>> On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan  wrote:
>> > Matei,
>> >
>> > Maybe it's time to explore the spark-contrib idea again?   Should I
>> > start a JIRA ticket?
>> >
>> > -Evan
>> >
>> >
>> > On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia 
>> wrote:
>> >> Cool, glad to see this posted! I've added a link to it at
>> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
>> >>
>> >> Matei
>> >>
>> >> On Mar 18, 2014, at 1:51 PM, Evan Chan  wrote:
>> >>
>> >>> Dear Spark developers,
>> >>>
>> >>> Ooyala is happy to announce that we have pushed our official, Spark
>> >>> 0.9.0 / Scala 2.10-compatible, job server as a github repo:
>> >>>
>> >>> https://github.com/ooyala/spark-jobserver
>> >>>
>> >>> Complete with unit tests, deploy scripts, and examples.
>> >>>
>> >>> The original PR (#222) on incubator-spark is now closed.
>> >>>
>> >>> Please have a look; pull requests are very welcome.
>> >>> --
>> >>> --
>> >>> Evan Chan
>> >>> Staff Engineer
>> >>> e...@ooyala.com  |
>> >>
>> >
>> >
>> >
>> > --
>> > --
>> > Evan Chan
>> > Staff Engineer
>> > e...@ooyala.com  |
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Announcing the official Spark Job Server repo

2014-03-18 Thread Evan Chan
Matei,

Maybe it's time to explore the spark-contrib idea again?   Should I
start a JIRA ticket?

-Evan


On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia  wrote:
> Cool, glad to see this posted! I've added a link to it at 
> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
>
> Matei
>
> On Mar 18, 2014, at 1:51 PM, Evan Chan  wrote:
>
>> Dear Spark developers,
>>
>> Ooyala is happy to announce that we have pushed our official, Spark
>> 0.9.0 / Scala 2.10-compatible, job server as a github repo:
>>
>> https://github.com/ooyala/spark-jobserver
>>
>> Complete with unit tests, deploy scripts, and examples.
>>
>> The original PR (#222) on incubator-spark is now closed.
>>
>> Please have a look; pull requests are very welcome.
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Announcing the official Spark Job Server repo

2014-03-18 Thread Evan Chan
Andy,

Yeah, we've thought of deploying this on Marathon ourselves, but we're
not sure how much Mesos we're going to use yet.   (Indeed if you look
at bin/server_start.sh, I think I set up the PORT environment var
specifically for Marathon.)This is also why we have deploy scripts
which package into .tar.gz, again for Mesos deployment.

If you do try this, please let us know.  :)

-Evan


On Tue, Mar 18, 2014 at 3:57 PM, andy petrella  wrote:
> tad! That's awesome.
>
> A quick question, does someone has insights regarding having such
> JobServers deployed using Marathon on Mesos?
>
> I'm thinking about an arch where Marathon would deploy and keep the Job
> Servers running along with part of the whole set of apps deployed on it
> regarding the resources needed (à la Jenkins).
>
> Any idea is welcome.
>
> Back to the news, Evan + Ooyala team: Great Job again.
>
> andy
>
> On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra 
> wrote:
>
>> W00t!
>>
>> Thanks for releasing this, Evan.
>>
>> - Henry
>>
>> On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan  wrote:
>> > Dear Spark developers,
>> >
>> > Ooyala is happy to announce that we have pushed our official, Spark
>> > 0.9.0 / Scala 2.10-compatible, job server as a github repo:
>> >
>> > https://github.com/ooyala/spark-jobserver
>> >
>> > Complete with unit tests, deploy scripts, and examples.
>> >
>> > The original PR (#222) on incubator-spark is now closed.
>> >
>> > Please have a look; pull requests are very welcome.
>> > --
>> > --
>> > Evan Chan
>> > Staff Engineer
>> > e...@ooyala.com  |
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Announcing the official Spark Job Server repo

2014-03-18 Thread Evan Chan
Dear Spark developers,

Ooyala is happy to announce that we have pushed our official, Spark
0.9.0 / Scala 2.10-compatible, job server as a github repo:

https://github.com/ooyala/spark-jobserver

Complete with unit tests, deploy scripts, and examples.

The original PR (#222) on incubator-spark is now closed.

Please have a look; pull requests are very welcome.
-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-14 Thread Evan Chan
Jsuereth -

Thanks for jumping on this thread.  There are a couple things which
would help SBT IMHO (w.r.t. issues mentioned here):

- A way of generating a project-wide pom.xml  (I believe make-pom only
generates it for each sub-project)
   - and probably better or more feature-complete POMs, but the Maven
folks here can speak to that
- Make the sbt-pom-reader plugin officially part of SBT, and make it
work well (again the Maven folks need to jump in here, though plugins
can't be directly translated)
- Have the sbt-assembly plugin officially maintained by Typesafe, or
made part of SBT itself; most Maven folks expect not to have to include a
plugin to generate a fat jar, and it's a pretty essential plugin for
just about every SBT project.
- Also there is no equivalent (AFAIK) to the maven shader plugin.

I also wish that the dependency-graph plugin was included by default,
but that's just me  :)

-Evan


On Fri, Mar 14, 2014 at 6:47 AM, jsuereth  wrote:
> Hey guys -
>
> If there's anything we can do to improve the sbt experience, let me know.
> I'd be extremely interested to see how/where there are issues integrating
> sbt with the existing Hadoop ecosystem.
>
> Particularly the difficulties in using Sbt + Maven together (something which
> tends block more than just spark from adopting sbt).
>
> I'm more than happy to listen and see what we can do on the sbt side to make
> this as seamless as possible for all parties.
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-tp2315p5682.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: spark config params conventions

2014-03-12 Thread Evan Chan
+1.

Not just for Typesafe Config, but if we want to consider hierarchical
configs like JSON rather than flat key mappings, it is necessary.  It
is also clearer.
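
(For the record, a minimal sketch - not from the original thread - of how
the fully hierarchical key names quoted below parse cleanly with Typesafe
Config; with a leaf key like "enabled", no path has to be both a value and
a container:)

import com.typesafe.config.ConfigFactory

// Every setting lives under its own leaf key, so parsing and lookups are
// unambiguous.
val conf = ConfigFactory.parseString("""
  spark.speculation.enabled  = true
  spark.speculation.interval = 0.5
""")

val speculation = conf.getBoolean("spark.speculation.enabled")   // true
val interval    = conf.getDouble("spark.speculation.interval")   // 0.5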

On Wed, Mar 12, 2014 at 9:58 AM, Aaron Davidson  wrote:
> Should we try to deprecate these types of configs for 1.0.0? We can start
> by accepting both and giving a warning if you use the old one, and then
> actually remove them in the next minor release. I think
> "spark.speculation.enabled=true" is better than "spark.speculation=true",
> and if we decide to use typesafe configs again ourselves, this change is
> necessary.
>
> We actually don't have to ever complete the deprecation - we can always
> accept both spark.speculation and spark.speculation.enabled, and people
> just have to use the latter if they want to use typesafe config.
>
>
> On Wed, Mar 12, 2014 at 9:24 AM, Mark Hamstra wrote:
>
>> That's the whole reason why some of the intended configuration changes
>> were backed out just before the 0.9.0 release.  It's a well-known issue,
>> even if a completely satisfactory solution isn't as well-known and is
>> probably something which should do another iteration on.
>>
>>
>> On Wed, Mar 12, 2014 at 9:10 AM, Koert Kuipers  wrote:
>>
>>> i am reading the spark configuration params from another configuration
>>> object (typesafe config) before setting them as system properties.
>>>
>>> i noticed typesafe config has trouble with settings like:
>>> spark.speculation=true
>>> spark.speculation.interval=0.5
>>>
>>> the issue seems to be that if spark.speculation is a "container" that has
>>> more values inside then it cannot be also a value itself, i think. so this
>>> would work fine:
>>> spark.speculation.enabled=true
>>> spark.speculation.interval=0.5
>>>
>>> just a heads up. i would probably suggest we avoid this situation.
>>>
>>
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Spark 0.9.0 and log4j

2014-03-07 Thread Evan Chan
Hey guys,

This is a follow-up to this semi-recent thread:
http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html

0.9.0 final is causing issues for us as well because we use Logback as
our backend and Spark requires Log4j now.

I see Patrick has PR #560 against incubator-spark; was that merged in or
left out?

Also, I see references to a new PR that might fix this, but I can't
seem to find it on the GitHub open PR page.   Anybody have a link?

As a last resort we can switch to Log4j, but would rather not have to
do that if possible.

thanks,
Evan

-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: special case of custom partitioning

2014-03-06 Thread Evan Chan
I would love to hear the answer to this as well.

On Thu, Mar 6, 2014 at 4:09 AM, Manoj Awasthi  wrote:
> Hi All,
>
>
> I have a three machine cluster. I have two RDDs each consisting of (K,V)
> pairs. RDDs have just three keys 'a', 'b' and 'c'.
>
> // list1 - List(('a',1), ('b',2), 
> val rdd1 = sc.parallelize(list1).groupByKey(new HashPartitioner(3))
>
> // list2 - List(('a',2), ('b',7), 
> val rdd2 = sc.parallelize(list2).groupByKey(new HashPartitioner(3))
>
> By using a HashPartitioner with 3 partitions I can achieve that each of the
> keys ('a', 'b' and 'c') in each RDD gets partitioned on different machines
> on cluster (based on the hashCode).
>
> Problem is that I cannot deterministically do the same allocation for
> second RDD? (all 'a's from rdd2 going to the same machine where 'a's from
> first RDD went to).
>
> Is there a way to achieve this?
>
> Manoj
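
(For context, a minimal sketch - illustrative only, not an answer given in
this thread - of the usual co-partitioning pattern: sharing one Partitioner
guarantees that a key such as 'a' lands in the same partition index of both
RDDs, and a cogroup/join over co-partitioned RDDs then combines
corresponding partitions in a single task with no extra shuffle. It does
not, by itself, pin a partition to a particular machine.)

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(
  new SparkConf().setAppName("copartition-sketch").setMaster("local[3]"))

// Reuse one partitioner instance so both RDDs are partitioned identically.
val part = new HashPartitioner(3)

val rdd1 = sc.parallelize(List(('a', 1), ('b', 2), ('c', 3))).groupByKey(part)
val rdd2 = sc.parallelize(List(('a', 2), ('b', 7), ('c', 9))).groupByKey(part)

// Same partitioner on both sides => narrow dependency: partition i of rdd1
// and partition i of rdd2 are consumed by the same task, without reshuffling.
rdd1.cogroup(rdd2).collect().foreach(println)

sc.stop()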



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


New blog post on Spark + Parquet + Scrooge

2014-02-28 Thread Evan Chan
http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark

Enjoy!

By the way, I was not able to subscribe to the user-digest list for
some reason.   The help email claims:


Similar addresses exist for the digest list:
   
   


But if you send an email to the digest-subscribe address, it bounces back
with the help email.


-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


New JIRA ticket: cleaning up app-* folders

2014-02-28 Thread Evan Chan
Hey guys,

FYI, I just filed a new ticket,
https://spark-project.atlassian.net/browse/SPARK-1154,
about the fact that Spark leaves tons of app-* folders on disk, which can
quickly fill up the disk.
Feel free to comment.

I believe it is better for Spark to clean itself up; i.e., users should
not need a cron job to clean up old folders.

thanks,
-Evan

-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark JIRA

2014-02-28 Thread Evan Chan
Ok, will continue to work with the existing one for now.

On Fri, Feb 28, 2014 at 11:36 AM, Josh Rosen  wrote:
> Apache INFRA has been trying to import our JIRA issues since June 2013:
> https://issues.apache.org/jira/browse/INFRA-6419.  It seems that JIRA
> introduces minor import incompatibilities with every update to JIRA
> OnDemand, so it's been a challenge to get the Apache and Atlasssian JIRA
> versions to sync up.
>
>
>
>
> On Fri, Feb 28, 2014 at 11:36 AM, Nan Zhu  wrote:
>
>> I think they are working on it?
>> https://issues.apache.org/jira/browse/SPARK
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>>
>> On Friday, February 28, 2014 at 2:29 PM, Evan Chan wrote:
>>
>> > Hey guys,
>> >
>> > There is no plan to move the Spark JIRA from the current
>> > https://spark-project.atlassian.net/
>> >
>> > right?
>> >
>> > --
>> > --
>> > Evan Chan
>> > Staff Engineer
>> > e...@ooyala.com (mailto:e...@ooyala.com) |
>> >
>> >
>>
>>
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Spark JIRA

2014-02-28 Thread Evan Chan
Hey guys,

There is no plan to move the Spark JIRA from the current
https://spark-project.atlassian.net/

right?

-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Evan Chan
Can't Maven POMs include other ones?  So what if we remove the
artifact specs from the main pom, have them generated by sbt make-pom,
and include the generated file in the main pom.xml?    I guess I'm just
trying to figure out how much this would help (it seems at least it
would remove the issue of maintaining and translating dependencies and
exclusions).   If the burden of maintaining the plugins turns out to
be the heavier commitment, then maybe it's not worth it.

On Wed, Feb 26, 2014 at 11:55 AM, Mark Hamstra  wrote:
> Yes, but the POM generated in that fashion is only sufficient for linking
> with Spark, not for building Spark or serving as a basis from which to
> build a customized Spark with Maven.  So, starting from SparkBuild.scala
> and generating a POM with make-pom, those who wish to build a customized
> Spark with Maven will have to figure out how to add various Maven plugins
> and other stuff to the generated POM to actually have something useful.
>  Going the other way, starting from a POM that is sufficient to build Spark
> and generating an SBT build with sbt-pom-reader, the Maven plugins in the
> POM appear to be ignored cleanly, but then the developer wishing to build
> Spark using SBT has the burden of figuring out how to add the equivalent of
> the Maven plugins in order to build the assemblies, among other things.
>  Neither way looks completely obvious to me to do programmatically.  Either
> should be do-able given sufficient development and maintenance resources,
> but that could be a pretty heavy commitment (and when Josh Suereth says wrt
> to sbt-pom-reader that mapping maven plugins into sbt is practically a
> failed task, I have every expectation that generating a completely
> satisfactory SBT build from a Maven build would be quite challenging.)
>
>
> On Wed, Feb 26, 2014 at 11:34 AM, Evan Chan  wrote:
>
>> Mark,
>>
>> No, I haven't tried this myself yet  :-p   Also I would expect that
>> sbt-pom-reader does not do assemblies at all  because that is an
>> SBT plugin, so we would still need code to include sbt-assembly.
>> There is also the trick question of how to include the assembly stuff
>> into sbt-pom-reader generated projects.  So, needs much more
>> investigation.
>>
>> My hunch is that it's easier to generate the pom from SBT (make-pom)
>> than the other way around.
>>
>> On Wed, Feb 26, 2014 at 10:54 AM, Mark Hamstra 
>> wrote:
>> > Evan,
>> >
>> > Have you actually tried to build Spark using its POM file and
>> sbt-pom-reader?
>> >  I just made a first, naive attempt, and I'm still sorting through just
>> > what this did and didn't produce.  It looks like the basic jar files are
>> at
>> > least very close to correct, and may be just fine, but that building the
>> > assembly jars failed completely.
>> >
>> > It's not completely obvious to me how to proceed with what sbt-pom-reader
>> > produces in order build the assemblies, run the test suites, etc., so I'm
>> > wondering if you have already worked out what that requires?
>> >
>> >
>> > On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan  wrote:
>> >
>> >> I'd like to propose the following way to move forward, based on the
>> >> comments I've seen:
>> >>
>> >> 1.  Aggressively clean up the giant dependency graph.   One ticket I
>> >> might work on if I have time is SPARK-681 which might remove the giant
>> >> fastutil dependency (~15MB by itself).
>> >>
>> >> 2.  Take an intermediate step by having only ONE source of truth
>> >> w.r.t. dependencies and versions.  This means either:
>> >>a)  Using a maven POM as the spec for dependencies, Hadoop version,
>> >> etc.   Then, use sbt-pom-reader to import it.
>> >>b)  Using the build.scala as the spec, and "sbt make-pom" to
>> >> generate the pom.xml for the dependencies
>> >>
>> >> The idea is to remove the pain and errors associated with manual
>> >> translation of dependency specs from one system to another, while
>> >> still maintaining the things which are hard to translate (plugins).
>> >>
>> >>
>> >> On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers 
>> wrote:
>> >> > We maintain in house spark build using sbt. We have no problem using
>> sbt
>> >> > assembly. We did add a few exclude statements for transitive
>> >> dependencies.
>> >> >
>> >> > The main enemy of assemblies are jars that

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Evan Chan
Mark,

No, I haven't tried this myself yet  :-p   Also, I would expect that
sbt-pom-reader does not handle assemblies at all, because assembly is done
by an SBT plugin, so we would still need code to include sbt-assembly.
There is also the tricky question of how to include the assembly settings
in sbt-pom-reader-generated projects.  So this needs much more
investigation.

My hunch is that it's easier to generate the pom from SBT (make-pom)
than the other way around.

On Wed, Feb 26, 2014 at 10:54 AM, Mark Hamstra  wrote:
> Evan,
>
> Have you actually tried to build Spark using its POM file and sbt-pom-reader?
>  I just made a first, naive attempt, and I'm still sorting through just
> what this did and didn't produce.  It looks like the basic jar files are at
> least very close to correct, and may be just fine, but that building the
> assembly jars failed completely.
>
> It's not completely obvious to me how to proceed with what sbt-pom-reader
> produces in order build the assemblies, run the test suites, etc., so I'm
> wondering if you have already worked out what that requires?
>
>
> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan  wrote:
>
>> I'd like to propose the following way to move forward, based on the
>> comments I've seen:
>>
>> 1.  Aggressively clean up the giant dependency graph.   One ticket I
>> might work on if I have time is SPARK-681 which might remove the giant
>> fastutil dependency (~15MB by itself).
>>
>> 2.  Take an intermediate step by having only ONE source of truth
>> w.r.t. dependencies and versions.  This means either:
>>a)  Using a maven POM as the spec for dependencies, Hadoop version,
>> etc.   Then, use sbt-pom-reader to import it.
>>b)  Using the build.scala as the spec, and "sbt make-pom" to
>> generate the pom.xml for the dependencies
>>
>> The idea is to remove the pain and errors associated with manual
>> translation of dependency specs from one system to another, while
>> still maintaining the things which are hard to translate (plugins).
>>
>>
>> On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers  wrote:
>> > We maintain in house spark build using sbt. We have no problem using sbt
>> > assembly. We did add a few exclude statements for transitive
>> dependencies.
>> >
>> > The main enemy of assemblies are jars that include stuff they shouldn't
>> > (kryo comes to mind, I think they include logback?), new versions of jars
>> > that change the provider/artifact without changing the package (asm), and
>> > incompatible new releases (protobuf). These break the transitive
>> resolution
>> > process. I imagine that's true for any build tool.
>> >
>> > Besides shading I don't see anything maven can do sbt cannot, and if I
>> > understand it correctly shading is not done currently using the build
>> tool.
>> >
>> > Since spark is primarily scala/akka based the main developer base will be
>> > familiar with sbt (I think?). Switching build tool is always painful. I
>> > personally think it is smarter to put this burden on a limited number of
>> > upstream integrators than on the community. However that said I don't
>> think
>> > its a problem for us to maintain an sbt build in-house if spark switched
>> to
>> > maven.
>> > The problem is, the complete spark dependency graph is fairly large,
>> > and there are lot of conflicting versions in there.
>> > In particular, when we bump versions of dependencies - making managing
>> > this messy at best.
>> >
>> > Now, I have not looked in detail at how maven manages this - it might
>> > just be accidental that we get a decent out-of-the-box assembled
>> > shaded jar (since we dont do anything great to configure it).
>> > With current state of sbt in spark, it definitely is not a good
>> > solution : if we can enhance it (or it already is ?), while keeping
>> > the management of the version/dependency graph manageable, I dont have
>> > any objections to using sbt or maven !
>> > Too many exclude versions, pinned versions, etc would just make things
>> > unmanageable in future.
>> >
>> >
>> > Regards,
>> > Mridul
>> >
>> >
>> >
>> >
>> > On Wed, Feb 26, 2014 at 8:56 AM, Evan chan  wrote:
>> >> Actually you can control exactly how sbt assembly merges or resolves
>> > conflicts.  I believe the default settings however lead to order which
>> > cannot be controlled.
>> >>
>> >

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Evan Chan
I'd like to propose the following way to move forward, based on the
comments I've seen:

1.  Aggressively clean up the giant dependency graph.   One ticket I
might work on if I have time is SPARK-681, which might remove the giant
fastutil dependency (~15MB by itself).

2.  Take an intermediate step by having only ONE source of truth
w.r.t. dependencies and versions.  This means either:
   a)  Using a maven POM as the spec for dependencies, Hadoop version,
etc.   Then, use sbt-pom-reader to import it.
   b)  Using the build.scala as the spec, and "sbt make-pom" to
generate the pom.xml for the dependencies

The idea is to remove the pain and errors associated with manual
translation of dependency specs from one system to another, while
still maintaining the things which are hard to translate (plugins).
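
(To make option (b) concrete, here is a rough sketch - project names,
versions, and dependencies are assumptions, not the real SparkBuild - of
declaring the dependency list once in project/Build.scala and letting
"sbt make-pom" emit a pom.xml that carries the same artifacts for Maven
users:)

import sbt._
import Keys._

object ExampleBuild extends Build {
  // Single source of truth for artifacts and versions; "sbt make-pom"
  // generates a POM from this list, so nothing is translated by hand.
  lazy val core = Project("example-core", file("core")).settings(
    organization := "org.example",
    scalaVersion := "2.10.3",
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided",
      "com.typesafe"      % "config"        % "1.2.0"
    )
  )
}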


On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers  wrote:
> We maintain in house spark build using sbt. We have no problem using sbt
> assembly. We did add a few exclude statements for transitive dependencies.
>
> The main enemy of assemblies are jars that include stuff they shouldn't
> (kryo comes to mind, I think they include logback?), new versions of jars
> that change the provider/artifact without changing the package (asm), and
> incompatible new releases (protobuf). These break the transitive resolution
> process. I imagine that's true for any build tool.
>
> Besides shading I don't see anything maven can do sbt cannot, and if I
> understand it correctly shading is not done currently using the build tool.
>
> Since spark is primarily scala/akka based the main developer base will be
> familiar with sbt (I think?). Switching build tool is always painful. I
> personally think it is smarter to put this burden on a limited number of
> upstream integrators than on the community. However that said I don't think
> its a problem for us to maintain an sbt build in-house if spark switched to
> maven.
> The problem is, the complete spark dependency graph is fairly large,
> and there are lot of conflicting versions in there.
> In particular, when we bump versions of dependencies - making managing
> this messy at best.
>
> Now, I have not looked in detail at how maven manages this - it might
> just be accidental that we get a decent out-of-the-box assembled
> shaded jar (since we dont do anything great to configure it).
> With current state of sbt in spark, it definitely is not a good
> solution : if we can enhance it (or it already is ?), while keeping
> the management of the version/dependency graph manageable, I dont have
> any objections to using sbt or maven !
> Too many exclude versions, pinned versions, etc would just make things
> unmanageable in future.
>
>
> Regards,
> Mridul
>
>
>
>
> On Wed, Feb 26, 2014 at 8:56 AM, Evan chan  wrote:
>> Actually you can control exactly how sbt assembly merges or resolves
> conflicts.  I believe the default settings however lead to order which
> cannot be controlled.
>>
>> I do wish for a smarter fat jar plugin.
>>
>> -Evan
>> To be free is not merely to cast off one's chains, but to live in a way
> that respects & enhances the freedom of others. (#NelsonMandela)
>>
>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan 
> wrote:
>>>
>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell 
> wrote:
>>>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
>>>> right now we don't actually use it for bytecode shading - we simply
>>>> use it for creating the uber jar with excludes (which sbt supports
>>>> just fine via assembly).
>>>
>>>
>>> Not really - as I mentioned initially in this thread, sbt's assembly
>>> does not take dependencies into account properly : and can overwrite
>>> newer classes with older versions.
>>> From an assembly point of view, sbt is not very good : we are yet to
>>> try it after 2.10 shift though (and probably wont, given the mess it
>>> created last time).
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> I was wondering actually, do you know if it's possible to added shaded
>>>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
>>>> jar)? That's something I could see being really handy in the future.
>>>>
>>>> - Patrick
>>>>
>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan  wrote:
>>>>> The problem is that plugins are not equivalent.  There is AFAIK no
>>>>> equivalent to the maven shader plugin for SBT.
>>>&

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-25 Thread Evan chan
Actually you can control exactly how sbt-assembly merges or resolves
conflicts.  I believe the default settings, however, lead to a merge order
that cannot be controlled.
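
(For what it's worth, a small sketch of taking explicit control of the merge
behaviour per path; the key names follow sbt-assembly's newer autoplugin
style and may differ for older plugin versions - treat them as an
assumption, and this is not taken from Spark's build:)

import sbtassembly.AssemblyPlugin.autoImport._

// build.sbt fragment: decide per path what happens when several jars
// contribute the same entry, instead of relying on the default ordering.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop manifests/signatures
  case "log4j.properties"            => MergeStrategy.discard  // keep our own logging config
  case _                             => MergeStrategy.first    // otherwise take the first copy
}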

I do wish for a smarter fat jar plugin.  

-Evan
To be free is not merely to cast off one's chains, but to live in a way that 
respects & enhances the freedom of others. (#NelsonMandela)

> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan  wrote:
> 
>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell  wrote:
>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
>> right now we don't actually use it for bytecode shading - we simply
>> use it for creating the uber jar with excludes (which sbt supports
>> just fine via assembly).
> 
> 
> Not really - as I mentioned initially in this thread, sbt's assembly
> does not take dependencies into account properly : and can overwrite
> newer classes with older versions.
> From an assembly point of view, sbt is not very good : we are yet to
> try it after 2.10 shift though (and probably wont, given the mess it
> created last time).
> 
> Regards,
> Mridul
> 
> 
> 
> 
> 
>> 
>> I was wondering actually, do you know if it's possible to added shaded
>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
>> jar)? That's something I could see being really handy in the future.
>> 
>> - Patrick
>> 
>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan  wrote:
>>> The problem is that plugins are not equivalent.  There is AFAIK no
>>> equivalent to the maven shader plugin for SBT.
>>> There is an SBT plugin which can apparently read POM XML files
>>> (sbt-pom-reader).   However, it can't possibly handle plugins, which
>>> is still problematic.
>>> 
>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao  wrote:
>>>> I would prefer keep both of them, it would be better even if that means
>>>> pom.xml will be generated using sbt. Some company, like my current one,
>>>> have their own build infrastructures built on top of maven. It is not easy
>>>> to support sbt for these potential spark clients. But I do agree to only
>>>> keep one if there is a promising way to generate correct configuration from
>>>> the other.
>>>> 
>>>> -Shengzhe
>>>> 
>>>> 
>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan  wrote:
>>>>> 
>>>>> The correct way to exclude dependencies in SBT is actually to declare
>>>>> a dependency as "provided".   I'm not familiar with Maven or its
>>>>> dependencySet, but provided will mark the entire dependency tree as
>>>>> excluded.   It is also possible to exclude jar by jar, but this is
>>>>> pretty error prone and messy.
>>>>> 
>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers  wrote:
>>>>>> yes in sbt assembly you can exclude jars (although i never had a need for
>>>>>> this) and files in jars.
>>>>>> 
>>>>>> for example i frequently remove log4j.properties, because for whatever
>>>>>> reason hadoop decided to include it making it very difficult to use our
>>>>> own
>>>>>> logging config.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik 
>>>>>> wrote:
>>>>>> 
>>>>>>>> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is
>>>>>>>> available in maven and not in sbt for these issues? I took a look at
>>>>>>>> the bigtop code relating to Spark. As far as I could tell [1] was the
>>>>>>>> main point of integration with the build system (maybe there are other
>>>>>>>> integration points)?
>>>>>>>> 
>>>>>>>>>  - in order to integrate Spark well into existing Hadoop stack it
>>>>> was
>>>>>>>>>necessary to have a way to avoid transitive dependencies
>>>>>>> duplications and
>>>>>>>>>possible conflicts.
>>>>>>>>> 
>>>>>>>>>E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs
>>>>>>> and later
>>>>>>>>>merely declare Spark package dependency on st

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-25 Thread Evan Chan
Sandy, I believe the sbt-pom-reader plugin might work very well for
this exact use case.   Otherwise, the SBT build file is just Scala
code, so it can easily read the pom XML directly if needed and parse
stuff out.
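
(As a rough illustration of that last point - the property and artifact
names here are hypothetical, and this assumes the scala.xml API that ships
with Scala 2.10 - an SBT build can lift a version straight out of a root
POM so both builds share one number:)

import scala.xml.XML

// Read a version property out of the root pom.xml so the SBT build and the
// Maven build agree on a single number (property name is made up).
val rootPom      = XML.loadFile("pom.xml")
val slf4jVersion = (rootPom \ "properties" \ "slf4j.version").text

libraryDependencies += "org.slf4j" % "slf4j-api" % slf4jVersion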

On Tue, Feb 25, 2014 at 4:36 PM, Sandy Ryza  wrote:
> To perhaps restate what some have said, Maven is by far the most common
> build tool for the Hadoop / JVM data ecosystem.  While Maven is less pretty
> than SBT, expertise in it is abundant.  SBT requires contributors to
> projects in the ecosystem to learn yet another tool.  If we think of Spark
> as a project in that ecosystem that happens to be in Scala, as opposed to a
> Scala project that happens to be part of that ecosystem, Maven seems like
> the better choice to me.
>
> On a CDH-specific note, in building CDH, one of the reasons Maven is
> helpful to us is that it makes it easy to harmonize dependency versions
> across projects.  We modify project poms to include the "CDH" pom as a root
> pom, allowing each project to reference variables defined in the root pom
> like ${cdh.slf4j.version}.  Is there a way to make an SBT project inherit
> from a Maven project that would allow this kind of thing?
>
> -Sandy
>
>
> On Tue, Feb 25, 2014 at 4:23 PM, Evan Chan  wrote:
>
>> Hi Patrick,
>>
>> If you include shaded dependencies inside of the main Spark jar, such
>> that it would have combined classes from all dependencies, wouldn't
>> you end up with a sub-assembly jar?  That would be dangerous: since it
>> is a single unit, it would break the normal packaging assumption that
>> the jar contains only its own classes, with maven/sbt/ivy/etc. resolving
>> the remaining deps. But maybe I don't know what you mean.
>>
>> The shader plugin in Maven is apparently used to
>> 1) build uber jars  - this is the part that sbt-assembly also does
>> 2) "shade" existing jars, i.e. rename the classes and rewrite bytecode
>> that depends on them so that they don't conflict with other jars having
>> the same classes  -- this is something sbt-assembly doesn't do, which
>> you point out is done manually.
>>
>>
>>
>> On Tue, Feb 25, 2014 at 4:09 PM, Patrick Wendell 
>> wrote:
>> > What I mean is this. AFAIK the shader plug-in is primarily designed
>> > for creating uber jars which contain spark and all dependencies. But
>> > since Spark is something people depend on in Maven, what I actually
>> > want is to create the normal old Spark jar [1], but then include
>> > shaded versions of some of our dependencies inside of it. Not sure if
>> > that's even possible.
>> >
>> > The way we do shading now is we manually publish shaded versions of
>> > some dependencies to maven central as their own artifacts.
>> >
>> >
>> http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-core_2.10/0.9.0-incubating/spark-core_2.10-0.9.0-incubating.jar
>> >
>> > On Tue, Feb 25, 2014 at 4:04 PM, Evan Chan  wrote:
>> >> Patrick -- not sure I understand your request, do you mean
>> >> - somehow creating a shaded jar (eg with maven shader plugin)
>> >> - then including it in the spark jar (which would then be an assembly)?
>> >>
>> >> On Tue, Feb 25, 2014 at 4:01 PM, Patrick Wendell 
>> wrote:
>> >>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
>> >>> right now we don't actually use it for bytecode shading - we simply
>> >>> use it for creating the uber jar with excludes (which sbt supports
>> >>> just fine via assembly).
>> >>>
>> >>> I was wondering actually, do you know if it's possible to add shaded
>> >>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
>> >>> jar)? That's something I could see being really handy in the future.
>> >>>
>> >>> - Patrick
>> >>>
>> >>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan  wrote:
>> >>>> The problem is that plugins are not equivalent.  There is AFAIK no
>> >>>> equivalent to the maven shader plugin for SBT.
>> >>>> There is an SBT plugin which can apparently read POM XML files
>> >>>> (sbt-pom-reader).   However, it can't possibly handle plugins, which
>> >>>> is still problematic.
>> >>>>
>> >>>> On Tue, Feb 25, 2014 at 3:31 PM, yao  wrote:
>> >>>>> I would prefer to keep both of them; it would be better even if that
>> means

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-25 Thread Evan Chan
Hi Patrick,

If you include shaded dependencies inside of the main Spark jar, such
that it would have combined classes from all dependencies, wouldn't
you end up with a sub-assembly jar?  That would be dangerous: since it
is a single unit, it would break the normal packaging assumption that
the jar contains only its own classes, with maven/sbt/ivy/etc. resolving
the remaining deps. But maybe I don't know what you mean.

The shader plugin in Maven is apparently used to
1) build uber jars  - this is the part that sbt-assembly also does
2) "shade" existing jars, i.e. rename the classes and rewrite bytecode
that depends on them so that they don't conflict with other jars having
the same classes  -- this is something sbt-assembly doesn't do, which
you point out is done manually.
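
As an aside, a minimal sketch of what (2) looks like in sbt build settings,
assuming the ShadeRule support that later sbt-assembly releases (0.14.x and
up) provide -- the package names here are placeholders, not Spark's actual
relocations:

  // Relocate Guava classes inside the assembly so they cannot clash with a
  // different Guava version on the user's classpath. "**" matches nested
  // packages and "@1" substitutes whatever the wildcard matched.
  assemblyShadeRules in assembly := Seq(
    ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
  )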



On Tue, Feb 25, 2014 at 4:09 PM, Patrick Wendell  wrote:
> What I mean is this. AFAIK the shader plug-in is primarily designed
> for creating uber jars which contain spark and all dependencies. But
> since Spark is something people depend on in Maven, what I actually
> want is to create the normal old Spark jar [1], but then include
> shaded versions of some of our dependencies inside of it. Not sure if
> that's even possible.
>
> The way we do shading now is we manually publish shaded versions of
> some dependencies to maven central as their own artifacts.
>
> http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-core_2.10/0.9.0-incubating/spark-core_2.10-0.9.0-incubating.jar
>
> On Tue, Feb 25, 2014 at 4:04 PM, Evan Chan  wrote:
>> Patrick -- not sure I understand your request, do you mean
>> - somehow creating a shaded jar (eg with maven shader plugin)
>> - then including it in the spark jar (which would then be an assembly)?
>>
>> On Tue, Feb 25, 2014 at 4:01 PM, Patrick Wendell  wrote:
>>> Evan - this is a good thing to bring up. Wrt the shader plug-in -
>>> right now we don't actually use it for bytecode shading - we simply
>>> use it for creating the uber jar with excludes (which sbt supports
>>> just fine via assembly).
>>>
>>> I was wondering actually, do you know if it's possible to add shaded
>>> artifacts to the *spark jar* using this plug-in (e.g. not an uber
>>> jar)? That's something I could see being really handy in the future.
>>>
>>> - Patrick
>>>
>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan  wrote:
>>>> The problem is that plugins are not equivalent.  There is AFAIK no
>>>> equivalent to the maven shader plugin for SBT.
>>>> There is an SBT plugin which can apparently read POM XML files
>>>> (sbt-pom-reader).   However, it can't possibly handle plugins, which
>>>> is still problematic.
>>>>
>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao  wrote:
>>>>> I would prefer to keep both of them; it would be better even if that means
>>>>> pom.xml will be generated using sbt. Some companies, like my current one,
>>>>> have their own build infrastructures built on top of maven. It is not easy
>>>>> to support sbt for these potential spark clients. But I do agree to only
>>>>> keep one if there is a promising way to generate correct configuration 
>>>>> from
>>>>> the other.
>>>>>
>>>>> -Shengzhe
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan  wrote:
>>>>>
>>>>>> The correct way to exclude dependencies in SBT is actually to declare
>>>>>> a dependency as "provided".   I'm not familiar with Maven or its
>>>>>> dependencySet, but provided will mark the entire dependency tree as
>>>>>> excluded.   It is also possible to exclude jar by jar, but this is
>>>>>> pretty error prone and messy.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers  wrote:
>>>>>> > yes in sbt assembly you can exclude jars (although i never had a need 
>>>>>> > for
>>>>>> > this) and files in jars.
>>>>>> >
>>>>>> > for example i frequently remove log4j.properties, because for whatever
>>>>>> > reason hadoop decided to include it making it very difficult to use our
>>>>>> own
>>>>>> > logging config.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik 
>>>>>> wrote:
>>

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-25 Thread Evan Chan
Patrick -- not sure I understand your request, do you mean
- somehow creating a shaded jar (eg with maven shader plugin)
- then including it in the spark jar (which would then be an assembly)?

On Tue, Feb 25, 2014 at 4:01 PM, Patrick Wendell  wrote:
> Evan - this is a good thing to bring up. Wrt the shader plug-in -
> right now we don't actually use it for bytecode shading - we simply
> use it for creating the uber jar with excludes (which sbt supports
> just fine via assembly).
>
> I was wondering actually, do you know if it's possible to add shaded
> artifacts to the *spark jar* using this plug-in (e.g. not an uber
> jar)? That's something I could see being really handy in the future.
>
> - Patrick
>
> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan  wrote:
>> The problem is that plugins are not equivalent.  There is AFAIK no
>> equivalent to the maven shader plugin for SBT.
>> There is an SBT plugin which can apparently read POM XML files
>> (sbt-pom-reader).   However, it can't possibly handle plugins, which
>> is still problematic.
>>
>> On Tue, Feb 25, 2014 at 3:31 PM, yao  wrote:
>>> I would prefer to keep both of them; it would be better even if that means
>>> pom.xml will be generated using sbt. Some companies, like my current one,
>>> have their own build infrastructures built on top of maven. It is not easy
>>> to support sbt for these potential spark clients. But I do agree to only
>>> keep one if there is a promising way to generate correct configuration from
>>> the other.
>>>
>>> -Shengzhe
>>>
>>>
>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan  wrote:
>>>
>>>> The correct way to exclude dependencies in SBT is actually to declare
>>>> a dependency as "provided".   I'm not familiar with Maven or its
>>>> dependencySet, but provided will mark the entire dependency tree as
>>>> excluded.   It is also possible to exclude jar by jar, but this is
>>>> pretty error prone and messy.
>>>>
>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers  wrote:
>>>> > yes in sbt assembly you can exclude jars (although i never had a need for
>>>> > this) and files in jars.
>>>> >
>>>> > for example i frequently remove log4j.properties, because for whatever
>>>> > reason hadoop decided to include it making it very difficult to use our
>>>> own
>>>> > logging config.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik 
>>>> wrote:
>>>> >
>>>> >> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
>>>> >> > Kos - thanks for chiming in. Could you be more specific about what is
>>>> >> > available in maven and not in sbt for these issues? I took a look at
>>>> >> > the bigtop code relating to Spark. As far as I could tell [1] was the
>>>> >> > main point of integration with the build system (maybe there are other
>>>> >> > integration points)?
>>>> >> >
>>>> >> > >   - in order to integrate Spark well into existing Hadoop stack it
>>>> was
>>>> >> > > necessary to have a way to avoid transitive dependencies
>>>> >> duplications and
>>>> >> > > possible conflicts.
>>>> >> > >
>>>> >> > > E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs
>>>> >> and later
>>>> >> > > merely declare Spark package dependency on standard Bigtop
>>>> Hadoop
>>>> >> > > packages. And yes - Bigtop packaging means the naming and layout
>>>> >> would be
>>>> >> > > standard across all commercial Hadoop distributions that are
>>>> worth
>>>> >> > > mentioning: ASF Bigtop convenience binary packages, and
>>>> Cloudera or
>>>> >> > > Hortonworks packages. Hence, the downstream user doesn't need to
>>>> >> spend any
>>>> >> > > effort to make sure that Spark "clicks-in" properly.
>>>> >> >
>>>> >> > The sbt build also allows you to plug in a Hadoop version similar to
>>>> >> > the maven build.
>>>> >>
>>>> >> I am actually talking about an ability to exclude a

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-25 Thread Evan Chan
The problem is that plugins are not equivalent.  There is AFAIK no
equivalent to the maven shader plugin for SBT.
There is an SBT plugin which can apparently read POM XML files
(sbt-pom-reader).   However, it can't possibly handle plugins, which
is still problematic.

On Tue, Feb 25, 2014 at 3:31 PM, yao  wrote:
> I would prefer to keep both of them; it would be better even if that means
> pom.xml will be generated using sbt. Some companies, like my current one,
> have their own build infrastructures built on top of maven. It is not easy
> to support sbt for these potential spark clients. But I do agree to only
> keep one if there is a promising way to generate correct configuration from
> the other.
>
> -Shengzhe
>
>
> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan  wrote:
>
>> The correct way to exclude dependencies in SBT is actually to declare
>> a dependency as "provided".   I'm not familiar with Maven or its
>> dependencySet, but provided will mark the entire dependency tree as
>> excluded.   It is also possible to exclude jar by jar, but this is
>> pretty error prone and messy.
>>
>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers  wrote:
>> > yes in sbt assembly you can exclude jars (although i never had a need for
>> > this) and files in jars.
>> >
>> > for example i frequently remove log4j.properties, because for whatever
>> > reason hadoop decided to include it making it very difficult to use our
>> own
>> > logging config.
>> >
>> >
>> >
>> > On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik 
>> wrote:
>> >
>> >> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
>> >> > Kos - thanks for chiming in. Could you be more specific about what is
>> >> > available in maven and not in sbt for these issues? I took a look at
>> >> > the bigtop code relating to Spark. As far as I could tell [1] was the
>> >> > main point of integration with the build system (maybe there are other
>> >> > integration points)?
>> >> >
>> >> > >   - in order to integrate Spark well into existing Hadoop stack it
>> was
>> >> > > necessary to have a way to avoid transitive dependencies
>> >> duplications and
>> >> > > possible conflicts.
>> >> > >
>> >> > > E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs
>> >> and later
>> >> > > merely declare Spark package dependency on standard Bigtop
>> Hadoop
>> >> > > packages. And yes - Bigtop packaging means the naming and layout
>> >> would be
>> >> > > standard across all commercial Hadoop distributions that are
>> worth
>> >> > > mentioning: ASF Bigtop convenience binary packages, and
>> Cloudera or
>> >> > > Hortonworks packages. Hence, the downstream user doesn't need to
>> >> spend any
>> >> > > effort to make sure that Spark "clicks-in" properly.
>> >> >
>> >> > The sbt build also allows you to plug in a Hadoop version similar to
>> >> > the maven build.
>> >>
>> >> I am actually talking about an ability to exclude a set of dependencies
>> >> from an
>> >> assembly, similarly to what's happening in dependencySet sections of
>> >> assembly/src/main/assembly/assembly.xml
>> >> If there is a comparable functionality in Sbt, that would help quite a
>> bit,
>> >> apparently.
>> >>
>> >> Cos
>> >>
>> >> > >   - Maven provides a relatively easy way to deal with the jar-hell
>> >> problem,
>> >> > > although the original maven build was just Shader'ing everything
>> >> into a
>> >> > > huge lump of class files. Oftentimes ending up with classes
>> >> slamming on
>> >> > > top of each other from different transitive dependencies.
>> >> >
>> >> > AFAIK we are only using the shade plug-in to deal with conflict
>> >> > resolution in the assembly jar. These are dealt with in sbt via the
>> >> > sbt assembly plug-in in an identical way. Is there a difference?
>> >>
>> >> I am bringing up the Shader, because it is an awful hack which can't
>> >> be used in a real, controlled deployment.
>> >>
>> >> Cos
>> >>
>> >> > [1]
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>> >>
>>
>>
>>
>> --
>> --
>> Evan Chan
>> Staff Engineer
>> e...@ooyala.com  |
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-25 Thread Evan Chan
The correct way to exclude dependencies in SBT is actually to declare
a dependency as "provided".   I'm not familiar with Maven or its
dependencySet, but provided will mark the entire dependency tree as
excluded.   It is also possible to exclude jar by jar, but this is
pretty error prone and messy.
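
A minimal build.sbt sketch of both techniques, assuming a reasonably recent
sbt-assembly (key names have shifted slightly across its releases) and a
placeholder Hadoop artifact; it also shows the per-file discard that Koert
describes below:

  // Compile against Hadoop but keep its whole dependency tree out of the
  // assembly -- the cluster provides these classes at runtime.
  libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"

  // Drop a log4j.properties that some dependency bundles, so our own
  // logging config wins. Everything else falls through to the old strategy.
  assemblyMergeStrategy in assembly := {
    case "log4j.properties" => MergeStrategy.discard
    case x =>
      val oldStrategy = (assemblyMergeStrategy in assembly).value
      oldStrategy(x)
  }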

On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers  wrote:
> yes in sbt assembly you can exclude jars (although i never had a need for
> this) and files in jars.
>
> for example i frequently remove log4j.properties, because for whatever
> reason hadoop decided to include it making it very difficult to use our own
> logging config.
>
>
>
> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik  wrote:
>
>> On Fri, Feb 21, 2014 at 11:11AM, Patrick Wendell wrote:
>> > Kos - thanks for chiming in. Could you be more specific about what is
>> > available in maven and not in sbt for these issues? I took a look at
>> > the bigtop code relating to Spark. As far as I could tell [1] was the
>> > main point of integration with the build system (maybe there are other
>> > integration points)?
>> >
>> > >   - in order to integrate Spark well into existing Hadoop stack it was
>> > > necessary to have a way to avoid transitive dependencies
>> duplications and
>> > > possible conflicts.
>> > >
>> > > E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs
>> and later
>> > > merely declare Spark package dependency on standard Bigtop Hadoop
>> > > packages. And yes - Bigtop packaging means the naming and layout
>> would be
>> > > standard across all commercial Hadoop distributions that are worth
>> > > mentioning: ASF Bigtop convenience binary packages, and Cloudera or
>> > > Hortonworks packages. Hence, the downstream user doesn't need to
>> spend any
>> > > effort to make sure that Spark "clicks-in" properly.
>> >
>> > The sbt build also allows you to plug in a Hadoop version similar to
>> > the maven build.
>>
>> I am actually talking about an ability to exclude a set of dependencies
>> from an
>> assembly, similarly to what's happening in dependencySet sections of
>> assembly/src/main/assembly/assembly.xml
>> If there is a comparable functionality in Sbt, that would help quite a bit,
>> apparently.
>>
>> Cos
>>
>> > >   - Maven provides a relatively easy way to deal with the jar-hell
>> problem,
>> > > although the original maven build was just Shader'ing everything
>> into a
>> > > huge lump of class files. Oftentimes ending up with classes
>> slamming on
>> > > top of each other from different transitive dependencies.
>> >
>> > AFAIK we are only using the shade plug-in to deal with conflict
>> > resolution in the assembly jar. These are dealt with in sbt via the
>> > sbt assembly plug-in in an identical way. Is there a difference?
>>
>> I am bringing up the Shader, because it is an awful hack which can't
>> be used in a real, controlled deployment.
>>
>> Cos
>>
>> > [1]
>> https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>>



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |