Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread Dean Wampler
Congrats, Jose!


*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*


On Tue, Jan 29, 2019 at 12:52 PM Burak Yavuz  wrote:

> Congrats Jose!
>
> On Tue, Jan 29, 2019 at 10:50 AM Xiao Li  wrote:
>
>> Congratulations!
>>
>> Xiao
>>
>> Shixiong Zhu wrote on Tue, Jan 29, 2019 at 10:48 AM:
>>
>>> Hi all,
>>>
>>> The Apache Spark PMC recently added Jose Torres as a committer on the
>>> project. Jose has been a major contributor to Structured Streaming. Please
>>> join me in welcoming him!
>>>
>>> Best Regards,
>>>
>>> Shixiong Zhu
>>>
>>>


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Dean Wampler
I spoke with the Scala team at Lightbend. They plan to do a 2.13-RC1
release in January and GA a few months later. Of course, nothing is ever
certain. What's the thinking for the Spark 3.0 timeline? If it's likely to
be late Q1 or in Q2, then it might make sense to add Scala 2.13 as an
alternative Scala version.
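For anyone unfamiliar with what an "alternative Scala version" means in build
terms, here is the general sbt shape of cross-building for two versions; it is
purely illustrative (the version numbers are placeholders and Spark's real
build is more involved):

// build.sbt sketch, illustrative only; versions are placeholders.
scalaVersion := "2.12.8"
crossScalaVersions := Seq("2.12.8", "2.13.0")
// `sbt +compile` / `sbt +test` build against each listed version in turn.
// Version-specific code, if any is needed, can live in
// src/main/scala-2.12 and src/main/scala-2.13.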

dean


*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
and other content from O'Reilly
@deanwampler <http://twitter.com/deanwampler>
https://www.linkedin.com/in/deanwampler/
http://polyglotprogramming.com
https://github.com/deanwampler
https://www.flickr.com/photos/deanwampler/


On Tue, Nov 6, 2018 at 7:48 PM Sean Owen  wrote:

> That's possible here, sure. The issue is: would you exclude Scala 2.13
> support in 3.0 for this, if it were otherwise ready to go?
> I think it's not a hard rule that something has to be deprecated
> previously to be removed in a major release. The notice is helpful,
> sure, but there are lots of ways to provide that notice to end users.
> Lots of things are breaking changes in a major release. Or: deprecate
> in Spark 2.4.1, if desired?
>
> On Tue, Nov 6, 2018 at 7:36 PM Wenchen Fan  wrote:
> >
> > We made Scala 2.11 the default in Spark 2.0, then dropped Scala 2.10 in
> Spark 2.3. Shall we follow that pattern and drop Scala 2.11 at some point in
> the Spark 3.x line?
> >
> > On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin  wrote:
> >>
> >> Have we deprecated Scala 2.11 already in an existing release?
>
>
>


Re: Scala 2.12 support

2018-06-07 Thread Dean Wampler
Do the tests expect a particular console output order? That would annoy
them. ;) You could sort the expected and output lines, then diff...
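A minimal sketch of that idea, purely illustrative:

// Illustrative only: compare console output while ignoring line order.
def sameLinesIgnoringOrder(expected: String, actual: String): Boolean = {
  def norm(s: String): Seq[String] =
    s.split("\n").map(_.trim).filter(_.nonEmpty).sorted.toSeq
  norm(expected) == norm(actual)
}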


*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
and other content from O'Reilly
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
https://github.com/deanwampler

On Thu, Jun 7, 2018 at 5:09 PM, Holden Karau  wrote:

> If the difference is the order of the welcome message I think that should
> be fine.
>
> On Thu, Jun 7, 2018, 4:43 PM Dean Wampler  wrote:
>
>> I'll point the Scala team to this issue, but it's unlikely to get fixed
>> any time soon.
>>
>> dean
>>
>>
>> *Dean Wampler, Ph.D.*
>>
>> *VP, Fast Data Engineering at Lightbend*
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do>, Fast Data
>> Architectures for Streaming Applications
>> <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
>> and other content from O'Reilly
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>> https://github.com/deanwampler
>>
>> On Thu, Jun 7, 2018 at 4:27 PM, DB Tsai  wrote:
>>
>>> Thanks Felix for bringing this up.
>>>
>>> Currently, in Scala 2.11.8, we initialize Spark by overriding loadFiles()
>>> before the REPL sees any file, since there is no good hook in Scala to load
>>> our initialization code.
>>>
>>> In Scala 2.11.12 and newer versions of Scala 2.12.x, the loadFiles() method
>>> was removed.
>>>
>>> Alternatively, one way we can do this in the newer versions of Scala is by
>>> overriding initializeSynchronous(), as suggested by Som Snytt; I have a working
>>> PR with this approach,
>>> https://github.com/apache/spark/pull/21495 , and it should work for older
>>> versions of Scala too.
>>>
>>> However, in the newer versions of Scala the first thing the REPL calls is
>>> printWelcome, so with this approach the welcome message is shown first and
>>> then the URL of the Spark UI. This causes UI inconsistencies between
>>> different versions of Scala.
>>>
>>> We could also initialize Spark in printWelcome, which I feel is more hacky.
>>> It would only work for newer versions of Scala, since in older versions
>>> printWelcome is called at the end of the initialization process. If we
>>> decide to go this route, users basically cannot use Scala older than
>>> 2.11.9.
>>>
>>> I think this is also a blocker for moving to newer versions of Scala
>>> 2.12.x, since they have the same issue.
>>>
>>> In my opinion, Scala should fix the root cause and provide a stable hook
>>> for 3rd-party developers to initialize their custom code.
>>>
>>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
>>> Apple, Inc
>>>
>>> > On Jun 7, 2018, at 6:43 AM, Felix Cheung 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > Spoke to Dean as well and mentioned the problem with 2.11.12
>>> https://github.com/scala/bug/issues/10913
>>> >
>>> > _
>>> > From: Sean Owen 
>>> > Sent: Wednesday, June 6, 2018 12:23 PM
>>> > Subject: Re: Scala 2.12 support
>>> > To: Holden Karau 
>>> > Cc: Dean Wampler , Reynold Xin <
>>> r...@databricks.com>, dev 
>>> >
>>> >
>>> > If it means no change to 2.11 support, seems OK to me for Spark 2.4.0.
>>> The 2.12 support is separate and has never been mutually compatible with
>>> 2.11 builds anyway. (I also hope, suspect that the changes are minimal;
>>> tests are already almost entirely passing with no change to the closure
>>> cleaner when built for 2.12)
>>> >
>>> > On Wed, Jun 6, 2018 at 1:33 PM Holden Karau 
>>> wrote:
>>> > Just chatted with Dean @ the summit, and it sounds like from Adriaan
>>> there is a fix in 2.13 for the API change issue that could be backported
>>> to 2.12, so how about we try and get this ball rolling?
>>> >
>>> > It sounds like it would also need a closure cleaner change, which could
>>> be backwards compatible; but since it’s such a core component and we might
>>> want to be cautious with it, we could use the old cleaner code when building
>>> for 2.11 and the new code for 2.12, so we don’t break anyone.
>>> >
>>> > How do folks feel about this?
>>> >
>>> >
>>> >
>>>
>>>
>>
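For readers skimming the thread, here is a rough, hypothetical sketch of the
initializeSynchronous() approach DB Tsai describes in the quoted message above.
The helper name initializeSpark and the interpreted statement are illustrative
only; the real change is in the linked PR.

// Rough sketch only; see https://github.com/apache/spark/pull/21495 for the
// real change. Assumes scala.tools.nsc.interpreter.ILoop in 2.11.12+/2.12.x
// exposes the initializeSynchronous() hook described above.
import scala.tools.nsc.interpreter.ILoop

class SparkILoop extends ILoop {
  // Hypothetical helper: interpret the statements that create the SparkSession
  // and print the Spark UI URL.
  private def initializeSpark(): Unit = {
    intp.interpret(
      "@transient val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()")
  }

  // Run Spark's setup once the interpreter is ready, replacing the removed
  // loadFiles() override.
  override def initializeSynchronous(): Unit = {
    super.initializeSynchronous()
    initializeSpark()
  }
}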


Re: Scala 2.12 support

2018-06-07 Thread Dean Wampler
I'll point the Scala team to this issue, but it's unlikely to get fixed any
time soon.

dean


*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
and other content from O'Reilly
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
https://github.com/deanwampler

On Thu, Jun 7, 2018 at 4:27 PM, DB Tsai  wrote:

> Thanks Felix for bringing this up.
>
> Currently, in Scala 2.11.8, we initialize Spark by overriding loadFiles()
> before the REPL sees any file, since there is no good hook in Scala to load
> our initialization code.
>
> In Scala 2.11.12 and newer versions of Scala 2.12.x, the loadFiles() method
> was removed.
>
> Alternatively, one way we can do this in the newer versions of Scala is by
> overriding initializeSynchronous(), as suggested by Som Snytt; I have a working
> PR with this approach,
> https://github.com/apache/spark/pull/21495 , and it should work for older
> versions of Scala too.
>
> However, in the newer versions of Scala the first thing the REPL calls is
> printWelcome, so with this approach the welcome message is shown first and
> then the URL of the Spark UI. This causes UI inconsistencies between
> different versions of Scala.
>
> We could also initialize Spark in printWelcome, which I feel is more hacky.
> It would only work for newer versions of Scala, since in older versions
> printWelcome is called at the end of the initialization process. If we
> decide to go this route, users basically cannot use Scala older than 2.11.9.
>
> I think this is also a blocker for moving to newer versions of Scala 2.12.x,
> since they have the same issue.
>
> In my opinion, Scala should fix the root cause and provide a stable hook
> for 3rd-party developers to initialize their custom code.
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
>
> > On Jun 7, 2018, at 6:43 AM, Felix Cheung 
> wrote:
> >
> > +1
> >
> > Spoke to Dean as well and mentioned the problem with 2.11.12
> https://github.com/scala/bug/issues/10913
> >
> > _
> > From: Sean Owen 
> > Sent: Wednesday, June 6, 2018 12:23 PM
> > Subject: Re: Scala 2.12 support
> > To: Holden Karau 
> > Cc: Dean Wampler , Reynold Xin <
> r...@databricks.com>, dev 
> >
> >
> > If it means no change to 2.11 support, seems OK to me for Spark 2.4.0.
> The 2.12 support is separate and has never been mutually compatible with
> 2.11 builds anyway. (I also hope, suspect that the changes are minimal;
> tests are already almost entirely passing with no change to the closure
> cleaner when built for 2.12)
> >
> > On Wed, Jun 6, 2018 at 1:33 PM Holden Karau 
> wrote:
> > Just chatted with Dean @ the summit, and it sounds like from Adriaan
> there is a fix in 2.13 for the API change issue that could be backported
> to 2.12, so how about we try and get this ball rolling?
> >
> > It sounds like it would also need a closure cleaner change, which could
> be backwards compatible; but since it’s such a core component and we might
> want to be cautious with it, we could use the old cleaner code when building
> for 2.11 and the new code for 2.12, so we don’t break anyone.
> >
> > How do folks feel about this?
> >
> >
> >
>
>


Re: Scala 2.12 support

2018-04-21 Thread Dean Wampler
Hi, Reynold,

Sorry for the delay in replying; I was traveling.

The Scala changes would avoid the need to change the API now. Basically,
the compiler would be modified to detect the particular case of the two
ambiguous, overloaded methods, then pick the best fit in a more
"intelligent" way. (They can provide more specific details). This would not
address the closure cleaner changes required. However, the Scala team
offered to provide suggestions or review changes.
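To make the ambiguity concrete, here is a stripped-down illustration (not
Spark's actual API) of why such overloads trip up Scala 2.12's SAM conversion:

// Stripped-down illustration only, not Spark's API.
trait ForeachFunction[T] { def call(t: T): Unit }  // Java-style functional interface

object Example {
  def foreach(f: Int => Unit): Unit = f(1)                 // Scala-friendly overload
  def foreach(f: ForeachFunction[Int]): Unit = f.call(1)   // Java-friendly overload
}

// Under 2.11 a lambda only matches the Function1 overload, so this compiles.
// Under 2.12 the lambda also satisfies the SAM type, so the compiler reports
// the call below as ambiguous; the proposed 2.12.6 patch would disambiguate it.
// Example.foreach(x => println(x))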

dean


*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
and other content from O'Reilly
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
https://github.com/deanwampler

On Thu, Apr 19, 2018 at 6:46 PM, Reynold Xin <r...@databricks.com> wrote:

> Forking the thread to focus on Scala 2.12.
>
> Dean,
>
> There are couple different issues with Scala 2.12 (closure cleaner, API
> breaking changes). Which one do you think we can address with a Scala
> upgrade? (The closure cleaner one I haven't spent a lot of time looking at
> it but it might involve more Spark side changes)
>
> On Thu, Apr 19, 2018 at 3:28 AM, Dean Wampler <deanwamp...@gmail.com>
> wrote:
>
>> I spoke with Martin Odersky and Lightbend's Scala Team about the known
>> API issue with method disambiguation. They offered to implement a small
>> patch in a new release of Scala 2.12 to handle the issue without requiring
>> a Spark API change. They would cut a 2.12.6 release for it. I'm told that
>> Scala 2.13 should already handle the issue without modification (it's not
>> yet released, to be clear). They can also offer feedback on updating the
>> closure cleaner.
>>
>> So, this approach would support Scala 2.12 in Spark, but limited to
>> 2.12.6+, without the API change requirement, but the closure cleaner would
>> still need updating. Hence, it could be done for Spark 2.X.
>>
>> Let me know if you want to pursue this approach.
>>
>> dean
>>
>>
>>
>>
>> *Dean Wampler, Ph.D.*
>>
>> *VP, Fast Data Engineering at Lightbend*
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do>, Fast Data
>> Architectures for Streaming Applications
>> <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
>> and other content from O'Reilly
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>> https://github.com/deanwampler
>>
>> On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>
>>> On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>> > Sorry, but just to be clear here, this is the 2.12 API issue:
>>> https://issues.apache.org/jira/browse/SPARK-14643, with more details in
>>> this doc:
>>> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
>>> >
>>> > Basically, if we are allowed to change Spark’s API a little to have
>>> only one version of methods that are currently overloaded between Java and
>>> Scala, we can get away with a single source tree for all Scala versions
>>> and Java ABI compatibility against any type of Spark (whether using Scala
>>> 2.11 or 2.12).
>>>
>>> Fair enough. To play devil's advocate, most of those methods seem to
>>> be marked "Experimental / Evolving", which could be used as a reason
>>> to change them for this purpose in a minor release.
>>>
>>> Not all of them are, though (e.g. foreach / foreachPartition are not
>>> experimental).
>>>
>>> --
>>> Marcelo
>>>
>>>
>>>
>>
>


Re: time for Apache Spark 3.0?

2018-04-19 Thread Dean Wampler
I spoke with Martin Odersky and Lightbend's Scala Team about the known API
issue with method disambiguation. They offered to implement a small patch
in a new release of Scala 2.12 to handle the issue without requiring a
Spark API change. They would cut a 2.12.6 release for it. I'm told that
Scala 2.13 should already handle the issue without modification (it's not
yet released, to be clear). They can also offer feedback on updating the
closure cleaner.

So, this approach would support Scala 2.12 in Spark, but limited to
2.12.6+, without the API change requirement, but the closure cleaner would
still need updating. Hence, it could be done for Spark 2.X.

Let me know if you want to pursue this approach.

dean




*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
and other content from O'Reilly
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
https://github.com/deanwampler

On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> > Sorry, but just to be clear here, this is the 2.12 API issue:
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in
> this doc:
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> >
> > Basically, if we are allowed to change Spark’s API a little to have only
> one version of methods that are currently overloaded between Java and
> Scala, we can get away with a single source tree for all Scala versions
> and Java ABI compatibility against any type of Spark (whether using Scala
> 2.11 or 2.12).
>
> Fair enough. To play devil's advocate, most of those methods seem to
> be marked "Experimental / Evolving", which could be used as a reason
> to change them for this purpose in a minor release.
>
> Not all of them are, though (e.g. foreach / foreachPartition are not
> experimental).
>
> --
> Marcelo
>
>
>


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Dean Wampler
Congratulations to both of you!

dean

*Dean Wampler, Ph.D.*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
Functional Programming for Java Developers
<http://shop.oreilly.com/product/0636920021667.do>, and Programming Hive
<http://shop.oreilly.com/product/0636920023555.do> (O'Reilly)
Lightbend <http://lightbend.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
https://github.com/deanwampler

On Tue, Jan 24, 2017 at 6:14 PM, Xiao Li <gatorsm...@gmail.com> wrote:

> Congratulations! Burak and Holden!
>
> 2017-01-24 10:13 GMT-08:00 Reynold Xin <r...@databricks.com>:
>
>> Hi all,
>>
>> Burak and Holden have recently been elected as Apache Spark committers.
>>
>> Burak has been very active in a large number of areas in Spark, including
>> linear algebra, stats/maths functions in DataFrames, Python/R APIs for
>> DataFrames, dstream, and most recently Structured Streaming.
>>
>> Holden has been a long-time Spark contributor and evangelist. She has
>> written a few books on Spark and has made frequent contributions to the
>> Python API to improve its usability and performance.
>>
>> Please join me in welcoming the two!
>>
>>
>>
>


Re: Apache Spark chat channel

2016-10-16 Thread Dean Wampler
Okay, here is a Gitter room for this purpose:

https://gitter.im/spark-scala/Lobby

If you use the APIs, please join and help those who are learning. I can't
answer every question.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Lightbend <http://lightbend.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Oct 6, 2016 at 9:21 AM, Dean Wampler <deanwamp...@gmail.com> wrote:

> Since I'm a Scala Spark advocate, I'll try to get a Scala Spark Gitter
> channel created, one way or another.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Lightbend <http://lightbend.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Oct 6, 2016 at 8:36 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, this comes up once in a while. There's no need or way to stop people
>> forming groups to chat, though blessing a new channel as 'official' is
>> tough because it means, in theory, everyone has to follow another channel
>> to see 100% of the discussion. I think that's why the couple of mailing
>> lists, which can be controlled and archived by the ASF, will stay the
>> official channels. But, naturally there's no problem with people forming
>> unofficial communities.
>>
>> On Thu, Oct 6, 2016 at 2:33 PM Jan-Hendrik Zab <j...@jhz.name> wrote:
>>
>>> Hello!
>>>
>>> There was a request on scala-debate [0] to create a Spark centric chat
>>> room under the scala namespace on Gitter with a focus on Scala related
>>> questions.
>>>
>>> This is just a heads up to the Apache Spark "management" to give them a
>>> chance to get involved. It might be better to create a dedicated channel
>>> under the Apache umbrella to better serve all users and not only those
>>> using Scala. Avoiding any artificial split of the Spark community.
>>> Reasons for having such a channel can be found in the linked thread.
>>>
>>> ps.
>>> Please CC me, since I'm not on the list.
>>>
>>> Best,
>>> -jhz
>>>
>>> (Resent, something apparently ate my first e-mail.)
>>>
>>> [0] - https://groups.google.com/forum/#!topic/scala-debate/OVGnIU2SNmc
>>>
>>>
>>>
>


Re: Apache Spark chat channel

2016-10-06 Thread Dean Wampler
Since I'm a Scala Spark advocate, I'll try to get a Scala Spark Gitter
channel created, one way or another.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Lightbend <http://lightbend.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Oct 6, 2016 at 8:36 AM, Sean Owen <so...@cloudera.com> wrote:

> Yes, this comes up once in a while. There's no need or way to stop people
> forming groups to chat, though blessing a new channel as 'official' is
> tough because it means, in theory, everyone has to follow another channel
> to see 100% of the discussion. I think that's why the couple of mailing
> lists, which can be controlled and archived by the ASF, will stay the
> official channels. But, naturally there's no problem with people forming
> unofficial communities.
>
> On Thu, Oct 6, 2016 at 2:33 PM Jan-Hendrik Zab <j...@jhz.name> wrote:
>
>> Hello!
>>
>> There was a request on scala-debate [0] to create a Spark centric chat
>> room under the scala namespace on Gitter with a focus on Scala related
>> questions.
>>
>> This is just a heads up to the Apache Spark "management" to give them a
>> chance to get involved. It might be better to create a dedicated channel
>> under the Apache umbrella to better serve all users and not only those
>> using Scala. Avoiding any artificial split of the Spark community.
>> Reasons for having such a channel can be found in the linked thread.
>>
>> ps.
>> Please CC me, since I'm not on the list.
>>
>> Best,
>> -jhz
>>
>> (Resent, something apparently ate my first e-mail.)
>>
>> [0] - https://groups.google.com/forum/#!topic/scala-debate/OVGnIU2SNmc
>>
>>
>>


Re: Using Spark when data definitions are unknowable at compile time

2016-04-28 Thread Dean Wampler
I would start with using DataFrames and the Row
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row>
API, because you can fetch fields by index. Presumably, you'll parse the
incoming data and determine what fields have what types, etc. Or, will
someone specify the schema dynamically somehow?

Either way, once you know the types and indices of the fields you need for
a given query, you can fetch them using the Row methods.
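A minimal sketch of that, assuming a SQLContext named sqlContext; the path and
the field name ("value") are placeholders for whatever is discovered at runtime:

import org.apache.spark.sql.Row

val df  = sqlContext.read.json("/path/to/series.json")  // placeholder input
val idx = df.schema.fieldIndex("value")  // resolve the column's index dynamically

// Fetch the field by index via the Row API instead of a compiled-in case class.
val mean = df.rdd
  .map((row: Row) => row.getAs[Double](idx))
  .mean()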

HTH,

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Lightbend <http://lightbend.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 28, 2016 at 11:34 AM, _na <nikhila.alb...@seeq.com> wrote:

> We are looking to incorporate Spark into a timeseries data investigation
> application, but we are having a hard time transforming our workflow into
> the required transformations-on-data model. The crux of the problem is that
> we don’t know a priori which data will be required for our transformations.
>
> For example, a common request might be `average($series2.within($ranges))`,
> where in order to fetch the right sections of data from $series2, $ranges
> will need to be computed first and then used to define data boundaries.
>
> Is there a way to get around the need to define data first in Spark?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Spark-when-data-definitions-are-unknowable-at-compile-time-tp17371.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>


Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-06 Thread Dean Wampler
A few other reasons to drop 2.10 support sooner rather than later.

   - We at Lightbend are evaluating some fundamental changes to the REPL to
   make it work better for large heaps, especially for Spark. There are other
   recent and planned enhancements. This work will benefit notebook users,
   too. However, we won't backport these improvements to 2.10.
   - Scala 2.12 is coming out midyear. It will require Java 8, which means
   it will produce dramatically smaller code (by exploiting lambdas instead of
   custom class generation for functions) and it will offer some performance
   improvements. Hopefully Spark will support it as an optional Scala
   version relatively quickly after availability, which means it would be nice
   to avoid supporting 3 versions of Scala.

Using Scala 2.10 at this point is like using Java 1.6, seriously out of
date. If you're using libraries that still require 2.10, are you sure that
library is being properly maintained? Or is it a legacy dependency that
should be eliminated before it becomes a liability? Even if you can't
upgrade Scala versions in the next few months, you can certainly continue
using Spark 1.X until you're ready to upgrade.

So, I recommend that Spark 2.0 drop Scala 2.10 support from the beginning.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Lightbend <http://lightbend.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Apr 5, 2016 at 8:54 PM, Kostas Sakellis <kos...@cloudera.com> wrote:

> From both this and the JDK thread, I've noticed (including myself) that
> people have different notions of compatibility guarantees between major and
> minor versions.
> A simple question I have is: What compatibility can we break between minor
> vs. major releases?
>
> It might be worth getting on the same page wrt compatibility guarantees.
>
> Just a thought,
> Kostas
>
> On Tue, Apr 5, 2016 at 4:39 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is
>> deprecation warnings in our builds that we can't fix without introducing a
>> wrapper/ scala version specific code. This isn't a big deal, and if we drop
>> 2.10 in the 3-6 month time frame talked about we can cleanup those warnings
>> once we get there.
>>
>> On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors <
>> raymond.honderd...@sizmek.com> wrote:
>>
>>> What about a seperate branch for scala 2.10?
>>>
>>>
>>>
>>> Sent from my Samsung Galaxy smartphone.
>>>
>>>
>>>  Original message 
>>> From: Koert Kuipers <ko...@tresata.com>
>>> Date: 4/2/2016 02:10 (GMT+02:00)
>>> To: Michael Armbrust <mich...@databricks.com>
>>> Cc: Matei Zaharia <matei.zaha...@gmail.com>, Mark Hamstra <
>>> m...@clearstorydata.com>, Cody Koeninger <c...@koeninger.org>, Sean
>>> Owen <so...@cloudera.com>, dev@spark.apache.org
>>> Subject: Re: Discuss: commit to Scala 2.10 support for Spark 2.x
>>> lifecycle
>>>
>>> as long as we don't lock ourselves into supporting scala 2.10 for the
>>> entire spark 2 lifespan it sounds reasonable to me
>>>
>>> On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> +1 to Matei's reasoning.
>>>>
>>>> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia <matei.zaha...@gmail.com
>>>> > wrote:
>>>>
>>>>> I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
>>>>> entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
>>>>> the default version we built with in 1.x. We want to make the transition
>>>>> from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default
>>>>> downloads be for Scala 2.11, so people will more easily move, but we
>>>>> shouldn't create obstacles that lead to fragmenting the community and
>>>>> slowing down Spark 2.0's adoption. I've seen companies that stayed on an
>>>>> old Scala version for multiple years because switching it, or mixing
>>>>> versions, would affect the company's entire codebase.
>>>>>
>>>>> Matei
>>>>>
>>>>> On Mar 30, 2016, at 12:08 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>> oh wow, had no idea it got ripped out
>>>>>
>>>>> On Wed, Mar 30, 2016 at 11:50 AM, Mark H

Re: Akka with Spark

2015-12-27 Thread Dean Wampler
As Reynold said, you can still use Akka with Spark, but now it's more like
using any third-party library that isn't already a Spark dependency (at
least once the current Akka dependency is fully removed).

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, Dec 27, 2015 at 4:06 AM, Disha Shrivastava <dishu@gmail.com>
wrote:

> Hi All,
>
> I need an Akka like framework to implement model parallelism in neural
> networks, an architecture similar to that given in the link
> http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html.
> I need to divide a big neural network ( which can't fit into the memory of
> one machine) layer by layer and do message passing across actors which are
> distributed across different worker machines. I found Akka to be most
> suitable for the job.
>
> Please suggest if it can be done by any other suitable frameworks.
>
> Regards,
> Disha
>
> On Sun, Dec 27, 2015 at 1:04 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> We are just removing Spark's dependency on Akka. It has nothing to do
>> with whether user applications can use Akka or not. As a matter of fact, by
>> removing the Akka dependency from Spark, it becomes easier for user
>> applications to use Akka, because there is no more dependency conflict.
>>
>> For more information, see
>> https://issues.apache.org/jira/browse/SPARK-5293
>>
>> On Sat, Dec 26, 2015 at 9:31 PM, Soumya Simanta <soumya.sima...@gmail.com
>> > wrote:
>>
>>>
>>>
>>> Any rationale for removing Akka from Spark ? Also, what is the
>>> replacement ?
>>>
>>> Thanks
>>>
>>> On Dec 27, 2015, at 8:31 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
>>>
>>> Note that Akka is being removed from Spark. Even if it weren't, I would
>>> consider keeping Akka processes separate from Spark processes, so you can
>>> monitor, debug, and scale them independently. So consider streaming data
>>> from Akka to Spark Streaming or go the other way, from Spark to Akka
>>> Streams.
>>>
>>> dean
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Sat, Dec 26, 2015 at 12:54 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Do you mind sharing your use case ?
>>>>
>>>> It may be possible to use a different approach than Akka.
>>>>
>>>> Cheers
>>>>
>>>> On Sat, Dec 26, 2015 at 10:08 AM, Disha Shrivastava <
>>>> dishu@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I wanted to know how to use Akka framework with Spark starting from
>>>>> basics. I saw online that Spark uses Akka framework but I am not really
>>>>> sure if I can define Actors and use it in Spark.
>>>>>
>>>>> Also, how to integrate Akka with Spark as in how will I know how many
>>>>> Akka actors are running on each of my worker machines? Can I control that?
>>>>>
>>>>> Please help. The only useful resource which I could find online was
>>>>> Akka with Spark Streaming which was also not very clear.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Disha
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Akka with Spark

2015-12-26 Thread Dean Wampler
Note that Akka is being removed from Spark. Even if it weren't, I would
consider keeping Akka processes separate from Spark processes, so you can
monitor, debug, and scale them independently. So consider streaming data
from Akka to Spark Streaming or go the other way, from Spark to Akka
Streams.
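A minimal sketch of the Spark Streaming side of that separation; the host,
port, and record format are placeholders, and the Akka-based service would
simply publish text records on that socket:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("akka-feed-consumer")
val ssc  = new StreamingContext(conf, Seconds(5))

// Consume records published by the separate Akka process over a plain socket.
val lines = ssc.socketTextStream("akka-service-host", 9999)
lines.count().print()  // any Spark transformations go here

ssc.start()
ssc.awaitTermination()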

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sat, Dec 26, 2015 at 12:54 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Do you mind sharing your use case ?
>
> It may be possible to use a different approach than Akka.
>
> Cheers
>
> On Sat, Dec 26, 2015 at 10:08 AM, Disha Shrivastava <dishu@gmail.com>
> wrote:
>
>> Hi,
>>
>> I wanted to know how to use Akka framework with Spark starting from
>> basics. I saw online that Spark uses Akka framework but I am not really
>> sure if I can define Actors and use it in Spark.
>>
>> Also, how to integrate Akka with Spark as in how will I know how many
>> Akka actors are running on each of my worker machines? Can I control that?
>>
>> Please help. The only useful resource which I could find online was Akka
>> with Spark Streaming which was also not very clear.
>>
>> Thanks,
>>
>> Disha
>>
>
>


Re: [ANNOUNCE] Spark 1.6.0 Release Preview

2015-11-23 Thread Dean Wampler
utor.runWorker(ThreadPoolExecutor.java:1142)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.util.concurrent.TimeoutException: Futures timed out
after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
  at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:190)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  ... 15 more
15/11/23 13:04:56 WARN NettyRpcEndpointRef: Ignore message
Success(HeartbeatResponse(false))

[Stage 1:=> (2204 + 6) /
10]
[Stage 1:=> (2858 + 4) /
10]
[Stage 1:=> (3616 + 5) /
10]
... elided ...
[Stage 1:=>(98393 + 4) /
10]
[Stage 1:=>(99347 + 4) /
10]
[Stage 1:=====>(99734 + 4) /
10]

res1: Long = 100

scala>


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, Nov 22, 2015 at 4:21 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> In order to facilitate community testing of Spark 1.6.0, I'm excited to
> announce the availability of an early preview of the release. This is not a
> release candidate, so there is no voting involved. However, it'd be awesome
> if community members can start testing with this preview package and report
> any problems they encounter.
>
> This preview package contains all the commits to branch-1.6
> <https://github.com/apache/spark/tree/branch-1.6> till commit
> 308381420f51b6da1007ea09a02d740613a226e0
> <https://github.com/apache/spark/tree/v1.6.0-preview2>.
>
> The staging maven repository for this preview build can be found here:
> https://repository.apache.org/content/repositories/orgapachespark-1162
>
> Binaries for this preview build can be found here:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-bin/
>
> A build of the docs can also be found here:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-docs/
>
> The full change log for this release can be found on JIRA
> <https://issues.apache.org/jira/browse/SPARK-11908?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.6.0>
> .
>
> *== How can you help? ==*
>
> If you are a Spark user, you can help us test this release by taking a
> Spark workload and running on this preview release, then reporting any
> regressions.
>
> *== Major Features ==*
>
> When testing, we'd appreciate it if users could focus on areas that have
> changed in this release.  Some notable new features include:
>
> SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> *Parquet
> Performance* - Improve Parquet scan performance when using flat schemas.
> SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810> *Session *
> *Management* - Multiple users of the thrift (JDBC/ODBC) server now have
> isolated sessions including their own default database (i.e USE mydb)
> even on shared clusters.
> SPARK-  <https://issues.apache.org/jira/browse/SPARK-> *Dataset
> API* - A new, experimental type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten)
> SPARK-1 <https://issues.apache.org/jira/browse/SPARK-1> *Unified
> Memory Management* - Shared memory for execution and caching instead of
> exclusive division of the regions.
> SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> *Datasource
> API Avoid Double Filter* - When implementing a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> *New
> improved state management* - trackStateByKey - a DStream transformation
> for stateful stream processing, supersedes updateStateByKey in
> functionality and performance.
>
> Happy testing!
>
> Michael
>
>


Re: Removing the Mesos fine-grained mode

2015-11-19 Thread Dean Wampler
Sounds like the right move. Simplifies things in important ways.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Nov 19, 2015 at 5:42 AM, Iulian Dragoș <iulian.dra...@typesafe.com>
wrote:

> Hi all,
>
> Mesos is the only cluster manager that has a fine-grained mode, but it's
> more often than not problematic, and it's a maintenance burden. I'd like to
> suggest removing it in the 2.0 release.
>
> A few reasons:
>
> - code/maintenance complexity. The two modes duplicate a lot of
> functionality (and sometimes code) that leads to subtle differences or
> bugs. See SPARK-10444 <https://issues.apache.org/jira/browse/SPARK-10444> and
> also this thread
> <https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3ccalxmp-a+aygnwsiytm8ff20-mgwhykbhct94a2hwzth1jwh...@mail.gmail.com%3E>
>  and MESOS-3202 <https://issues.apache.org/jira/browse/MESOS-3202>
> - it's not widely used (Reynold's previous thread
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html>
> got very few responses from people relying on it)
> - similar functionality can be achieved with dynamic allocation +
> coarse-grained mode
>
> I suggest that Spark 1.6 already issues a warning if it detects
> fine-grained use, with removal in the 2.0 release.
>
> Thoughts?
>
> iulian
>
>


Re: [discuss] ending support for Java 6?

2015-05-01 Thread Dean Wampler
FWIW, another reason to start planning for deprecation of Java 7, too, is
that Scala 2.12 will require Java 8. Scala 2.12 will be released early next
year.


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Apr 30, 2015 at 3:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 +1 on ending support for Java 6.

 BTW from https://www.java.com/en/download/faq/java_7.xml :
 After April 2015, Oracle will no longer post updates of Java SE 7 to its
 public download sites.

 On Thu, Apr 30, 2015 at 1:34 PM, Punyashloka Biswal 
 punya.bis...@gmail.com
 wrote:

  I'm in favor of ending support for Java 6. We should also articulate a
  policy on how long we want to support current and future versions of Java
  after Oracle declares them EOL (Java 7 will be in that bucket in a matter
  of days).
 
  Punya
  On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote:
 
   something to keep in mind:  we can easily support java 6 for the build
   environment, particularly if there's a definite EOL.
  
   i'd like to fix our java versioning 'problem', and this could be a big
   instigator...  right now we're hackily setting java_home in test
  invocation
   on jenkins, which really isn't the best.  if i decide, within jenkins,
 to
   reconfigure every build to 'do the right thing' WRT java version, then
 i
   will clean up the old mess and pay down on some technical debt.
  
   or i can just install java 6 and we use that as JAVA_HOME on a
   build-by-build basis.
  
   this will be a few days of prep and another morning-long downtime if i
 do
   the right thing (within jenkins), and only a couple of hours the hacky
  way
   (system level).
  
   either way, we can test on java 6.  :)
  
   On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
nicholas started it! :)
   
for java 6 i would have said the same thing about 1 year ago: it is
   foolish
to drop it. but i think the time is right about now.
about half our clients are on java 7 and the other half have active
  plans
to migrate to it within 6 months.
   
On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com
   wrote:
   
 Guys thanks for chiming in, but please focus on Java here. Python
 is
  an
 entirely separate issue.


 On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
 
wrote:

 i am not sure eol means much if it is still actively used. we
 have a
   lot
 of clients with centos 5 (for which we still support python 2.4 in
   some
 form or another, fun!). most of them are on centos 6, which means
   python
 2.6. by cutting out python 2.6 you would cut out the majority of
 the
actual
 clusters i am aware of. unless you intention is to truly make
   something
 academic i dont think that is wise.

 On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 (On that note, I think Python 2.6 should be next on the chopping
   block
 sometime later this year, but that’s for another thread.)

 (To continue the parenthetical, Python 2.6 was in fact EOL-ed in
October
 of
 2013. https://www.python.org/download/releases/2.6.9/)
 ​

 On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
 nicholas.cham...@gmail.com
 wrote:

  I understand the concern about cutting out users who still use
  Java
6,
 and
  I don't have numbers about how many people are still using Java
  6.
 
  But I want to say at a high level that I support deprecating
  older
  versions of stuff to reduce our maintenance burden and let us
 use
more
  modern patterns in our code.
 
  Maintenance always costs way more than initial development over
  the
  lifetime of a project, and for that reason anti-support is
 just
   as
  important as support.
 
  (On that note, I think Python 2.6 should be next on the
 chopping
block
  sometime later this year, but that's for another thread.)
 
  Nick
 
 
  On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin 
 r...@databricks.com
  
 wrote:
 
  This has been discussed a few times in the past, but now
 Oracle
   has
 ended
  support for Java 6 for over a year, I wonder if we should just
   drop
 Java 6
  support.
 
  There is one outstanding issue Tom has brought to my
 attention:
 PySpark on
  YARN doesn't work well with Java 7/8, but we have an
 outstanding
pull
  request to fix that.
 
  https://issues.apache.org/jira/browse/SPARK-6869
  https://issues.apache.org/jira/browse/SPARK-1920
 
 




   
  
 



Re: Need advice for Spark newbie

2015-02-26 Thread Dean Wampler
Historically, many orgs. have replaced data warehouses with Hadoop clusters
and used Hive along with Impala (on Cloudera deployments) or Drill (on MapR
deployments) for SQL. Hive is older and slower, while Impala and Drill are
newer and faster, but you typically need both for their complementary
features, at least today.

Spark and Spark SQL are not yet complete replacements for them, but they'll
get there over time. The good news is, you can mix and match these tools,
as appropriate, because they can all work with the same datasets.

The challenge is all the tribal knowledge required to setup and manage
Hadoop clusters, to properly organize your data for best performance for
your needs, to use all these tools effectively, along with additional
Hadoop ETL tools, etc. Fortunately, tools like Tableau are already
integrated here.

However, none of this will be as polished and integrated as what you're
used to. You're trading that polish for greater scalability and flexibility.

HTH.


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone vikramk...@gmail.com wrote:

 Hi,
 I'm a newbie when it comes to Spark and Hadoop eco system in general. Our
 team has been predominantly a Microsoft shop that uses MS stack for most of
 their BI needs. So we are talking SQL server  for storing relational data
 and SQL Server Analysis services for building MOLAP cubes for sub-second
 query analysis.
 Lately, we have been hitting degradation in our cube query response times
 as our data sizes grew considerably the past year. We are talking fact
 tables which are in 1o-100 billions of rows range and a few dimensions in
 the 10-100's of millions of rows. We tried vertically scaling up our SSAS
 server but queries are still taking few minutes. In light of this, I was
 entrusted with task of figuring out an open source solution that would
 scale to our current and future needs for data analysis.
 I looked at a bunch of open source tools like Apache Drill, Druid, AtScale,
 Spark, Storm, Kylin etc and settled on exploring Spark as the first step
 given it's recent rise in popularity and growing eco-system around it.
 Since we are also interested in doing deep data analysis like machine
 learning and graph algorithms on top our data, spark seems to be a good
 solution.
 I would like to build out a POC for our MOLAP cubes using spark with
 HDFS/Hive as the datasource and see how it scales for our queries/measures
 in real time with real data.
 Roughly, these are the requirements for our team
 1. Should be able to create facts, dimensions and measures from our data
 sets in an easier way.
 2. Cubes should be query able from Excel and Tableau.
 3. Easily scale out by adding new nodes when data grows
 4. Very less maintenance and highly stable for production level workloads
 5. Sub second query latencies for COUNT DISTINCT measures (since majority
 of our expensive measures are of this type) . Are ok with Approx Distinct
 counts for better perf.

 So given these requirements, is Spark the right solution to replace our
 on-premise MOLAP cubes?
 Are there any tutorials or documentation on how to build cubes using Spark?
 Is that even possible? or even necessary? As long as our users can
 pivot/slice & dice the measures quickly from client tools by dragging &
 dropping dimensions into rows/columns w/o the need to join to the fact table,
 we are ok with however the data is laid out. Doesn't have to be a cube. It
 can be a flat file in hdfs for all we care. I would love to chat with some
 one who has successfully done this kind of migration from OLAP cubes to
 Spark in their team or company .

 This is it for now. Looking forward to a great discussion.

 P.S. We have decided on using Azure HDInsight as our managed hadoop system
 in the cloud.



Re: Need advice for Spark newbie

2015-02-26 Thread Dean Wampler
There's no support for star or snowflake models, per se. What you get with
Hadoop is access to all your data and the processing power to build the ad
hoc queries you want, when you need them, rather than having to figure out
a schema/model in advance.

I recommend that you also ask your questions on one of the Hadoop or Hive
user mailing lists, where you'll find people who have moved data warehouses
to Hadoop. Then you can use Spark for some of the tasks you'll do. This
dev (developer) mailing list isn't really the place to discuss this
anyway. (The user list would be slightly better.)

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Feb 26, 2015 at 3:23 PM, Vikram Kone vikramk...@gmail.com wrote:

 Dean
 Thanks for the info. Are you saying that we can create star/snowflake data
 models using spark so they can be queried from tableau ?


 On Thursday, February 26, 2015, Dean Wampler deanwamp...@gmail.com
 wrote:

 Historically, many orgs. have replaced data warehouses with Hadoop
 clusters and used Hive along with Impala (on Cloudera deployments) or Drill
 (on MapR deployments) for SQL. Hive is older and slower, while Impala and
 Drill are newer and faster, but you typically need both for their
 complementary features, at least today.

 Spark and Spark SQL are not yet complete replacements for them, but
 they'll get there over time. The good news is, you can mix and match these
 tools, as appropriate, because they can all work with the same datasets.

 The challenge is all the tribal knowledge required to setup and manage
 Hadoop clusters, to properly organize your data for best performance for
 your needs, to use all these tools effectively, along with additional
 Hadoop ETL tools, etc. Fortunately, tools like Tableau are already
 integrated here.

 However, none of this will be as polished and integrated as what you're
 used to. You're trading that polish for greater scalability and flexibility.

 HTH.


 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone vikramk...@gmail.com
 wrote:

 Hi,
 I'm a newbie when it comes to Spark and Hadoop eco system in general. Our
 team has been predominantly a Microsoft shop that uses MS stack for most
 of
 their BI needs. So we are talking SQL server  for storing relational data
 and SQL Server Analysis services for building MOLAP cubes for sub-second
 query analysis.
 Lately, we have been hitting degradation in our cube query response times
 as our data sizes grew considerably the past year. We are talking fact
 tables which are in the 10-100 billion row range and a few dimensions in
 the 10-100's of millions of rows. We tried vertically scaling up our SSAS
 server but queries are still taking few minutes. In light of this, I was
 entrusted with task of figuring out an open source solution that would
 scale to our current and future needs for data analysis.
 I looked at a bunch of open source tools like Apache Drill, Druid,
 AtScale,
 Spark, Storm, Kylin etc and settled on exploring Spark as the first step
 given it's recent rise in popularity and growing eco-system around it.
 Since we are also interested in doing deep data analysis like machine
 learning and graph algorithms on top our data, spark seems to be a good
 solution.
 I would like to build out a POC for our MOLAP cubes using spark with
 HDFS/Hive as the datasource and see how it scales for our
 queries/measures
 in real time with real data.
 Roughly, these are the requirements for our team
 1. Should be able to create facts, dimensions and measures from our data
 sets in an easier way.
 2. Cubes should be query able from Excel and Tableau.
 3. Easily scale out by adding new nodes when data grows
 4. Very less maintenance and highly stable for production level workloads
 5. Sub second query latencies for COUNT DISTINCT measures (since majority
 of our expensive measures are of this type) . Are ok with Approx Distinct
 counts for better perf.

 So given these requirements, is Spark the right solution to replace our
 on-premise MOLAP cubes?
 Are there any tutorials or documentation on how to build cubes using
 Spark?
 Is that even possible? or even necessary? As long as our users can
 pivot/slice & dice the measures quickly from client tools by dragging &
 dropping dimensions into rows/columns w/o the need to join to the fact table,
 we are ok with however the data is laid out. Doesn't have to be a cube.
 It
 can be a flat file in hdfs for all we care. I would love to chat with
 some
 one who has successfully done this kind of migration from OLAP cubes to
 Spark in their team or company .

 This is it for now

Re: best IDE for scala + spark development?

2014-10-27 Thread Dean Wampler
For what it's worth, I use Sublime Text + the SBT console for everything. I
can live without the extra IDE features.

However, if you like an IDE, the Eclipse Scala IDE 4.0 RC1 is a big
improvement over previous releases. For one thing, it can now supports
projects using different versions of Scala, which is convenient for Spark's
current 2.10.4 support and emerging 2.11 support.

http://scala-ide.org/download/milestone.html

Dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Sun, Oct 26, 2014 at 5:06 PM, Duy Huynh duy.huynh@gmail.com wrote:

 i like intellij and eclipse too, but some that they are too heavy.  i would
 love to use vim.  are there are good scala plugins for vim?  (i.e code
 completion, scala doc, etc)

 On Sun, Oct 26, 2014 at 12:32 PM, Jay Vyas jayunit100.apa...@gmail.com
 wrote:

  I tried the scala eclipse ide but in scala 2.10 I ran into some weird
  issues
 
 http://stackoverflow.com/questions/24253084/scalaide-and-cryptic-classnotfound-errors
  ... So I switched to IntelliJ and was much more satisfied...
 
  I've written a post on how I use fedora,sbt, and intellij for spark apps.
 
 
 http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html?m=1
 
  The IntelliJ sbt plugin is imo less buggy than the eclipse scalaIDE
  stuff.  For example, I found I had to set some special preferences
 
  Finally... given sbts automated recompile option, if you just use tmux,
  and vim nerdtree, with sbt , you could come pretty close to something
 like
  an IDE without all the drama ..
 
   On Oct 26, 2014, at 11:07 AM, ll duy.huynh@gmail.com wrote:
  
   i'm new to both scala and spark.  what IDE / dev environment do you
 find
  most
   productive for writing code in scala with spark?  is it just vim + sbt?
  or
   does a full IDE like intellij works out better?  thanks!
  
  
  
   --
   View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/best-IDE-for-scala-spark-development-tp8965.html
   Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
  
  
 



Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Dean Wampler
Try this:

#!/bin/bash
for x in "$@"; do
  echo "arg: $x"
done
ARGS_COPY=("$@") # Make ARGS_COPY an array with the array elements in "$@"

for x in "${ARGS_COPY[@]}"; do    # preserve array arguments.
  echo "arg_copy: $x"
done
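
With the copy made as a quoted array, the arg_copy loop prints the same four
arguments as the arg loop in your example below, including the grouped "c d e".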



On Wed, Apr 30, 2014 at 3:51 PM, Patrick Wendell pwend...@gmail.com wrote:

 So I reproduced the problem here:

 == test.sh ==
 #!/bin/bash
 for x in "$@"; do
   echo "arg: $x"
 done
 ARGS_COPY="$@"
 for x in "$ARGS_COPY"; do
   echo "arg_copy: $x"
 done
 ==

 ./test.sh a b "c d e" f
 arg: a
 arg: b
 arg: c d e
 arg: f
 arg_copy: a b c d e f

 I'll dig around a bit more and see if we can fix it. Pretty sure we
 aren't passing these argument arrays around correctly in bash.

 On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Yeah I think the problem is that the spark-submit script doesn't pass
  the argument array to spark-class in the right way, so any quoted
  strings get flattened.
 
  I think we'll need to figure out how to do this correctly in the bash
  script so that quoted strings get passed in the right way.
 
  I tried a few different approaches but finally ended up giving up; my
  bash-fu is apparently not strong enough. If you can make it work
  great, but I have -J working locally in case you give up like me.
  :-)
 
  --
  Marcelo




-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Spark 1.0.0 rc3

2014-04-29 Thread Dean Wampler
Thanks. I'm fine with the logic change, although I was a bit surprised to
see Hadoop used for file I/O.

Anyway, the jira issue and pull request discussions mention a flag to
enable overwrites. That would be very convenient for a tutorial I'm
writing, although I wouldn't recommend it for normal use, of course.
However, I can't figure out if this actually exists. I found the
spark.files.overwrite property, but that doesn't apply.  Does this override
flag, method call, or method argument actually exist?
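In the meantime, a workaround sketch (not an official Spark flag; it assumes
the SparkContext `sc` and the wc2 RDD from the quoted script below) is to
remove the output directory first with the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Workaround sketch only, not an official Spark option.
val outPath = new Path("output/some/directory")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outPath)) fs.delete(outPath, true)  // recursive delete
wc2.saveAsTextFile(outPath.toString)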

Thanks,
Dean


On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi Dean,

 We always used the Hadoop libraries here to read and write local
 files. In Spark 1.0 we started enforcing the rule that you can't
 over-write an existing directory because it can cause
 confusing/undefined behavior if multiple jobs output to the directory
 (they partially clobber each other's output).

 https://issues.apache.org/jira/browse/SPARK-1100
 https://github.com/apache/spark/pull/11

 In the JIRA I actually proposed slightly deviating from Hadoop
 semantics and allowing the directory to exist if it is empty, but I
 think in the end we decided to just go with the exact same semantics
 as Hadoop (i.e. empty directories are a problem).

 - Patrick

 On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com
 wrote:
  I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
 using
  HDFS classes for file I/O, while the same script compiled and running
 with
  0.9.1 uses only the local-mode File IO.
 
  The script is a variation of the Word Count script. Here are the guts:
 
  object WordCount2 {
def main(args: Array[String]) = {
 
  val sc = new SparkContext("local", "Word Count (2)")

  val input = sc.textFile(".../some/local/file").map(line =>
  line.toLowerCase)
  input.cache

  val wc2 = input
    .flatMap(line => line.split("\\W+"))
    .map(word => (word, 1))
    .reduceByKey((count1, count2) => count1 + count2)

  wc2.saveAsTextFile("output/some/directory")
 
  sc.stop()
 
  It works fine compiled and executed with 0.9.1. If I recompile and run
 with
  1.0.0-RC1, where the same output directory still exists, I get this
  familiar Hadoop-ish exception:
 
  [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
  Output directory
 
 file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
  already exists
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
 
 file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
  already exists
   at
 
 org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
  at
 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
   at
 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
  at
 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
  at spark.activator.WordCount2$.main(WordCount2.scala:42)
   at spark.activator.WordCount2.main(WordCount2.scala)
  ...
 
  Thoughts?
 
 
  On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hey All,
 
  This is not an official vote, but I wanted to cut an RC so that people
 can
  test against the Maven artifacts, test building with their
 configuration,
  etc. We are still chasing down a few issues and updating docs, etc.
 
  If you have issues or bug reports for this release, please send an
 e-mail
  to the Spark dev list and/or file a JIRA.
 
  Commit: d636772 (v1.0.0-rc3)
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
 
  Binaries:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3/
 
  Docs:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
 
  Repository:
  https://repository.apache.org/content/repositories/orgapachespark-1012/
 
  == API Changes ==
  If you want to test building against Spark there are some minor API
  changes. We'll get these written up for the final release but I'm
 noting a
  few here (not comprehensive):
 
  changes to ML vector specification:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  == Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  == Call toSeq on the result to restore old behavior
 
  Streaming classes have been renamed:
  NetworkReceiver - Receiver
 
 
 
 
  --
  Dean Wampler, Ph.D.
  Typesafe
  @deanwampler
  http://typesafe.com
  http://polyglotprogramming.com




-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler