Re: Welcome Jose Torres as a Spark committer
Congrats, Jose! *Dean Wampler, Ph.D.* *VP, Fast Data Engineering at Lightbend* On Tue, Jan 29, 2019 at 12:52 PM Burak Yavuz wrote: > Congrats Jose! > > On Tue, Jan 29, 2019 at 10:50 AM Xiao Li wrote: > >> Congratulations! >> >> Xiao >> >> Shixiong Zhu wrote on Tue, Jan 29, 2019 at 10:48 AM: >> >>> Hi all, >>> >>> The Apache Spark PMC recently added Jose Torres as a committer on the >>> project. Jose has been a major contributor to Structured Streaming. Please >>> join me in welcoming him! >>> >>> Best Regards, >>> >>> Shixiong Zhu >>> >>>
Re: Make Scala 2.12 as default Scala version in Spark 3.0
I spoke with the Scala team at Lightbend. They plan to do a 2.13-RC1 release in January and GA a few months later. Of course, nothing is ever certain. What's the thinking for the Spark 3.0 timeline? If it's likely to be late Q1 or in Q2, then it might make sense to add Scala 2.13 as an alternative Scala version. dean *Dean Wampler, Ph.D.* *VP, Fast Data Engineering at Lightbend* Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures for Streaming Applications <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, and other content from O'Reilly @deanwampler <http://twitter.com/deanwampler> https://www.linkedin.com/in/deanwampler/ http://polyglotprogramming.com https://github.com/deanwampler https://www.flickr.com/photos/deanwampler/ On Tue, Nov 6, 2018 at 7:48 PM Sean Owen wrote: > That's possible here, sure. The issue is: would you exclude Scala 2.13 > support in 3.0 for this, if it were otherwise ready to go? > I think it's not a hard rule that something has to be deprecated > previously to be removed in a major release. The notice is helpful, > sure, but there are lots of ways to provide that notice to end users. > Lots of things are breaking changes in a major release. Or: deprecate > in Spark 2.4.1, if desired? > > On Tue, Nov 6, 2018 at 7:36 PM Wenchen Fan wrote: > > > > We make Scala 2.11 the default one in Spark 2.0, then drop Scala 2.10 in > Spark 2.3. Shall we follow it and drop Scala 2.11 at some point of Spark > 3.x? > > > > On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin wrote: > >> > >> Have we deprecated Scala 2.11 already in an existing release? > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: Scala 2.12 support
Do the tests expect a particular console output order? That would annoy them. ;) You could sort the expected and output lines, then diff... *Dean Wampler, Ph.D.* *VP, Fast Data Engineering at Lightbend* Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures for Streaming Applications <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, and other content from O'Reilly @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com https://github.com/deanwampler On Thu, Jun 7, 2018 at 5:09 PM, Holden Karau wrote: > If the difference is the order of the welcome message I think that should > be fine. > > On Thu, Jun 7, 2018, 4:43 PM Dean Wampler wrote: > >> I'll point the Scala team to this issue, but it's unlikely to get fixed >> any time soon. >> >> dean >> >> >> *Dean Wampler, Ph.D.* >> >> *VP, Fast Data Engineering at Lightbend* >> Author: Programming Scala, 2nd Edition >> <http://shop.oreilly.com/product/0636920033073.do>, Fast Data >> Architectures for Streaming Applications >> <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, >> and other content from O'Reilly >> @deanwampler <http://twitter.com/deanwampler> >> http://polyglotprogramming.com >> https://github.com/deanwampler >> >> On Thu, Jun 7, 2018 at 4:27 PM, DB Tsai wrote: >> >>> Thanks Felix for bringing this up. >>> >>> Currently, in Scala 2.11.8, we initialize the Spark by overriding >>> loadFIles() before REPL sees any file since there is no good hook in Scala >>> to load our initialization code. >>> >>> In Scala 2.11.12 and newer version of the Scala 2.12.x, loadFIles() >>> method was removed. >>> >>> Alternatively, one way we can do in the newer version of Scala is by >>> overriding initializeSynchronous() suggested by Som Snytt; I have a working >>> PR with this approach, >>> https://github.com/apache/spark/pull/21495 , and this approach should >>> work for older version of Scala too. >>> >>> However, in the newer version of Scala, the first thing that the REPL >>> calls is printWelcome, so in the newer version of Scala, welcome message >>> will be shown and then the URL of the SparkUI in this approach. This will >>> cause UI inconsistencies between different versions of Scala. >>> >>> We can also initialize the Spark in the printWelcome which I feel more >>> hacky. It will only work for newer version of Scala since in order version >>> of Scala, printWelcome is called in the end of the initialization process. >>> If we decide to go this route, basically users can not use Scala older than >>> 2.11.9. >>> >>> I think this is also a blocker for us to move to newer version of Scala >>> 2.12.x since the newer version of Scala 2.12.x has the same issue. >>> >>> In my opinion, Scala should fix the root cause and provide a stable hook >>> for 3rd party developers to initialize their custom code. >>> >>> DB Tsai | Siri Open Source Technologies [not a contribution] | >>> Apple, Inc >>> >>> > On Jun 7, 2018, at 6:43 AM, Felix Cheung >>> wrote: >>> > >>> > +1 >>> > >>> > Spoke to Dean as well and mentioned the problem with 2.11.12 >>> https://github.com/scala/bug/issues/10913 >>> > >>> > _ >>> > From: Sean Owen >>> > Sent: Wednesday, June 6, 2018 12:23 PM >>> > Subject: Re: Scala 2.12 support >>> > To: Holden Karau >>> > Cc: Dean Wampler , Reynold Xin < >>> r...@databricks.com>, dev >>> > >>> > >>> > If it means no change to 2.11 support, seems OK to me for Spark 2.4.0. 
>>> The 2.12 support is separate and has never been mutually compatible with >>> 2.11 builds anyway. (I also hope, suspect that the changes are minimal; >>> tests are already almost entirely passing with no change to the closure >>> cleaner when built for 2.12) >>> > >>> > On Wed, Jun 6, 2018 at 1:33 PM Holden Karau >>> wrote: >>> > Just chatted with Dean @ the summit and it sounds like from Adriaan >>> there is a fix in 2.13 for the API change issue that could be back ported >>> to 2.12 so how about we try and get this ball rolling? >>> > >>> > It sounds like it would also need a closure cleaner change, which >>> could be backwards compatible but since it’s such a core component and we >>> might want to be cautious with it, we could when building for 2.11 use the >>> old cleaner code and for 2.12 use the new code so we don’t break anyone. >>> > >>> > How do folks feel about this? >>> > >>> > >>> > >>> >>> >>
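A small illustration of the sort-then-diff suggestion above. This is a sketch only; the helper name is invented and is not taken from Spark's actual test code:

// Compare two REPL transcripts while ignoring the order in which lines were printed,
// e.g. whether the welcome banner or the Spark UI URL appears first.
def sameLinesIgnoringOrder(expected: String, actual: String): Boolean =
  expected.split("\n").map(_.trim).sorted
    .sameElements(actual.split("\n").map(_.trim).sorted)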
Re: Scala 2.12 support
I'll point the Scala team to this issue, but it's unlikely to get fixed any time soon. dean *Dean Wampler, Ph.D.* *VP, Fast Data Engineering at Lightbend* Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures for Streaming Applications <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, and other content from O'Reilly @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com https://github.com/deanwampler On Thu, Jun 7, 2018 at 4:27 PM, DB Tsai wrote: > Thanks Felix for bringing this up. > > Currently, in Scala 2.11.8, we initialize the Spark by overriding > loadFIles() before REPL sees any file since there is no good hook in Scala > to load our initialization code. > > In Scala 2.11.12 and newer version of the Scala 2.12.x, loadFIles() method > was removed. > > Alternatively, one way we can do in the newer version of Scala is by > overriding initializeSynchronous() suggested by Som Snytt; I have a working > PR with this approach, > https://github.com/apache/spark/pull/21495 , and this approach should > work for older version of Scala too. > > However, in the newer version of Scala, the first thing that the REPL > calls is printWelcome, so in the newer version of Scala, welcome message > will be shown and then the URL of the SparkUI in this approach. This will > cause UI inconsistencies between different versions of Scala. > > We can also initialize the Spark in the printWelcome which I feel more > hacky. It will only work for newer version of Scala since in order version > of Scala, printWelcome is called in the end of the initialization process. > If we decide to go this route, basically users can not use Scala older than > 2.11.9. > > I think this is also a blocker for us to move to newer version of Scala > 2.12.x since the newer version of Scala 2.12.x has the same issue. > > In my opinion, Scala should fix the root cause and provide a stable hook > for 3rd party developers to initialize their custom code. > > DB Tsai | Siri Open Source Technologies [not a contribution] | > Apple, Inc > > > On Jun 7, 2018, at 6:43 AM, Felix Cheung > wrote: > > > > +1 > > > > Spoke to Dean as well and mentioned the problem with 2.11.12 > https://github.com/scala/bug/issues/10913 > > > > _ > > From: Sean Owen > > Sent: Wednesday, June 6, 2018 12:23 PM > > Subject: Re: Scala 2.12 support > > To: Holden Karau > > Cc: Dean Wampler , Reynold Xin < > r...@databricks.com>, dev > > > > > > If it means no change to 2.11 support, seems OK to me for Spark 2.4.0. > The 2.12 support is separate and has never been mutually compatible with > 2.11 builds anyway. (I also hope, suspect that the changes are minimal; > tests are already almost entirely passing with no change to the closure > cleaner when built for 2.12) > > > > On Wed, Jun 6, 2018 at 1:33 PM Holden Karau > wrote: > > Just chatted with Dean @ the summit and it sounds like from Adriaan > there is a fix in 2.13 for the API change issue that could be back ported > to 2.12 so how about we try and get this ball rolling? > > > > It sounds like it would also need a closure cleaner change, which could > be backwards compatible but since it’s such a core component and we might > want to be cautious with it, we could when building for 2.11 use the old > cleaner code and for 2.12 use the new code so we don’t break anyone. > > > > How do folks feel about this? > > > > > > > >
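For context on what the hook has to run, the REPL-side initialization amounts to roughly the following. This is a simplified sketch using only public Spark 2.x APIs; the object and method names are invented and do not mirror Spark's actual repl internals:

import org.apache.spark.{SparkConf, SparkContext}

object ReplStartupSketch {
  // Create the SparkContext the shell exposes as `sc` and report the Web UI URL.
  // Whether this runs before or after printWelcome determines whether users see
  // the banner or the UI URL first, which is the ordering difference discussed above.
  def createSparkContext(): SparkContext = {
    val conf = new SparkConf()
      .setAppName("Spark shell")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    sc.uiWebUrl.foreach(url => println(s"Spark context Web UI available at $url"))
    sc
  }
}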
Re: Scala 2.12 support
Hi, Reynold, Sorry for the delay in replying; I was traveling. The Scala changes would avoid the need to change the API now. Basically, the compiler would be modified to detect the particular case of the two ambiguous, overloaded methods, then pick the best fit in a more "intelligent" way. (They can provide more specific details). This would not address the closure cleaner changes required. However, the Scala team offered to provide suggestions or review changes. dean *Dean Wampler, Ph.D.* *VP, Fast Data Engineering at Lightbend* Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures for Streaming Applications <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, and other content from O'Reilly @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com https://github.com/deanwampler On Thu, Apr 19, 2018 at 6:46 PM, Reynold Xin <r...@databricks.com> wrote: > Forking the thread to focus on Scala 2.12. > > Dean, > > There are couple different issues with Scala 2.12 (closure cleaner, API > breaking changes). Which one do you think we can address with a Scala > upgrade? (The closure cleaner one I haven't spent a lot of time looking at > it but it might involve more Spark side changes) > > On Thu, Apr 19, 2018 at 3:28 AM, Dean Wampler <deanwamp...@gmail.com> > wrote: > >> I spoke with Martin Odersky and Lightbend's Scala Team about the known >> API issue with method disambiguation. They offered to implement a small >> patch in a new release of Scala 2.12 to handle the issue without requiring >> a Spark API change. They would cut a 2.12.6 release for it. I'm told that >> Scala 2.13 should already handle the issue without modification (it's not >> yet released, to be clear). They can also offer feedback on updating the >> closure cleaner. >> >> So, this approach would support Scala 2.12 in Spark, but limited to >> 2.12.6+, without the API change requirement, but the closure cleaner would >> still need updating. Hence, it could be done for Spark 2.X. >> >> Let me if you want to pursue this approach. >> >> dean >> >> >> >> >> *Dean Wampler, Ph.D.* >> >> *VP, Fast Data Engineering at Lightbend* >> Author: Programming Scala, 2nd Edition >> <http://shop.oreilly.com/product/0636920033073.do>, Fast Data >> Architectures for Streaming Applications >> <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, >> and other content from O'Reilly >> @deanwampler <http://twitter.com/deanwampler> >> http://polyglotprogramming.com >> https://github.com/deanwampler >> >> On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin <van...@cloudera.com> >> wrote: >> >>> On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> >>> wrote: >>> > Sorry, but just to be clear here, this is the 2.12 API issue: >>> https://issues.apache.org/jira/browse/SPARK-14643, with more details in >>> this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HK >>> ixuNdxSEvo8nw_tgLgM/edit. >>> > >>> > Basically, if we are allowed to change Spark’s API a little to have >>> only one version of methods that are currently overloaded between Java and >>> Scala, we can get away with a single source three for all Scala versions >>> and Java ABI compatibility against any type of Spark (whether using Scala >>> 2.11 or 2.12). >>> >>> Fair enough. 
To play devil's advocate, most of those methods seem to >>> be marked "Experimental / Evolving", which could be used as a reason >>> to change them for this purpose in a minor release. >>> >>> Not all of them are, though (e.g. foreach / foreachPartition are not >>> experimental). >>> >>> -- >>> Marcelo >>> >>> - >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> >> >
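For readers following the thread, the disambiguation problem looks like the overload pair below. This is a self-contained illustration, not Spark's actual signatures; Spark's Java-friendly API has many method pairs of this shape (see SPARK-14643):

// A single-abstract-method trait standing in for the Java functional interfaces
// (e.g. VoidFunction) that the Java-facing overloads in Spark accept.
trait JVoidFunction[T] { def call(t: T): Unit }

class FakeRDD[T](data: Seq[T]) {
  def foreach(f: T => Unit): Unit = data.foreach(f)                     // Scala-facing overload
  def foreach(f: JVoidFunction[T]): Unit = data.foreach(t => f.call(t)) // Java-facing overload
}

object OverloadAmbiguityDemo {
  def main(args: Array[String]): Unit = {
    val rdd = new FakeRDD(Seq(1, 2, 3))
    // Under Scala 2.11 a lambda can only be a scala.Function1, so this resolves to the
    // first overload. Under Scala 2.12 the same lambda also satisfies the SAM type, so
    // the call becomes ambiguous; that is the case the proposed compiler patch targets.
    rdd.foreach(x => println(x))
  }
}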
Re: time for Apache Spark 3.0?
I spoke with Martin Odersky and Lightbend's Scala Team about the known API issue with method disambiguation. They offered to implement a small patch in a new release of Scala 2.12 to handle the issue without requiring a Spark API change. They would cut a 2.12.6 release for it. I'm told that Scala 2.13 should already handle the issue without modification (it's not yet released, to be clear). They can also offer feedback on updating the closure cleaner. So, this approach would support Scala 2.12 in Spark, but limited to 2.12.6+, without the API change requirement, but the closure cleaner would still need updating. Hence, it could be done for Spark 2.X. Let me know if you want to pursue this approach. dean *Dean Wampler, Ph.D.* *VP, Fast Data Engineering at Lightbend* Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures for Streaming Applications <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, and other content from O'Reilly @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com https://github.com/deanwampler On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin <van...@cloudera.com> wrote: > On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> > wrote: > > Sorry, but just to be clear here, this is the 2.12 API issue: > https://issues.apache.org/jira/browse/SPARK-14643, with more details in > this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit. > > > > Basically, if we are allowed to change Spark's API a little to have only > one version of methods that are currently overloaded between Java and > Scala, we can get away with a single source tree for all Scala versions > and Java ABI compatibility against any type of Spark (whether using Scala > 2.11 or 2.12). > > Fair enough. To play devil's advocate, most of those methods seem to > be marked "Experimental / Evolving", which could be used as a reason > to change them for this purpose in a minor release. > > Not all of them are, though (e.g. foreach / foreachPartition are not > experimental). > > -- > Marcelo > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: welcoming Burak and Holden as committers
Congratulations to both of you! dean *Dean Wampler, Ph.D.* Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures for Streaming Applications <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>, Functional Programming for Java Developers <http://shop.oreilly.com/product/0636920021667.do>, and Programming Hive <http://shop.oreilly.com/product/0636920023555.do> (O'Reilly) Lightbend <http://lightbend.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com https://github.com/deanwampler On Tue, Jan 24, 2017 at 6:14 PM, Xiao Li <gatorsm...@gmail.com> wrote: > Congratulations! Burak and Holden! > > 2017-01-24 10:13 GMT-08:00 Reynold Xin <r...@databricks.com>: > >> Hi all, >> >> Burak and Holden have recently been elected as Apache Spark committers. >> >> Burak has been very active in a large number of areas in Spark, including >> linear algebra, stats/maths functions in DataFrames, Python/R APIs for >> DataFrames, dstream, and most recently Structured Streaming. >> >> Holden has been a long time Spark contributor and evangelist. She has >> written a few books on Spark, as well as frequent contributions to the >> Python API to improve its usability and performance. >> >> Please join me in welcoming the two! >> >> >> >
Re: Apache Spark chat channel
Okay, here is a Gitter room for this purpose: https://gitter.im/spark-scala/Lobby If you use the APIs, please join and help those who are learning. I can't answer every question. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Lightbend <http://lightbend.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Oct 6, 2016 at 9:21 AM, Dean Wampler <deanwamp...@gmail.com> wrote: > Since I'm a Scala Spark advocate, I'll try to get a Scala Spark Gitter > channel created, one way or another. > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition > <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) > Lightbend <http://lightbend.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Thu, Oct 6, 2016 at 8:36 AM, Sean Owen <so...@cloudera.com> wrote: > >> Yes this come up once in a while. There's no need or way to stop people >> forming groups to chat, though blessing a new channel as 'official' is >> tough because it means, in theory, everyone has to follow another channel >> to see 100% of the discussion. I think that's why the couple of mailing >> lists, which can be controlled and archived by the ASF, will stay the >> official channels. But, naturally there's no problem with people forming >> unofficial communities. >> >> On Thu, Oct 6, 2016 at 2:33 PM Jan-Hendrik Zab <j...@jhz.name> wrote: >> >>> Hello! >>> >>> There was a request on scala-debate [0] to create a Spark centric chat >>> room under the scala namespace on Gitter with a focus on Scala related >>> questions. >>> >>> This is just a heads up to the Apache Spark "management" to give them a >>> chance to get involved. It might be better to create a dedicated channel >>> under the Apache umbrella to better serve all users and not only those >>> using Scala. Avoiding any artificial split of the Spark community. >>> Reasons for having such a channel can be found in the linked thread. >>> >>> ps. >>> Please CC me, since I'm not on the list. >>> >>> Best, >>> -jhz >>> >>> (Resent, something apparently ate my first e-mail.) >>> >>> [0] - https://groups.google.com/forum/#!topic/scala-debate/OVGnIU2SNmc >>> >>> - >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> >
Re: Apache Spark chat channel
Since I'm a Scala Spark advocate, I'll try to get a Scala Spark Gitter channel created, one way or another. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Lightbend <http://lightbend.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Oct 6, 2016 at 8:36 AM, Sean Owen <so...@cloudera.com> wrote: > Yes this come up once in a while. There's no need or way to stop people > forming groups to chat, though blessing a new channel as 'official' is > tough because it means, in theory, everyone has to follow another channel > to see 100% of the discussion. I think that's why the couple of mailing > lists, which can be controlled and archived by the ASF, will stay the > official channels. But, naturally there's no problem with people forming > unofficial communities. > > On Thu, Oct 6, 2016 at 2:33 PM Jan-Hendrik Zab <j...@jhz.name> wrote: > >> Hello! >> >> There was a request on scala-debate [0] to create a Spark centric chat >> room under the scala namespace on Gitter with a focus on Scala related >> questions. >> >> This is just a heads up to the Apache Spark "management" to give them a >> chance to get involved. It might be better to create a dedicated channel >> under the Apache umbrella to better serve all users and not only those >> using Scala. Avoiding any artificial split of the Spark community. >> Reasons for having such a channel can be found in the linked thread. >> >> ps. >> Please CC me, since I'm not on the list. >> >> Best, >> -jhz >> >> (Resent, something apparently ate my first e-mail.) >> >> [0] - https://groups.google.com/forum/#!topic/scala-debate/OVGnIU2SNmc >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >>
Re: Using Spark when data definitions are unknowable at compile time
I would start with using DataFrames and the Row <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row> API, because you can fetch fields by index. Presumably, you'll parse the incoming data and determine what fields have what types, etc. Or, will someone specify the schema dynamically some how? Either way, once you know the types and indices of the fields you need for a given query, you can fetch them using the Row methods. HTH, dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Lightbend <http://lightbend.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Apr 28, 2016 at 11:34 AM, _na <nikhila.alb...@seeq.com> wrote: > We are looking to incorporate Spark into a timeseries data investigation > application, but we are having a hard time transforming our workflow into > the required transformations-on-data model. The crux of the problem is that > we don’t know a priori which data will be required for our transformations. > > For example, a common request might be `average($series2.within($ranges))`, > where in order to fetch the right sections of data from $series2, $ranges > will need to be computed first and then used to define data boundaries. > > Is there a way to get around the need to define data first in Spark? > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Spark-when-data-definitions-are-unknowable-at-compile-time-tp17371.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
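A minimal sketch of that approach, assuming the schema is only discovered at runtime (as for the average in the original question). The column name is a placeholder for whatever the parsed request asks for:

import org.apache.spark.sql.DataFrame

// Average a column chosen at runtime by name, reading values out of generic Rows.
def averageOfColumn(df: DataFrame, colName: String): Double = {
  val idx = df.schema.fieldIndex(colName)       // resolve the field's position once
  df.rdd.map(row => row.getDouble(idx)).mean()  // fetch by index from each Row
  // Alternatively, row.getAs[Double](colName) fetches by name directly.
}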
Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle
A few other reasons to drop 2.10 support sooner rather than later. - We at Lightbend are evaluating some fundamental changes to the REPL to make it work better for large heaps, especially for Spark. There are other recent and planned enhancements. This work will benefit notebook users, too. However, we won't back port these improvements to 2.10. - Scala 2.12 is coming out midyear. It will require Java 8, which means it will produce dramatically smaller code (by exploiting lambdas instead of custom class generation for functions) and it will offer some performance improvements. Hopefully Spark will support it as an optional Scala version relatively quickly after availability, which means it would be nice to avoid supporting 3 versions of Scala. Using Scala 2.10 at this point is like using Java 1.6, seriously out of date. If you're using libraries that still require 2.10, are you sure that library is being properly maintained? Or is it a legacy dependency that should be eliminated before it becomes a liability? Even if you can't upgrade Scala versions in the next few months, you can certainly continue using Spark 1.X until you're ready to upgrade. So, I recommend that Spark 2.0 drop Scala 2.10 support from the beginning. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Lightbend <http://lightbend.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Tue, Apr 5, 2016 at 8:54 PM, Kostas Sakellis <kos...@cloudera.com> wrote: > From both this and the JDK thread, I've noticed (including myself) that > people have different notions of compatibility guarantees between major and > minor versions. > A simple question I have is: What compatibility can we break between minor > vs. major releases? > > It might be worth getting on the same page wrt compatibility guarantees. > > Just a thought, > Kostas > > On Tue, Apr 5, 2016 at 4:39 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > >> One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is >> deprecation warnings in our builds that we can't fix without introducing a >> wrapper/ scala version specific code. This isn't a big deal, and if we drop >> 2.10 in the 3-6 month time frame talked about we can cleanup those warnings >> once we get there. >> >> On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors < >> raymond.honderd...@sizmek.com> wrote: >> >>> What about a seperate branch for scala 2.10? >>> >>> >>> >>> Sent from my Samsung Galaxy smartphone. >>> >>> >>> Original message >>> From: Koert Kuipers <ko...@tresata.com> >>> Date: 4/2/2016 02:10 (GMT+02:00) >>> To: Michael Armbrust <mich...@databricks.com> >>> Cc: Matei Zaharia <matei.zaha...@gmail.com>, Mark Hamstra < >>> m...@clearstorydata.com>, Cody Koeninger <c...@koeninger.org>, Sean >>> Owen <so...@cloudera.com>, dev@spark.apache.org >>> Subject: Re: Discuss: commit to Scala 2.10 support for Spark 2.x >>> lifecycle >>> >>> as long as we don't lock ourselves into supporting scala 2.10 for the >>> entire spark 2 lifespan it sounds reasonable to me >>> >>> On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust < >>> mich...@databricks.com> wrote: >>> >>>> +1 to Matei's reasoning. >>>> >>>> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia <matei.zaha...@gmail.com >>>> > wrote: >>>> >>>>> I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the >>>>> entire 2.x line.
My vote is to keep Scala 2.10 in Spark 2.0, because it's >>>>> the default version we built with in 1.x. We want to make the transition >>>>> from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default >>>>> downloads be for Scala 2.11, so people will more easily move, but we >>>>> shouldn't create obstacles that lead to fragmenting the community and >>>>> slowing down Spark 2.0's adoption. I've seen companies that stayed on an >>>>> old Scala version for multiple years because switching it, or mixing >>>>> versions, would affect the company's entire codebase. >>>>> >>>>> Matei >>>>> >>>>> On Mar 30, 2016, at 12:08 PM, Koert Kuipers <ko...@tresata.com> wrote: >>>>> >>>>> oh wow, had no idea it got ripped out >>>>> >>>>> On Wed, Mar 30, 2016 at 11:50 AM, Mark H
Re: Akka with Spark
As Reynold said, you can still use Akka with Spark, but now it's more like using any third-party library that isn't already a Spark dependency (at least once the current Akka dependency is fully removed). Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sun, Dec 27, 2015 at 4:06 AM, Disha Shrivastava <dishu@gmail.com> wrote: > Hi All, > > I need an Akka like framework to implement model parallelism in neural > networks, an architecture similar to that given in the link > http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html. > I need to divide a big neural network ( which can't fit into the memory of > one machine) layer by layer and do message passing across actors which are > distributed across different worker machines. I found Akka to be most > suitable for the job. > > Please suggest if it can be done by any other suitable frameworks. > > Regards, > Disha > > On Sun, Dec 27, 2015 at 1:04 PM, Reynold Xin <r...@databricks.com> wrote: > >> We are just removing Spark's dependency on Akka. It has nothing to do >> with whether user applications can use Akka or not. As a matter of fact, by >> removing the Akka dependency from Spark, it becomes easier for user >> applications to use Akka, because there is no more dependency conflict. >> >> For more information, see >> https://issues.apache.org/jira/browse/SPARK-5293 >> >> On Sat, Dec 26, 2015 at 9:31 PM, Soumya Simanta <soumya.sima...@gmail.com >> > wrote: >> >>> >>> >>> Any rationale for removing Akka from Spark ? Also, what is the >>> replacement ? >>> >>> Thanks >>> >>> On Dec 27, 2015, at 8:31 AM, Dean Wampler <deanwamp...@gmail.com> wrote: >>> >>> Note that Akka is being removed from Spark. Even if it weren't, I would >>> consider keeping Akka processes separate from Spark processes, so you can >>> monitor, debug, and scale them independently. So consider streaming data >>> from Akka to Spark Streaming or go the other way, from Spark to Akka >>> Streams. >>> >>> dean >>> >>> Dean Wampler, Ph.D. >>> Author: Programming Scala, 2nd Edition >>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) >>> Typesafe <http://typesafe.com> >>> @deanwampler <http://twitter.com/deanwampler> >>> http://polyglotprogramming.com >>> >>> On Sat, Dec 26, 2015 at 12:54 PM, Ted Yu <yuzhih...@gmail.com> wrote: >>> >>>> Do you mind sharing your use case ? >>>> >>>> It may be possible to use a different approach than Akka. >>>> >>>> Cheers >>>> >>>> On Sat, Dec 26, 2015 at 10:08 AM, Disha Shrivastava < >>>> dishu@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I wanted to know how to use Akka framework with Spark starting from >>>>> basics. I saw online that Spark uses Akka framework but I am not really >>>>> sure if I can define Actors and use it in Spark. >>>>> >>>>> Also, how to integrate Akka with Spark as in how will I know how many >>>>> Akka actors are running on each of my worker machines? Can I control that? >>>>> >>>>> Please help. The only useful resource which I could find online was >>>>> Akka with Spark Streaming which was also not very clear. >>>>> >>>>> Thanks, >>>>> >>>>> Disha >>>>> >>>> >>>> >>> >> >
Re: Akka with Spark
Note that Akka is being removed from Spark. Even if it weren't, I would consider keeping Akka processes separate from Spark processes, so you can monitor, debug, and scale them independently. So consider streaming data from Akka to Spark Streaming or go the other way, from Spark to Akka Streams. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sat, Dec 26, 2015 at 12:54 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Do you mind sharing your use case ? > > It may be possible to use a different approach than Akka. > > Cheers > > On Sat, Dec 26, 2015 at 10:08 AM, Disha Shrivastava <dishu@gmail.com> > wrote: > >> Hi, >> >> I wanted to know how to use Akka framework with Spark starting from >> basics. I saw online that Spark uses Akka framework but I am not really >> sure if I can define Actors and use it in Spark. >> >> Also, how to integrate Akka with Spark as in how will I know how many >> Akka actors are running on each of my worker machines? Can I control that? >> >> Please help. The only useful resource which I could find online was Akka >> with Spark Streaming which was also not very clear. >> >> Thanks, >> >> Disha >> > >
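A sketch of the decoupled arrangement suggested above: let the Akka application publish events over a plain channel (a TCP socket here for brevity; Kafka would be the more robust choice) and have Spark Streaming consume it. The host, port, and processing logic are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AkkaToSparkBridge {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("akka-events").setMaster("local[2]") // placeholder master
    val ssc = new StreamingContext(conf, Seconds(5))

    // The Akka process writes one event per line to this socket, so the two
    // systems can be monitored, debugged, and scaled independently.
    val events = ssc.socketTextStream("akka-host", 9999)
    events.count().print()   // replace with the real per-batch processing

    ssc.start()
    ssc.awaitTermination()
  }
}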
Re: [ANNOUNCE] Spark 1.6.0 Release Preview
utor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ... 15 more 15/11/23 13:04:56 WARN NettyRpcEndpointRef: Ignore message Success(HeartbeatResponse(false)) [Stage 1:=> (2204 + 6) / 10] [Stage 1:=> (2858 + 4) / 10] [Stage 1:=> (3616 + 5) / 10] ... elided ... [Stage 1:=>(98393 + 4) / 10] [Stage 1:=>(99347 + 4) / 10] [Stage 1:=====>(99734 + 4) / 10] res1: Long = 100 scala> Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sun, Nov 22, 2015 at 4:21 PM, Michael Armbrust <mich...@databricks.com> wrote: > In order to facilitate community testing of Spark 1.6.0, I'm excited to > announce the availability of an early preview of the release. This is not a > release candidate, so there is no voting involved. However, it'd be awesome > if community members can start testing with this preview package and report > any problems they encounter. > > This preview package contains all the commits to branch-1.6 > <https://github.com/apache/spark/tree/branch-1.6> till commit > 308381420f51b6da1007ea09a02d740613a226e0 > <https://github.com/apache/spark/tree/v1.6.0-preview2>. > > The staging maven repository for this preview build can be found here: > https://repository.apache.org/content/repositories/orgapachespark-1162 > > Binaries for this preview build can be found here: > http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-bin/ > > A build of the docs can also be found here: > http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-docs/ > > The full change log for this release can be found on JIRA > <https://issues.apache.org/jira/browse/SPARK-11908?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.6.0> > . > > *== How can you help? ==* > > If you are a Spark user, you can help us test this release by taking a > Spark workload and running on this preview release, then reporting any > regressions. > > *== Major Features ==* > > When testing, we'd appreciate it if users could focus on areas that have > changed in this release. Some notable new features include: > > SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> *Parquet > Performance* - Improve Parquet scan performance when using flat schemas. > SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810> *Session * > *Management* - Multiple users of the thrift (JDBC/ODBC) server now have > isolated sessions including their own default database (i.e USE mydb) > even on shared clusters. > SPARK- <https://issues.apache.org/jira/browse/SPARK-> *Dataset > API* - A new, experimental type-safe API (similar to RDDs) that performs > many operations on serialized binary data and code generation (i.e. 
Project > Tungsten) > SPARK-1 <https://issues.apache.org/jira/browse/SPARK-1> *Unified > Memory Management* - Shared memory for execution and caching instead of > exclusive division of the regions. > SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> *Datasource > API Avoid Double Filter* - When implementing a datasource with filter > pushdown, developers can now tell Spark SQL to avoid double evaluating a > pushed-down filter. > SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> *New > improved state management* - trackStateByKey - a DStream transformation > for stateful stream processing, supersedes updateStateByKey in > functionality and performance. > > Happy testing! > > Michael > >
Re: Removing the Mesos fine-grained mode
Sounds like the right move. Simplifies things in important ways. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Nov 19, 2015 at 5:42 AM, Iulian Dragoș <iulian.dra...@typesafe.com> wrote: > Hi all, > > Mesos is the only cluster manager that has a fine-grained mode, but it's > more often than not problematic, and it's a maintenance burden. I'd like to > suggest removing it in the 2.0 release. > > A few reasons: > > - code/maintenance complexity. The two modes duplicate a lot of > functionality (and sometimes code) that leads to subtle differences or > bugs. See SPARK-10444 <https://issues.apache.org/jira/browse/SPARK-10444> and > also this thread > <https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3ccalxmp-a+aygnwsiytm8ff20-mgwhykbhct94a2hwzth1jwh...@mail.gmail.com%3E> > and MESOS-3202 <https://issues.apache.org/jira/browse/MESOS-3202> > - it's not widely used (Reynold's previous thread > <http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html> > got very few responses from people relying on it) > - similar functionality can be achieved with dynamic allocation + > coarse-grained mode > > I suggest that Spark 1.6 already issues a warning if it detects > fine-grained use, with removal in the 2.0 release. > > Thoughts? > > iulian > >
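For reference, the replacement mentioned above (coarse-grained mode plus dynamic allocation) is just configuration. A sketch; the master URL is a placeholder and the keys are standard Spark settings:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://mesos-master:2181/mesos")  // placeholder Mesos master
  .set("spark.mesos.coarse", "true")                  // coarse-grained executors
  .set("spark.dynamicAllocation.enabled", "true")     // grow and shrink executors with load
  .set("spark.shuffle.service.enabled", "true")       // external shuffle service, required for dynamic allocation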
Re: [discuss] ending support for Java 6?
FWIW, another reason to start planning for deprecation of Java 7, too, is that Scala 2.12 will require Java 8. Scala 2.12 will be released early next year. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 30, 2015 at 3:37 PM, Ted Yu yuzhih...@gmail.com wrote: +1 on ending support for Java 6. BTW from https://www.java.com/en/download/faq/java_7.xml : After April 2015, Oracle will no longer post updates of Java SE 7 to its public download sites. On Thu, Apr 30, 2015 at 1:34 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: I'm in favor of ending support for Java 6. We should also articulate a policy on how long we want to support current and future versions of Java after Oracle declares them EOL (Java 7 will be in that bucket in a matter of days). Punya On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote: something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL. i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really isn't the best. if i decide, within jenkins, to reconfigure every build to 'do the right thing' WRT java version, then i will clean up the old mess and pay down on some technical debt. or i can just install java 6 and we use that as JAVA_HOME on a build-by-build basis. this will be a few days of prep and another morning-long downtime if i do the right thing (within jenkins), and only a couple of hours the hacky way (system level). either way, we can test on java 6. :) On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote: nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months. On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote: Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue. On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote: i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless you intention is to truly make something academic i dont think that is wise. On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that’s for another thread.) (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/) On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. 
Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: Need advice for Spark newbie
Historically, many orgs. have replaced data warehouses with Hadoop clusters and used Hive along with Impala (on Cloudera deployments) or Drill (on MapR deployments) for SQL. Hive is older and slower, while Impala and Drill are newer and faster, but you typically need both for their complementary features, at least today. Spark and Spark SQL are not yet complete replacements for them, but they'll get there over time. The good news is, you can mix and match these tools, as appropriate, because they can all work with the same datasets. The challenge is all the tribal knowledge required to setup and manage Hadoop clusters, to properly organize your data for best performance for your needs, to use all these tools effectively, along with additional Hadoop ETL tools, etc. Fortunately, tools like Tableau are already integrated here. However, none of this will be as polished and integrated as what you're used to. You're trading that polish for greater scalability and flexibility. HTH. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm a newbie when it comes to Spark and Hadoop eco system in general. Our team has been predominantly a Microsoft shop that uses MS stack for most of their BI needs. So we are talking SQL server for storing relational data and SQL Server Analysis services for building MOLAP cubes for sub-second query analysis. Lately, we have been hitting degradation in our cube query response times as our data sizes grew considerably the past year. We are talking fact tables which are in 1o-100 billions of rows range and a few dimensions in the 10-100's of millions of rows. We tried vertically scaling up our SSAS server but queries are still taking few minutes. In light of this, I was entrusted with task of figuring out an open source solution that would scale to our current and future needs for data analysis. I looked at a bunch of open source tools like Apache Drill, Druid, AtScale, Spark, Storm, Kylin etc and settled on exploring Spark as the first step given it's recent rise in popularity and growing eco-system around it. Since we are also interested in doing deep data analysis like machine learning and graph algorithms on top our data, spark seems to be a good solution. I would like to build out a POC for our MOLAP cubes using spark with HDFS/Hive as the datasource and see how it scales for our queries/measures in real time with real data. Roughly, these are the requirements for our team 1. Should be able to create facts, dimensions and measures from our data sets in an easier way. 2. Cubes should be query able from Excel and Tableau. 3. Easily scale out by adding new nodes when data grows 4. Very less maintenance and highly stable for production level workloads 5. Sub second query latencies for COUNT DISTINCT measures (since majority of our expensive measures are of this type) . Are ok with Approx Distinct counts for better perf. So given these requirements, is Spark the right solution to replace our on-premise MOLAP cubes? Are there any tutorials or documentation on how to build cubes using Spark? Is that even possible? or even necessary? 
As long as our users can pivot/slice dice the measures quickly from client tools by dragging dropping dimensions into rows/columns w/o the need to join to fact table, we are ok with however the data is laid out. Doesn't have to be a cube. It can be a flat file in hdfs for all we care. I would love to chat with some one who has successfully done this kind of migration from OLAP cubes to Spark in their team or company . This is it for now. Looking forward to a great discussion. P.S. We have decided on using Azure HDInsight as our managed hadoop system in the cloud.
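As one concrete illustration of requirement 5 above (fast approximate COUNT DISTINCT measures), the DataFrame API can express it directly. A Spark shell sketch; it assumes a Spark release with the DataFrame API and Hive-registered tables, and the table and column names are invented:

import org.apache.spark.sql.functions.approxCountDistinct

val fact = sqlContext.table("sales_fact")   // the large fact table in Hive
val distinctCustomers = fact
  .groupBy("region_key", "date_key")        // the dimensions being sliced on
  .agg(approxCountDistinct("customer_id").as("distinct_customers"))  // approximate, but fast

distinctCustomers.show()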
Re: Need advice for Spark newbie
There's no support for star or snowflake models, per se. What you get with Hadoop is access to all your data and the processing power to build the ad hoc queries you want, when you need them, rather than having to figure out a schema/model in advance. I recommend that you also ask your questions on one of the Hadoop or Hive user mailing lists, where you'll find people who have moved data warehouses to Hadoop. Then you can use Spark for some of the tasks you'll do. This dev (developer) mailing list isn't really the place to discuss this anyway. (The user list would be slightly better.) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Feb 26, 2015 at 3:23 PM, Vikram Kone vikramk...@gmail.com wrote: Dean Thanks for the info. Are you saying that we can create star/snowflake data models using spark so they can be queried from tableau ? On Thursday, February 26, 2015, Dean Wampler deanwamp...@gmail.com wrote: Historically, many orgs. have replaced data warehouses with Hadoop clusters and used Hive along with Impala (on Cloudera deployments) or Drill (on MapR deployments) for SQL. Hive is older and slower, while Impala and Drill are newer and faster, but you typically need both for their complementary features, at least today. Spark and Spark SQL are not yet complete replacements for them, but they'll get there over time. The good news is, you can mix and match these tools, as appropriate, because they can all work with the same datasets. The challenge is all the tribal knowledge required to setup and manage Hadoop clusters, to properly organize your data for best performance for your needs, to use all these tools effectively, along with additional Hadoop ETL tools, etc. Fortunately, tools like Tableau are already integrated here. However, none of this will be as polished and integrated as what you're used to. You're trading that polish for greater scalability and flexibility. HTH. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm a newbie when it comes to Spark and Hadoop eco system in general. Our team has been predominantly a Microsoft shop that uses MS stack for most of their BI needs. So we are talking SQL server for storing relational data and SQL Server Analysis services for building MOLAP cubes for sub-second query analysis. Lately, we have been hitting degradation in our cube query response times as our data sizes grew considerably the past year. We are talking fact tables which are in 1o-100 billions of rows range and a few dimensions in the 10-100's of millions of rows. We tried vertically scaling up our SSAS server but queries are still taking few minutes. In light of this, I was entrusted with task of figuring out an open source solution that would scale to our current and future needs for data analysis. I looked at a bunch of open source tools like Apache Drill, Druid, AtScale, Spark, Storm, Kylin etc and settled on exploring Spark as the first step given it's recent rise in popularity and growing eco-system around it. 
Since we are also interested in doing deep data analysis like machine learning and graph algorithms on top our data, spark seems to be a good solution. I would like to build out a POC for our MOLAP cubes using spark with HDFS/Hive as the datasource and see how it scales for our queries/measures in real time with real data. Roughly, these are the requirements for our team 1. Should be able to create facts, dimensions and measures from our data sets in an easier way. 2. Cubes should be query able from Excel and Tableau. 3. Easily scale out by adding new nodes when data grows 4. Very less maintenance and highly stable for production level workloads 5. Sub second query latencies for COUNT DISTINCT measures (since majority of our expensive measures are of this type) . Are ok with Approx Distinct counts for better perf. So given these requirements, is Spark the right solution to replace our on-premise MOLAP cubes? Are there any tutorials or documentation on how to build cubes using Spark? Is that even possible? or even necessary? As long as our users can pivot/slice dice the measures quickly from client tools by dragging dropping dimensions into rows/columns w/o the need to join to fact table, we are ok with however the data is laid out. Doesn't have to be a cube. It can be a flat file in hdfs for all we care. I would love to chat with some one who has successfully done this kind of migration from OLAP cubes to Spark in their team or company . This is it for now
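Although there is no dedicated star-schema or cube designer, the equivalent query is a few DataFrame joins plus an aggregation. A Spark shell sketch assuming a release with the DataFrame API (1.4 or later) and Hive-registered tables; the table and column names are invented:

val fact    = sqlContext.table("sales_fact")
val dimDate = sqlContext.table("dim_date")
val dimGeo  = sqlContext.table("dim_geography")

val slice = fact
  .join(dimDate, "date_key")      // join the fact table to dimensions on surrogate keys
  .join(dimGeo, "geo_key")
  .groupBy("year", "region")      // slice/dice: year by region
  .sum("sales_amount")

slice.show()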
Re: best IDE for scala + spark development?
For what it's worth, I use Sublime Text + the SBT console for everything. I can live without the extra IDE features. However, if you like an IDE, the Eclipse Scala IDE 4.0 RC1 is a big improvement over previous releases. For one thing, it now supports projects using different versions of Scala, which is convenient for Spark's current 2.10.4 support and emerging 2.11 support. http://scala-ide.org/download/milestone.html Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Sun, Oct 26, 2014 at 5:06 PM, Duy Huynh duy.huynh@gmail.com wrote: i like intellij and eclipse too, but some that they are too heavy. i would love to use vim. are there are good scala plugins for vim? (i.e code completion, scala doc, etc) On Sun, Oct 26, 2014 at 12:32 PM, Jay Vyas jayunit100.apa...@gmail.com wrote: I tried the scala eclipse ide but in scala 2.10 I ran into some weird issues http://stackoverflow.com/questions/24253084/scalaide-and-cryptic-classnotfound-errors ... So I switched to IntelliJ and was much more satisfied... I've written a post on how I use fedora,sbt, and intellij for spark apps. http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html?m=1 The IntelliJ sbt plugin is imo less buggy then the eclipse scalaIDE stuff. For example, I found I had to set some special preferences Finally... given sbts automated recompile option, if you just use tmux, and vim nerdtree, with sbt , you could come pretty close to something like an IDE without all the drama .. On Oct 26, 2014, at 11:07 AM, ll duy.huynh@gmail.com wrote: i'm new to both scala and spark. what IDE / dev environment do you find most productive for writing code in scala with spark? is it just vim + sbt? or does a full IDE like intellij works out better? thanks! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/best-IDE-for-scala-spark-development-tp8965.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SparkSubmit and --driver-java-options
Try this: #!/bin/bash for x in "$@"; do echo "arg: $x" done ARGS_COPY=("$@") # Make ARGS_COPY an array, preserving each original argument as its own element. for x in "${ARGS_COPY[@]}"; do # quoted array expansion keeps quoted arguments intact echo "arg_copy: $x" done On Wed, Apr 30, 2014 at 3:51 PM, Patrick Wendell pwend...@gmail.com wrote: So I reproduced the problem here: == test.sh == #!/bin/bash for x in "$@"; do echo arg: $x done ARGS_COPY=$@ for x in $ARGS_COPY; do echo arg_copy: $x done == ./test.sh a b "c d e" f arg: a arg: b arg: c d e arg: f arg_copy: a b c d e f I'll dig around a bit more and see if we can fix it. Pretty sure we aren't passing these argument arrays around correctly in bash. On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah I think the problem is that the spark-submit script doesn't pass the argument array to spark-class in the right way, so any quoted strings get flattened. I think we'll need to figure out how to do this correctly in the bash script so that quoted strings get passed in the right way. I tried a few different approaches but finally ended up giving up; my bash-fu is apparently not strong enough. If you can make it work great, but I have -J working locally in case you give up like me. :-) -- Marcelo -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Spark 1.0.0 rc3
Thanks. I'm fine with the logic change, although I was a bit surprised to see Hadoop used for file I/O. Anyway, the jira issue and pull request discussions mention a flag to enable overwrites. That would be very convenient for a tutorial I'm writing, although I wouldn't recommend it for normal use, of course. However, I can't figure out if this actually exists. I found the spark.files.overwrite property, but that doesn't apply. Does this override flag, method call, or method argument actually exist? Thanks, Dean On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Dean, We always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't over-write an existing directory because it can cause confusing/undefined behavior if multiple jobs output to the directory (they partially clobber each other's output). https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. empty directories are a problem). - Patrick On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote: I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode File IO. The script is a variation of the Word Count script. Here are the guts: object WordCount2 { def main(args: Array[String]) = { val sc = new SparkContext(local, Word Count (2)) val input = sc.textFile(.../some/local/file).map(line = line.toLowerCase) input.cache val wc2 = input .flatMap(line = line.split(\W+)) .map(word = (word, 1)) .reduceByKey((count1, count2) = count1 + count2) wc2.saveAsTextFile(output/some/directory) sc.stop() It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception: [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) at spark.activator.WordCount2$.main(WordCount2.scala:42) at spark.activator.WordCount2.main(WordCount2.scala) ... Thoughts? On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA. 
Commit: d636772 (v1.0.0-rc3) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/ Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/ == API Changes == If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release but I'm noting a few here (not comprehensive): changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior Streaming classes have been renamed: NetworkReceiver - Receiver -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com -- Dean Wampler, Ph.D. Typesafe @deanwampler
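Pending an answer on the overwrite flag, one workaround is to remove the old output directory through the same Hadoop FileSystem API that saveAsTextFile uses. A sketch; the path is the tutorial's placeholder, and sc and wc2 refer to the WordCount2 example above:

import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = new Path("output/some/directory")
val fs: FileSystem = outputDir.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(outputDir)) {
  fs.delete(outputDir, true)   // recursively delete the previous run's results
}
wc2.saveAsTextFile(outputDir.toString)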