Re: Can I add a new method to RDD class?

2016-12-06 Thread Jakob Odersky
... add > new RDD methods in. > > How can I specify a custom version? Modify version numbers in all the > pom.xml files? > > > > On Dec 5, 2016, at 9:12 PM, Jakob Odersky <ja...@odersky.com> wr

Re: Can I add a new method to RDD class?

2016-12-05 Thread Jakob Odersky
It looks like you're having issues with including your custom spark version (with the extensions) in your test project. To use your local spark version: 1) make sure it has a custom version (let's call it 2.1.0-CUSTOM) 2) publish it to your local machine with `sbt publishLocal` 3) include the
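
A minimal sketch of how the test project could then depend on the custom build, assuming it was published locally as version 2.1.0-CUSTOM (the version name and module list here are only examples):

```scala
// build.sbt of the test project (sketch, not an authoritative setup).
// `sbt publishLocal` writes to the local Ivy repository, which sbt
// resolves from by default, so no extra resolver is needed.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0-CUSTOM",
  "org.apache.spark" %% "spark-sql"  % "2.1.0-CUSTOM"
)
```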

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-04 Thread Jakob Odersky
Hi everyone, is there any ongoing discussion/documentation on the redesign of sinks? I think it could be a good thing to abstract away the underlying streaming model; however, that isn't directly related to Holden's first point. The way I understand it is to slightly change the DataStreamWriter

Re: Running Spark master/slave instances in non Daemon mode

2016-10-03 Thread Jakob Odersky
> command and binds to the output fds from that process, so daemonizing is > causing us minor hardship and seems like an easy thing to make optional. > We'd be happy to make the PR as well. > > --Mike > > On Thu, Sep 29, 2016 at 5:25 PM, Jakob Odersky <ja...@odersky

Re: java.util.NoSuchElementException when serializing Map with default value

2016-10-03 Thread Jakob Odersky
Hi Kabeer, which version of Spark are you using? I can't reproduce the error in latest Spark master. regards, --Jakob

Re: Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread Jakob Odersky
I'm curious, what kind of container solutions require foreground processes? Most init systems work fine with "starter" processes that run other processes. IIRC systemd and start-stop-daemon have an option called "fork", that will expect the main process to run another one in the background and

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Jakob Odersky
I agree with Sean's answer, you can check out the relevant serializer here https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen wrote: > My guess is that Kryo specially handles
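
For context, a small sketch of the round trip this thread is about, using Spark's Kryo-backed serializer (which delegates to the chill serializers linked above); whether the default value survives the round trip, or serialization fails, is exactly the question under discussion:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object MapDefaultValueRoundTrip {
  def main(args: Array[String]): Unit = {
    // Serializer instance backed by Kryo, as Spark uses internally when
    // spark.serializer is set to KryoSerializer.
    val ser = new KryoSerializer(new SparkConf()).newInstance()

    val withDefault = Map("a" -> 1).withDefaultValue(0)
    val copy = ser.deserialize[Map[String, Int]](ser.serialize(withDefault))

    println(withDefault("missing")) // 0, thanks to the default value
    println(copy.get("missing"))    // the default may be lost after the round trip
  }
}
```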

Re: What's the use of RangePartitioner.hashCode

2016-09-22 Thread Jakob Odersky
Hash codes should try to avoid collisions of objects that are not equal. Integer overflow is not an issue by itself. On Wed, Sep 21, 2016 at 10:49 PM, WangJianfei wrote: > Thank you very much sir! But what I want to know is whether the hashcode > overflow will

Re: What's the use of RangePartitioner.hashCode

2016-09-21 Thread Jakob Odersky
...that a.hashCode == b.hashCode when > a.equals(b), the bidirectional case is usually harder to satisfy due to > possibility of collisions. > > Good info: > http://www.programcreek.com/2011/07/java-equals-and-hashcode-contract/ > _____ > From: Jakob Odersky <

Re: What's the use of RangePartitioner.hashCode

2016-09-21 Thread Jakob Odersky
Hi, It is used jointly with a custom implementation of the `equals` method. In Scala, you can override the `equals` method to change the behaviour of `==` comparison. An example of this would be to compare classes based on their parameter values (i.e. what case classes do). Partitioners aren't
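
A generic sketch of the pattern described here (a hypothetical class, not the actual RangePartitioner code): equals compares parameter values, and hashCode is kept consistent with it so that a.equals(b) implies a.hashCode == b.hashCode:

```scala
// Hypothetical example of value-based equality, similar in spirit to what
// case classes generate automatically.
class Bucket(val lower: Int, val upper: Int) {
  override def equals(other: Any): Boolean = other match {
    case b: Bucket => b.lower == lower && b.upper == upper
    case _         => false
  }

  // Must agree with equals: objects that are equal produce the same hash code.
  override def hashCode: Int = 31 * lower + upper
}

// new Bucket(0, 10) == new Bucket(0, 10) is now true, and both instances land
// in the same bucket of hash-based collections.
```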

Re: java.lang.NoClassDefFoundError, is this a bug?

2016-09-21 Thread Jakob Odersky
Hi Xiang, this error also appears in client mode (maybe the situation that you were referring to and that worked was local mode?); however, the error is expected and is not a bug. This line in your snippet: object Main extends A[String] { //... is, after desugaring, equivalent to: object
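
The explanation above is truncated; as a hedged illustration of the general desugaring point (hypothetical names, not the code from the original thread): statements and val initializers written in an object body become part of the object's constructor, which the JVM runs on first access, so a failure there can later surface as NoClassDefFoundError instead of pointing at the original line:

```scala
// Hypothetical example only.
abstract class A[T] {
  def setup(): T
}

object Main extends A[String] {
  def setup(): String = sys.error("boom")

  // This initializer runs when Main is first referenced, not inside main().
  // If it throws, the class fails to initialize and later references to Main
  // are reported as NoClassDefFoundError.
  val config: String = setup()

  def main(args: Array[String]): Unit = println(config)
}
```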

Re: Test fails when compiling spark with tests

2016-09-13 Thread Jakob Odersky
There are some flaky tests that occasionally fail; my first recommendation would be to re-run the test suite. Another thing to check is if there are any applications listening to Spark's default ports. Btw, what is your environment like? In case it is Windows, I don't think tests are regularly run

Re: @scala.annotation.varargs or @_root_.scala.annotation.varargs?

2016-09-08 Thread Jakob Odersky
+1 to Sean's answer, importing varargs. In this case the _root_ is also unnecessary (it would be required in case you were using it in a nested package called "scala" itself) On Thu, Sep 8, 2016 at 9:27 AM, Sean Owen wrote: > I think the @_root_ version is redundant because >
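
A small sketch of the recommended style (the method here is made up, only to show the annotation in use):

```scala
import scala.annotation.varargs

object StringUtil {
  // @varargs makes this Scala varargs method callable from Java as
  // join(String, String...) by generating a Java-style forwarder.
  @varargs
  def join(separator: String, parts: String*): String =
    parts.mkString(separator)
}
```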

Re: help getting started

2016-09-02 Thread Jakob Odersky
Hi Dayne, you can look at this page for some starter issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened). Also check out this guide on how to contribute to Spark

Re: SBT doesn't pick resource file after clean

2016-05-20 Thread Jakob Odersky
implemented. > > However, even when generating the file under the default resourceDirectory => > core/src/resources, the file isn't picked up in the jar after doing a clean. So this > seems to be a different issue. > > On Thu, May 19, 2016 at 4:17 PM, Jakob Oders

Re: SBT doesn't pick resource file after clean

2016-05-19 Thread Jakob Odersky
To echo my comment on the PR: I think the "sbt way" to add extra, generated resources to the classpath is by adding a new task to the `resourceGenerators` setting. Also, the task should output any files into the directory specified by the `resourceManaged` setting. See
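
A build.sbt sketch of that suggestion (sbt 0.13-style syntax from around the time of the thread; the generated file name and contents are placeholders):

```scala
// Register a resource generator that writes a file into resourceManaged;
// sbt re-runs the task as needed, so the file is packaged into the jar
// even after a clean.
resourceGenerators in Compile += Def.task {
  val out = (resourceManaged in Compile).value / "build-info.properties"
  IO.write(out, s"version=${version.value}\n")
  Seq(out)
}.taskValue
```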

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Jakob Odersky
I just found out how the hash is calculated: gpg --print-md sha512 .tgz. You can use that to check whether the resulting output matches the contents of .tgz.sha. On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com> wrote: > The published hash is a SHA512. > > You can verif

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Jakob Odersky
Is someone going to retry fixing these packages? It's still a problem. >>>> >>>> Also, it would be good to understand why this is happening. >>>> >>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote: >>

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Jakob Odersky
I mean from the perspective of someone developing Spark, it makes things more complicated. It's just my point of view, people that actually support Spark deployments may have a different opinion ;) On Thu, Mar 24, 2016 at 2:41 PM, Jakob Odersky <ja...@odersky.com> wrote: > You can, but s

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Jakob Odersky
You can, but since it's going to be a maintainability issue I would argue it is in fact a problem. On Thu, Mar 24, 2016 at 2:34 PM, Marcelo Vanzin <van...@cloudera.com> wrote: > Hi Jakob, > > On Thu, Mar 24, 2016 at 2:29 PM, Jakob Odersky <ja...@odersky.com> wrote:

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Jakob Odersky
Reynold's 3rd point is particularly strong in my opinion. Supporting Scala 2.12 will require Java 8 anyway, and introducing such a change is probably best done in a major release. Consider what would happen if Spark 2.0 doesn't require Java 8 and hence not support Scala 2.12. Will it be stuck on

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Jakob Odersky
I just experienced the issue, however retrying the download a second time worked. Could it be that there is some load balancer/cache in front of the archive and some nodes still serve the corrupt packages? On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas wrote: > I'm

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Jakob Odersky
com> wrote: > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt ZIP > file. > > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same Spark > 1.6.1/Hadoop 2.6 package you had a success with? > > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersk

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Jakob Odersky
I would recommend (non-binding) option 1. Apart from the API breakage I can see only advantages, and that sole disadvantage is minimal for a few reasons: 1. the DataFrame API has been "Experimental" since its implementation, so no stability was ever implied 2. considering that the change is for

Re: Scala 2.11 default build

2016-02-01 Thread Jakob Odersky
Awesome! +1 on Steve Loughran's question, how does this affect support for 2.10? Do future contributions need to work with Scala 2.10? cheers On Mon, Feb 1, 2016 at 7:02 AM, Ted Yu wrote: > The following jobs have been established for build against Scala 2.10: > >

Re: spark job scheduling

2016-01-27 Thread Jakob Odersky
Nitpick: the up-to-date version of said wiki page is https://spark.apache.org/docs/1.6.0/job-scheduling.html (not sure how much it changed though) On Wed, Jan 27, 2016 at 7:50 PM, Chayapan Khannabha wrote: > I would start at this wiki page >

Mutiple spark contexts

2016-01-27 Thread Jakob Odersky
A while ago, I remember reading that multiple active Spark contexts per JVM was a possible future enhancement. I was wondering if this is still on the roadmap, what the major obstacles are and if I can be of any help in adding this feature? regards, --Jakob

Re: Fastest way to build Spark from scratch

2015-12-07 Thread Jakob Odersky
make-distribution and the second code snippet both create a distribution from a clean state. They therefore require that every source file be compiled and that takes time (you can maybe tweak some settings or use a newer compiler to gain some speed). I'm inferring from your question that for your

Datasets on experimental dataframes?

2015-11-23 Thread Jakob Odersky
Hi, datasets are being built upon the experimental DataFrame API, does this mean DataFrames won't be experimental in the near future? thanks, --Jakob

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff, Do you mean reading from multiple text files? In that case, as a workaround, you can use the RDD#union() (or ++) method to concatenate multiple rdds. For example: val lines1 = sc.textFile("file1") val lines2 = sc.textFile("file2") val rdd = lines1 union lines2 regards, --Jakob On 11
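
The same workaround formatted as a runnable sketch (assuming an existing SparkContext named sc and placeholder file names):

```scala
// Read each input separately and concatenate the resulting RDDs.
val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")
val allLines = lines1 union lines2 // equivalently: lines1 ++ lines2
```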

Re: State of the Build

2015-11-06 Thread Jakob Odersky
it will change. >> > >> > Any improvements for the sbt build are of course welcome (it is still >> used >> > by many developers), but I would not do anything that increases the >> burden >> > of maintaining two build systems. >> >

State of the Build

2015-11-05 Thread Jakob Odersky
Hi everyone, in the process of learning Spark, I wanted to get an overview of the interaction between all of its sub-projects. I therefore decided to have a look at the build setup and its dependency management. Since I am a lot more comfortable using sbt than maven, I decided to try to port the

Re: Insight into Spark Packages

2015-10-16 Thread Jakob Odersky
[repost to mailing list] I don't know much about packages, but have you heard about the sbt-spark-package plugin? Looking at the code, specifically https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackagePlugin.scala, might give you insight on the

Re: Spark Event Listener

2015-10-13 Thread Jakob Odersky
The path of the source file defining the event API is `core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala` On 13 October 2015 at 16:29, Jakob Odersky <joder...@gmail.com> wrote: > Hi, > I came across the spark listener API while checking out possible UI > extensi

Spark Event Listener

2015-10-13 Thread Jakob Odersky
Hi, I came across the spark listener API while checking out possible UI extensions recently. I noticed that all events inherit from a sealed trait `SparkListenerEvent` and that a SparkListener has a corresponding `onEventXXX(event)` method for every possible event. Considering that events inherit
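
A sketch of a listener built against that API (event and callback names are from the public scheduler API; registration assumes an active SparkContext named sc):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Override only the callbacks of interest; the remaining onXXX methods keep
// their no-op defaults.
class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended with result ${jobEnd.jobResult}")
}

// Registration: sc.addSparkListener(new JobLoggingListener)
```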

Live UI

2015-10-12 Thread Jakob Odersky
Hi everyone, I am just getting started working on spark and was thinking of a first way to contribute whilst still trying to wrap my head around the codebase. Exploring the web UI, I noticed it is a classic request-response website, requiring manual refresh to get the latest data. I think it