Re: graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Hi Reynold, Is there any way to know when an executor will no longer have any tasks? It seems to me there is no timeout that is both long enough to ensure that no more tasks will be scheduled on the executor and short enough to be appropriate to wait on during an interactive

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran
Spark has hit one of the eternal problems of OSS projects, one hit by: ant, maven, hadoop, ... anything with a plugin model. Take in the plugin: you're in control, but also down for maintenance Leave out the plugin: other people can maintain it, be more agile, etc. But you've lost control,

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Hari Shreedharan
I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they feel fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that

Spark build with scala-2.10 fails?

2016-03-19 Thread Jeff Zhang
Anyone can pass the spark build with scala-2.10 ? [info] Compiling 475 Scala sources and 78 Java sources to /Users/jzhang/github/spark/core/target/scala-2.10/classes... [error] /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:30:

Re: graceful shutdown in external data sources

2016-03-19 Thread Reynold Xin
There is no way to really know that, because users might run queries at any given point. BTW why can't your threads be just daemon threads? On Wed, Mar 16, 2016 at 3:29 PM, Dan Burkert wrote: > Hi Reynold, > > Is there any way to know when an executor will no longer have

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Imran Rashid
On Thu, Mar 17, 2016 at 2:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > Certainly PMC votes are not necessary on *every* code deletion. I don't think there is a very clear rule on when such discussion is warranted, just a soft

Re: Various forks

2016-03-19 Thread Xiangrui Meng
We made that fork to hide package private classes/members in the generated Java API doc. Otherwise, the Java API doc is very messy. The patch is to map all private[*] to the default scope in the generated Java code. However, this might not be the expected behavior for other packages. So it didn't

Re: graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Hi Steve, I referenced the ShutdownHookManager in my original message, but it appears to be an internal-only API. Looks like it uses a Hadoop equivalent internally, though, so I'll look into using that. Good tip about timeouts, thanks. - Dan On Thu, Mar 17, 2016 at 5:02 AM, Steve Loughran
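Both Spark's private[spark] ShutdownHookManager and the Hadoop equivalent Dan mentions wrap the plain JVM shutdown-hook API, so a connector can fall back to that directly. A minimal sketch under that assumption — the connection being cleaned up is hypothetical, and real code would close it inside the hook body:

```scala
// Sketch using the plain JVM shutdown-hook API that both Spark's
// (private[spark]) ShutdownHookManager and Hadoop's
// org.apache.hadoop.util.ShutdownHookManager build on.
object ConnectionCleanup {
  @volatile var cleaned = false

  private val hook = new Thread(new Runnable {
    def run(): Unit = {
      // A real connector would close its (hypothetical) cluster
      // connection here; the flag just records that the hook ran.
      cleaned = true
    }
  })

  def register(): Thread = {
    Runtime.getRuntime.addShutdownHook(hook)
    hook
  }

  // removeShutdownHook returns true only if the hook was registered.
  def unregister(): Boolean = Runtime.getRuntime.removeShutdownHook(hook)
}
```

Hadoop's manager adds priority ordering on top of this, which matters if the cleanup must run before or after other hooks (e.g. filesystem shutdown).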

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Cody Koeninger
There's a difference between "without discussion" and "without as much discussion as I would have liked to have a chance to notice it". There are plenty of PRs that got merged before I noticed them that I would rather have not gotten merged. As far as group / artifact name compatibility, at least

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
On Thu, Mar 17, 2016 at 12:01 PM, Cody Koeninger wrote: > i. An ASF project can clearly decide that some of its code is no > longer worth maintaining and delete it. This isn't really any > different. It's still apache licensed so ultimately whoever wants the > code can get

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
On Fri, Mar 18, 2016 at 10:09 AM, Jean-Baptiste Onofré wrote: > a project can have multiple repos: it's what we have in ServiceMix, in > Karaf. > For the *-extra on github, if the code has been in the ASF, the PMC members > have to vote to move the code on *-extra. That's

Re: graceful shutdown in external data sources

2016-03-19 Thread Reynold Xin
Maybe just add a watchdog thread and close the connection upon some timeout? On Wednesday, March 16, 2016, Dan Burkert wrote: > Hi all, > > I'm working on the Spark connector for Apache Kudu, and I've run into an > issue that is a bit beyond my Spark knowledge. The Kudu
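Reynold's suggestion can be sketched as a daemon watchdog thread that closes the connection once it has been idle past a timeout; being a daemon, it never keeps the JVM alive on its own. The connection class and names below are hypothetical stand-ins, not the Kudu client API:

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}

// Hypothetical stand-in for a connection held by an external data source.
class FakeConnection {
  val closed = new AtomicBoolean(false)
  def close(): Unit = closed.set(true)
}

// A daemon watchdog that closes the connection once it has been idle
// longer than `timeoutMs`. Callers refresh the idle clock via touch().
class Watchdog(conn: FakeConnection, timeoutMs: Long) {
  private val lastUsed = new AtomicLong(System.currentTimeMillis())
  def touch(): Unit = lastUsed.set(System.currentTimeMillis())

  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      while (!conn.closed.get()) {
        if (System.currentTimeMillis() - lastUsed.get() > timeoutMs) {
          conn.close()
        }
        Thread.sleep(10)
      }
    }
  })
  thread.setDaemon(true) // key point: daemon threads don't block JVM exit
  thread.start()
}
```

The trade-off raised later in the thread still applies: any fixed timeout is a guess, since a new query could arrive after the connection is closed, so the connection would also need lazy reopening.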

SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
Hello all, Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo. While I think the idea is great, I'm a little worried about the execution. Some concerns were already raised on the bug mentioned above, but I'd like to have a

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
No, I didn't yet - feel free to create a JIRA. On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann wrote: > Hi Nick, > > Thanks again for your help with this. Did you create a ticket in JIRA for > investigating sparse models in LR and / or multivariate summariser? If so,

Re: pull request template

2016-03-19 Thread Bryan Cutler
+1 on Marcelo's comments. It would be nice not to pollute commit messages with the instructions because some people might forget to remove them. Nobody has suggested removing the template. On Tue, Mar 15, 2016 at 3:59 PM, Joseph Bradley wrote: > +1 for keeping the

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Mridul Muralidharan
I was not aware of a discussion in Dev list about this - agree with most of the observations. In addition, I did not see PMC signoff on moving (sub-)modules out. Regards Mridul On Thursday, March 17, 2016, Marcelo Vanzin wrote: > Hello all, > > Recently a lot of the

PySpark API divergence + improving pandas interoperability

2016-03-19 Thread Wes McKinney
hi everyone, I've recently gotten moving on solving some of the low-level data interoperability problems between Python's NumPy-focused scientific computing and data libraries like pandas and the rest of the big data ecosystem, Spark being a very important part of that. One of the major efforts

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
I tried again this morning : $ wget https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz --2016-03-18 07:55:30-- https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz Resolving s3.amazonaws.com... 54.231.19.163 ... $ tar zxf

Re: Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-19 Thread Pete Robbins
There are several open Jiras to add new Sinks OpenTSDB https://issues.apache.org/jira/browse/SPARK-12194 StatsD https://issues.apache.org/jira/browse/SPARK-11574 Kafka https://issues.apache.org/jira/browse/SPARK-13392 Some have PRs from 2015 so I'm assuming there is not the desire to integrate

Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz Does anyone else have trouble unzipping this? How did this happen? What I get is: $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file gzip:

SparkContext.stop() takes too long to complete

2016-03-19 Thread Nezih Yigitbasi
Hi Spark experts, I am using Spark 1.5.2 on YARN with dynamic allocation enabled. I see in the driver/application master logs that the app is marked as SUCCEEDED and then SparkContext stop is called. However, this stop sequence takes > 10 minutes to complete, and YARN resource manager kills the

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Adam Kocoloski
> On Mar 19, 2016, at 8:32 AM, Steve Loughran wrote: > > >> On 18 Mar 2016, at 17:07, Marcelo Vanzin wrote: >> >> Hi Steve, thanks for the write up. >> >> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran >> wrote: >>> If

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Shane Curcuru
Marcelo Vanzin wrote earlier: > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. Question: why was the code removed from the Spark repo? What's the harm in keeping it available here? The ASF is perfectly happy if anyone

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Jean-Baptiste Onofré
Hi Marcelo, a project can have multiple repos: it's what we have in ServiceMix, in Karaf. For the *-extra on github, if the code has been in the ASF, the PMC members have to vote to move the code on *-extra. Regards JB On 03/18/2016 06:07 PM, Marcelo Vanzin wrote: Hi Steve, thanks for

Re: Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-19 Thread Gerard Maas
+1 On Mar 19, 2016 08:33, "Pete Robbins" wrote: > This seems to me to be unnecessarily restrictive. These are very useful > extension points for adding 3rd party sources and sinks. > > I intend to make an Elasticsearch sink available on spark-packages but > this will require

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran
> On 18 Mar 2016, at 22:24, Marcelo Vanzin wrote: > > On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann wrote: >> So, my comment here is that any code *cannot* be removed from an Apache >> project if there is a VETO issued which so far I haven't seen,

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran
> On 18 Mar 2016, at 17:07, Marcelo Vanzin wrote: > > Hi Steve, thanks for the write up. > > On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran > wrote: >> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs >> to go through

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran
> On 17 Mar 2016, at 21:33, Marcelo Vanzin wrote: > > Hi Reynold, thanks for the info. > > On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: >> If one really feels strongly that we should go through all the overhead to >> setup an ASF subproject for

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
On Linux, I got: $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz gzip: stdin: unexpected end of file tar: Unexpected EOF in archive tar: Unexpected EOF in archive tar: Error is not recoverable: exiting now On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > >

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Sean Owen
Code can be removed from an ASF project. That code can live on elsewhere (in accordance with the license). It can't be presented as part of the official ASF project, like any other 3rd party project. The package name certainly must change from org.apache.spark. I don't know of a protocol, but

Re: graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Thanks for the replies, responses inline: On Wed, Mar 16, 2016 at 3:36 PM, Reynold Xin wrote: > There is no way to really know that, because users might run queries at > any given point. > > BTW why can't your threads be just daemon threads? > The bigger issue is that we

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
Note the non-kafka bug was filed right before the change was pushed. So there really wasn't any discussion before the decision was made to remove that code. I'm just trying to merge both discussions here in the list where it's a little bit more dynamic than bug updates that end up getting lost in

Fwd: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread satyajit vegesna
Hi, Scala version: 2.11.7 (had to upgrade the Scala version to enable case classes to accept more than 22 parameters.) Spark version: 1.6.1. PFB pom.xml Getting below error when trying to setup spark on intellij IDE, 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation: https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna wrote: > > > Hi, > > Scala version: 2.11.7 (had to upgrade the Scala version to enable case

graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Hi all, I'm working on the Spark connector for Apache Kudu, and I've run into an issue that is a bit beyond my Spark knowledge. The Kudu connector internally holds an open connection to the Kudu cluster

Request for comments: Tensorframes, an integration library between TensorFlow and Spark DataFrames

2016-03-19 Thread Tim Hunter
Hello all, I would like to bring your attention to a small project to integrate TensorFlow with Apache Spark, called TensorFrames. With this library, you can map, reduce or aggregate numerical data stored in Spark dataframes using TensorFlow computation graphs. It is published as a Spark package

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
Also, just wanted to point out something: On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > Thanks for initiating this discussion. I merged the pull request because it > was unblocking another major piece of work for Spark 2.0: not requiring > assembly jars While I do

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself, so you'll have to change your deps to use the 2.11 versions of Spark. e.g. spark-streaming_2.10 -> spark-streaming_2.11. On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen wrote: > See the instructions
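Josh's fix — matching the application's Spark artifacts to its Scala binary version — is easiest to get right in sbt, where the `%%` operator appends the suffix automatically. A build.sbt sketch under the versions mentioned in the thread (1.6.1 / 2.11.7); module names beyond spark-core and spark-streaming are up to the application:

```scala
// build.sbt fragment. With scalaVersion 2.11.x, %% resolves
// spark-streaming_2.11 rather than spark-streaming_2.10, avoiding the
// scala/collection/GenTraversableOnce$class NoClassDefFoundError that
// mixing Scala 2.10 artifacts into a 2.11 app produces.
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"
)
```

In a Maven pom the suffix is spelled out by hand, so each `artifactId` (e.g. `spark-streaming_2.10`) has to be edited to `_2.11` individually.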

Re: pull request template

2016-03-19 Thread Reynold Xin
I think it'd make sense to have the merge script automatically remove some parts of the template, if they were not removed by the contributor. That seems trivial to do. On Tue, Mar 15, 2016 at 3:59 PM, Joseph Bradley wrote: > +1 for keeping the template > > I figure any

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
OK cool. I'll test the hadoop-2.6 package and check back here if it's still broken. Just curious: How did those packages all get corrupted (if we know)? Seems like a strange thing to happen. On Thu, Mar 17, 2016 at 11:57 AM, Michael Armbrust wrote: > Patrick reuploaded the

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Mridul Muralidharan
We use it in executors to get to : a) spark conf (for getting to hadoop config in map doing custom writing of side-files) b) Shuffle manager (to get shuffle reader) Not sure if there are alternative ways to get to these. Regards, Mridul On Wed, Mar 16, 2016 at 2:52 PM, Reynold Xin

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Reynold Xin
Thanks for initiating this discussion. I merged the pull request because it was unblocking another major piece of work for Spark 2.0: not requiring assembly jars, which is arguably a lot more important than sources that are less frequently used. I take full responsibility for that. I think it's

Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-19 Thread Pete Robbins
This seems to me to be unnecessarily restrictive. These are very useful extension points for adding 3rd party sources and sinks. I intend to make an Elasticsearch sink available on spark-packages but this will require a single class, the sink, to be in the org.apache.spark package tree. I could
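The extension point under discussion is small: Spark 1.6's sink trait is just lifecycle plus a report method, but it is `private[spark]`, which is why a third-party sink must today live under the org.apache.spark package tree. A self-contained sketch mirroring that shape — the trait and the Elasticsearch-style sink below are local toy stand-ins, not Spark's actual classes (the real sink is also constructed reflectively with a Properties/MetricRegistry/SecurityManager constructor):

```scala
import scala.collection.mutable.ArrayBuffer

// Local stand-in mirroring the shape of Spark 1.6's private[spark]
// org.apache.spark.metrics.sink.Sink trait. Not Spark's API.
trait Sink {
  def start(): Unit
  def stop(): Unit
  def report(): Unit
}

// A hypothetical Elasticsearch-style sink that records what it would
// ship; a real implementation would batch metric snapshots to an ES
// endpoint between start() and stop().
class ElasticsearchSink(index: String) extends Sink {
  val shipped = ArrayBuffer.empty[String]
  def start(): Unit = shipped += s"connect:$index"
  def report(): Unit = shipped += s"report:$index"
  def stop(): Unit = shipped += s"disconnect:$index"
}
```

Making the real trait public would let sinks like this live in their own packages and ship via spark-packages without the org.apache.spark namespace trick Pete describes.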

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Mridul Muralidharan
I am not referring to code edits - but to migrating submodules and code currently in Apache Spark to 'outside' of it. If I understand correctly, assets from Apache Spark are being moved out of it into third-party external repositories - not owned by Apache. At a minimum, dev@ discussion (like this

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
Looks like the other packages may also be corrupt. I’m getting the same error for the Spark 1.6.1 / Hadoop 2.4 package. https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz Nick ​ On Wed, Mar 16, 2016 at 8:28 PM Ted Yu wrote: > On Linux, I got: >

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
Same with hadoop 2.3 tar ball: $ tar zxf spark-1.6.1-bin-hadoop2.3.tgz gzip: stdin: unexpected end of file tar: Unexpected EOF in archive tar: Unexpected EOF in archive tar: Error is not recoverable: exiting now On Wed, Mar 16, 2016 at 5:47 PM, Nicholas Chammas < nicholas.cham...@gmail.com>
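The "unexpected end of file" errors Ted and Nicholas are seeing are the signature of a truncated download: gzip's trailer never arrives, so decompression fails partway through. A small JVM demonstration of the same failure mode using java.util.zip (the payload and truncation point are arbitrary):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Compress a byte array with gzip.
def gzip(data: Array[Byte]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(bos)
  gz.write(data)
  gz.close()
  bos.toByteArray
}

// Decompress fully; throws an IOException (typically EOFException,
// "Unexpected end of ZLIB input stream") if the stream is cut short.
def gunzipAll(bytes: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
  val out = new ByteArrayOutputStream()
  val buf = new Array[Byte](4096)
  var n = in.read(buf)
  while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  out.toByteArray
}

val payload = Array.fill[Byte](100000)(42)
val full = gzip(payload)
val truncated = full.take(full.length / 2) // simulate a partial download
```

This is also why `gzip -t file.tgz` (or comparing the file against a published checksum) is a cheap sanity check before unpacking a release artifact.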

Re: graceful shutdown in external data sources

2016-03-19 Thread Steve Loughran
On 17 Mar 2016, at 17:46, Dan Burkert > wrote: Looks like it uses a Hadoop equivalent internally, though, so I'll look into using that. Good tip about timeouts, thanks. Don't think that's actually tagged as @Public, but it would upset too many

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Daniel Siegmann
Hi Nick, Thanks again for your help with this. Did you create a ticket in JIRA for investigating sparse models in LR and / or multivariate summariser? If so, can you give me the issue key(s)? If not, would you like me to create these tickets? I'm going to look into this some more and see if I

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread chrismattmann
So, my comment here is that any code *cannot* be removed from an Apache project if there is a VETO issued which so far I haven't seen, though maybe Marcelo can clarify that. However if a VETO was issued, then the code cannot be removed and must be put back. Anyone can fork anything our license

Re: Spark build with scala-2.10 fails?

2016-03-19 Thread Yin Yang
The build was broken as of this morning. Created PR: https://github.com/apache/spark/pull/11787 On Wed, Mar 16, 2016 at 11:46 PM, Jeff Zhang wrote: > Anyone can pass the spark build with scala-2.10 ? > > > [info] Compiling 475 Scala sources and 78 Java sources to >

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Luciano Resende
If the intention is to actually decouple these connectors and give them a life of their own, I would have expected that they would still be hosted as different git repositories inside Apache even though users will not really see much difference as they would still be mirrored in GitHub. This makes it