Re: graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Hi Reynold,

Is there any way to know when an executor will no longer have any tasks?
It seems to me there is no timeout which is appropriate that is long enough
to ensure that no more tasks will be scheduled on the executor, and short
enough to be appropriate to wait on during an interactive shell shutdown.

- Dan

On Wed, Mar 16, 2016 at 2:40 PM, Reynold Xin  wrote:

> Maybe just add a watch dog thread and close the connection upon some
> timeout?
>
>
> On Wednesday, March 16, 2016, Dan Burkert  wrote:
>
>> Hi all,
>>
>> I'm working on the Spark connector for Apache Kudu, and I've run into an
>> issue that is a bit beyond my Spark knowledge. The Kudu connector
>> internally holds an open connection to the Kudu cluster, which
>> internally holds a Netty context with non-daemon threads. When using the
>> Spark shell with the Kudu connector, exiting the shell via Ctrl-D causes
>> the shell to hang, and a thread dump reveals it's waiting for these
>> non-daemon threads.  Registering a JVM shutdown hook to close the Kudu
>> client does not do the trick, as it seems that the shutdown hooks are not
>> fired on Ctrl-D.
>>
>> I see that there is an internal Spark API for handling shutdown;
>> is there something similar available for cleaning up external data sources?
>>
>> - Dan
>>
>


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran

Spark has hit one of the eternal problems of OSS projects, one hit by ant, 
maven, hadoop, ... anything with a plugin model.

Take in the plugin: you're in control, but you're also on the hook for maintenance.

Leave out the plugin: other people can maintain it, be more agile, etc.

But you've lost control, and you can't even manage the links. Here I think 
maven suffered the most by keeping stuff in codehaus; migrating off there is 
still hard —not only did they lose the links: they lost the JIRA.

Maven's relationship with codehaus was very tightly coupled, lots of committers 
on both; I don't know how that relationship was handled at a higher level.


On 17 Mar 2016, at 20:51, Hari Shreedharan 
> wrote:

I have worked with various ASF projects for 4+ years now. Sure, ASF projects 
can delete code as they feel fit. But this is the first time I have really seen 
code being "moved out" of a project without discussion. I am sure you can do 
this without violating ASF policy, but the explanation for that would be 
convoluted (someone decided to make a copy and then the ASF project deleted 
it?).

+1 for discussion. Dev changes should go to the dev list; the PMC for process in 
general. Don't think the ASF will overlook stuff like that.

Might want to raise this issue in the next board report.


FWIW, it may be better to just see if you can get committers to work on these 
projects: recruit the people and say "please, only work in this area —for now". 
That gets developers on your team, which is generally considered a metric of 
health in a project.

Or, as Cody Koeninger suggests, have a spark-extras project in the ASF, focused 
on the extras, with its own support channel.


Also, moving the code out would break compatibility. AFAIK, there is no way to 
push org.apache.* artifacts directly to maven central. That happens via 
mirroring from the ASF maven repos. Even if you could somehow directly push 
the artifacts to mvn, you really can push to org.apache.* groups only if you 
are part of the repo and acting as an agent of that project (which in this case 
would be Apache Spark). Once you move the code out, even a committer/PMC member 
would not be representing the ASF when pushing the code. I am not sure if there 
is a way to fix this issue.




This topic has cropped up in the general context of third-party repos 
publishing artifacts with org.apache names but vendor-specific suffixes (e.g. 
org.apache.hadoop/hadoop-common.5.3-cdh.jar).

Some people were pretty unhappy about this, but the conclusion reached was 
"maven doesn't let you do anything else and still let downstream people use 
it". Furthermore, as all ASF releases are nominally the source releases *not the 
binaries*, you can look at the POMs and say "we've released source code 
designed to publish artifacts to repos —this is 'use as intended'".

People are also free to cut their own full project distributions, etc. For 
example, I stick up the binaries of Windows builds independent of the ASF 
releases; these were originally just those from HDP on Windows installs, now I 
check out the commit of the specific ASF release on a Windows 2012 VM, do the 
build, and copy the binaries. Free for all to use. But I do suspect that the ASF 
legal protections get a bit blurred here. These aren't ASF binaries, but 
binaries built directly from unmodified ASF releases.

In contrast to sticking stuff into a github repo, the moved artifacts cannot be 
published as org.apache artifacts on maven central. That's non-negotiable as far 
as the ASF are concerned. The process for releasing ASF artifacts there goes 
downstream of the ASF public release process: you stage the artifacts, they are 
part of the vote process, everything with org.apache goes through it.

That said: there is nothing to stop a set of shell org.apache artifacts being 
written which do nothing but contain transitive dependencies on artifacts in 
different groups, such as org.spark-project. The shells would be released by 
the ASF; they pull in the new stuff. And, therefore, it'd be possible to build 
a spark-assembly with the files. (I'm ignoring a loop in the build DAG here; 
playing with git submodules would let someone eliminate this by adding the 
removed libraries under a modified project.) A sketch of such a shell module 
is below.
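
To make the shell-artifact idea concrete, here is a minimal sketch in sbt form; the 
module, group and version names are made up for illustration, and this is not an 
actual proposal for the Spark build:

// Hypothetical sbt definition of a "shell" artifact: no code of its own, it only
// declares a transitive dependency on a module published under a different group.
lazy val sparkStreamingKafkaShell = (project in file("external/kafka-shell"))
  .settings(
    organization := "org.apache.spark",  // released by the ASF
    name := "spark-streaming-kafka-shell",
    libraryDependencies +=
      "org.spark-project" %% "spark-streaming-kafka" % "2.0.0"  // the moved code
  )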

I think there might be some issues related to package names; you could make a case 
for having public APIs with the original names —they're the API, after all, and 
that's exactly what Apache Harmony did with the java.* packages.


Thanks,
Hari

On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan 
> wrote:
I am not referring to code edits - but to migrating submodules and
code currently in Apache Spark to 'outside' of it.
If I understand correctly, assets from Apache Spark are being moved
out of it into thirdparty external repositories - not owned by Apache.

At a minimum, dev@ discussion (like this one) should be initiated.
As 

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Hari Shreedharan
I have worked with various ASF projects for 4+ years now. Sure, ASF
projects can delete code as they feel fit. But this is the first time I
have really seen code being "moved out" of a project without discussion. I
am sure you can do this without violating ASF policy, but the explanation
for that would be convoluted (someone decided to make a copy and then the
ASF project deleted it?).

Also, moving the code out would break compatibility. AFAIK, there is no way
to push org.apache.* artifacts directly to maven central. That happens via
mirroring from the ASF maven repos. Even if you could somehow directly
push the artifacts to mvn, you really can push to org.apache.* groups only
if you are part of the repo and acting as an agent of that project (which
in this case would be Apache Spark). Once you move the code out, even a
committer/PMC member would not be representing the ASF when pushing the
code. I am not sure if there is a way to fix this issue.


Thanks,
Hari

On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan 
wrote:

> I am not referring to code edits - but to migrating submodules and
> code currently in Apache Spark to 'outside' of it.
> If I understand correctly, assets from Apache Spark are being moved
> out of it into thirdparty external repositories - not owned by Apache.
>
> At a minimum, dev@ discussion (like this one) should be initiated.
> As PMC is responsible for the project assets (including code), signoff
> is required for it IMO.
>
> More experienced Apache members might opine better in case I got it
> wrong!
>
>
> Regards,
> Mridul
>
>
> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger 
> wrote:
> > Why would a PMC vote be necessary on every code deletion?
> >
> > There was a Jira and pull request discussion about the submodules that
> > have been removed so far.
> >
> > https://issues.apache.org/jira/browse/SPARK-13843
> >
> > There's another ongoing one about Kafka specifically
> >
> > https://issues.apache.org/jira/browse/SPARK-13877
> >
> >
> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan 
> wrote:
> >>
> >> I was not aware of a discussion in Dev list about this - agree with
> most of
> >> the observations.
> >> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>
> >> Regards
> >> Mridul
> >>
> >>
> >>
> >> On Thursday, March 17, 2016, Marcelo Vanzin 
> wrote:
> >>>
> >>> Hello all,
> >>>
> >>> Recently a lot of the streaming backends were moved to a separate
> >>> project on github and removed from the main Spark repo.
> >>>
> >>> While I think the idea is great, I'm a little worried about the
> >>> execution. Some concerns were already raised on the bug mentioned
> >>> above, but I'd like to have a more explicit discussion about this so
> >>> things don't fall through the cracks.
> >>>
> >>> Mainly I have three concerns.
> >>>
> >>> i. Ownership
> >>>
> >>> That code used to be run by the ASF, but now it's hosted in a github
> >>> repo owned not by the ASF. That sounds a little sub-optimal, if not
> >>> problematic.
> >>>
> >>> ii. Governance
> >>>
> >>> Similar to the above; who has commit access to the above repos? Will
> >>> all the Spark committers, present and future, have commit access to
> >>> all of those repos? Are they still going to be considered part of
> >>> Spark and have release management done through the Spark community?
> >>>
> >>>
> >>> For both of the questions above, why are they not turned into
> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>> a mechanism to do that, without the need to keep the code in the main
> >>> Spark repo, right?
> >>>
> >>> iii. Usability
> >>>
> >>> This is another thing I don't see discussed. For Scala-based code
> >>> things don't change much, I guess, if the artifact names don't change
> >>> (another reason to keep things in the ASF?), but what about python?
> >>> How are pyspark users expected to get that code going forward, since
> >>> it's not in Spark's pyspark.zip anymore?
> >>>
> >>>
> >>> Is there an easy way of keeping these things within the ASF Spark
> >>> project? I think that would be better for everybody.
> >>>
> >>> --
> >>> Marcelo
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Spark build with scala-2.10 fails ?

2016-03-19 Thread Jeff Zhang
Can anyone pass the Spark build with Scala 2.10?


[info] Compiling 475 Scala sources and 78 Java sources to
/Users/jzhang/github/spark/core/target/scala-2.10/classes...
[error]
/Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:30:
object ShuffleServiceHeartbeat is not a member of package
org.apache.spark.network.shuffle.protocol.mesos
[error] import
org.apache.spark.network.shuffle.protocol.mesos.{RegisterDriver,
ShuffleServiceHeartbeat}
[error]^
[error]
/Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:87:
not found: type ShuffleServiceHeartbeat
[error] def unapply(h: ShuffleServiceHeartbeat): Option[String] =
Some(h.getAppId)
[error]^
[error]
/Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:83:
value getHeartbeatTimeoutMs is not a member of
org.apache.spark.network.shuffle.protocol.mesos.RegisterDriver
[error]   Some((r.getAppId, new AppState(r.getHeartbeatTimeoutMs,
System.nanoTime(
[error]^
[error]
/Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:451:
too many arguments for method registerDriverWithShuffleService: (x$1:
String, x$2: Int)Unit
[error]   .registerDriverWithShuffleService(
[error]^
[error] four errors found
[error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s]
-- 
Best Regards

Jeff Zhang


Re: graceful shutdown in external data sources

2016-03-19 Thread Reynold Xin
There is no way to really know that, because users might run queries at any
given point.

BTW why can't your threads be just daemon threads?
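
For reference, a minimal sketch of what daemon threads would mean here: a
ThreadFactory that marks each worker thread as daemon so it cannot keep the
shell's JVM alive on exit. The class name is hypothetical, and whether the Kudu
client's Netty setup can accept such a factory is exactly what is being discussed.

import java.util.concurrent.ThreadFactory
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: threads created here are daemon threads, so an exiting
// shell is not kept waiting for them.
class DaemonThreadFactory(prefix: String) extends ThreadFactory {
  private val counter = new AtomicInteger(0)
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, prefix + "-" + counter.incrementAndGet())
    t.setDaemon(true)
    t
  }
}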



On Wed, Mar 16, 2016 at 3:29 PM, Dan Burkert  wrote:

> Hi Reynold,
>
> Is there any way to know when an executor will no longer have any tasks?
> It seems to me there is no timeout which is appropriate that is long enough
> to ensure that no more tasks will be scheduled on the executor, and short
> enough to be appropriate to wait on during an interactive shell shutdown.
>
> - Dan
>
> On Wed, Mar 16, 2016 at 2:40 PM, Reynold Xin  wrote:
>
>> Maybe just add a watch dog thread and close the connection upon some
>> timeout?
>>
>>
>> On Wednesday, March 16, 2016, Dan Burkert  wrote:
>>
>>> Hi all,
>>>
>>> I'm working on the Spark connector for Apache Kudu, and I've run into an
>>> issue that is a bit beyond my Spark knowledge. The Kudu connector
>>> internally holds an open connection to the Kudu cluster, which
>>> internally holds a Netty context with non-daemon threads. When using the
>>> Spark shell with the Kudu connector, exiting the shell via Ctrl-D causes
>>> the shell to hang, and a thread dump reveals it's waiting for these
>>> non-daemon threads.  Registering a JVM shutdown hook to close the Kudu
>>> client does not do the trick, as it seems that the shutdown hooks are not
>>> fired on Ctrl-D.
>>>
>>> I see that there is an internal Spark API for handling shutdown;
>>> is there something similar available for cleaning up external data sources?
>>>
>>> - Dan
>>>
>>
>


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Imran Rashid
On Thu, Mar 17, 2016 at 2:55 PM, Cody Koeninger  wrote:

> Why would a PMC vote be necessary on every code deletion?
>

Certainly PMC votes are not necessary on *every* code deletion.  I don't
think there is a very clear rule on when such discussion is warranted, just
a soft expectation that committers understand which changes require more
discussion before getting merged.  I believe the only formal requirement
for a PMC vote is when there is a release.  But I think as a community we'd
much rather deal with these issues ahead of time, rather than having
contentious discussions around releases because some are strongly opposed
to changes that have already been merged.

I'm all for the idea of removing these modules in general (for all of the
reasons already mentioned), but it seems that there are important questions
about how the new packages get distributed and how they are managed that
merit further discussion.

I'm somewhat torn on the question of sub-project vs. independent, and
how it's governed.  I think Steve has summarized the tradeoffs very well.  I
do want to emphasize, though, that if they are entirely external from the
ASF, the artifact ids and the package names must change at the very least.


Re: Various forks

2016-03-19 Thread Xiangrui Meng
We made that fork to hide package private classes/members in the generated
Java API doc. Otherwise, the Java API doc is very messy. The patch is to
map all private[*] to the default scope in the generated Java code.
However, this might not be the expected behavior for other packages. So it
didn't get merged into the official genjavadoc repo. The proposal is to
have a flag in genjavadoc settings to enable this mapping, but it was
delayed. This is the JIRA for this issue:
https://issues.apache.org/jira/browse/SPARK-7992. -Xiangrui
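
For anyone unfamiliar with the issue, a small sketch of the kind of member
involved (the object below is made up): Scala package-private visibility has no
direct Java equivalent, so without the patched genjavadoc such members end up
looking public in the generated Java API doc.

package org.apache.spark.mllib.util

// Hypothetical example: visible across org.apache.spark in Scala, but plain
// genjavadoc surfaces it in the public Java API doc; the Spark fork maps it to
// Java default (package) scope so it stays out of the published javadoc.
private[spark] object InternalHelperExample {
  def internalOnly(): Unit = ()
}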

On Tue, Mar 15, 2016 at 10:50 AM Reynold Xin  wrote:

> +Xiangrui
>
> On Tue, Mar 15, 2016 at 10:24 AM, Sean Owen  wrote:
>
>> Picking up this old thread, since we have the same problem updating to
>> Scala 2.11.8
>>
>> https://github.com/apache/spark/pull/11681#issuecomment-196932777
>>
>> We can see the org.spark-project packages here:
>>
>> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.spark-project%22
>>
>> I've forgotten who maintains the custom fork builds, and I don't know
>> the reasons we needed a fork of genjavadoc. Is it still relevant?
>>
>> Heh, there's no plugin for 2.11.8 from the upstream project either anyway:
>>
>> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.typesafe.genjavadoc%22
>>
>> This may be blocked for now
>>
>> On Thu, Jun 25, 2015 at 2:18 PM, Iulian Dragoș
>>  wrote:
>> > Could someone point the source of the Spark-fork used to build
>> > genjavadoc-plugin? Even more important it would be to know the reasoning
>> > behind this fork.
>> >
>> > Ironically, this hinders my attempts at removing another fork, the Spark
>> > REPL fork (and the upgrade to Scala 2.11.7). See here. Since genjavadoc
>> is a
>> > compiler plugin, it is cross-compiled with the full Scala version,
>> meaning
>> > someone needs to publish a new version for 2.11.7.
>> >
>> > Ideally, we'd have a list of all forks maintained by the Spark project.
>> I
>> > know about:
>> >
>> > - org.spark-project/akka
>> > - org.spark-project/hive
>> > - org.spark-project/genjavadoc-plugin
>> >
>> > Are there more? Where are they hosted, and what's the release process
>> around
>> > them?
>> >
>> > thanks,
>> > iulian
>> >
>> > --
>> >
>> > --
>> > Iulian Dragos
>> >
>> > --
>> > Reactive Apps on the JVM
>> > www.typesafe.com
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Hi Steve,

I referenced the ShutdownHookManager in my original message, but it appears
to be an internal-only API.  Looks like it uses a Hadoop equivalent
internally, though, so I'll look into using that.  Good tip about timeouts,
thanks.
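
For what it's worth, a rough sketch of that approach, assuming Hadoop's public
ShutdownHookManager and a hypothetical Closeable client handle; the timeout and
priority values are arbitrary:

import java.io.Closeable
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.util.ShutdownHookManager

def registerBoundedShutdown(client: Closeable, timeoutSec: Long = 10): Unit = {
  ShutdownHookManager.get().addShutdownHook(new Runnable {
    override def run(): Unit = {
      val pool = Executors.newSingleThreadExecutor()
      val f = pool.submit(new Runnable { override def run(): Unit = client.close() })
      try f.get(timeoutSec, TimeUnit.SECONDS) // bound the time spent flushing/closing
      catch { case _: Exception => () }       // give up rather than hang shell exit
      finally pool.shutdownNow()
    }
  }, 50) // priority relative to other hooks; 50 is arbitrary
}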

 - Dan

On Thu, Mar 17, 2016 at 5:02 AM, Steve Loughran 
wrote:

>
> On 16 Mar 2016, at 23:43, Dan Burkert  wrote:
>
> After further thought, I think following both of your suggestions- adding
> a shutdown hook and making the threads non-daemon- may have the result I'm
> looking for.  I'll check and see if there are other reasons not to use
> daemon threads in our networking internals.  More generally though, what do
> y'all think about having Spark shutdown or close RelationProviders once
> they are not needed?  Seems to me that RelationProviders will often be
> stateful objects with network and/or file resources.  I checked with the C*
> Spark connector, and they jump through a bunch of hoops to handle this
> issue, including shutdown hooks and a ref counted cache.
>
>
> I'd recommend using org.apache.spark.util.ShutdownHookManager as the
> shutdown hook mechanism; it gives you priority ordering over shutdown, and is
> already used in the Yarn AM, DiskBlockManager and elsewhere.
>
>
> One thing to be careful about in shutdown hooks is to shut down in a
> bounded time period even if you can't connect to the far end: do make sure
> there are timeouts on TCP connects. I've hit problems with Hadoop HDFS
> where, if the endpoint isn't configured correctly, the shutdown hook
> blocks, causing Control-C/kill interrupts to appear to hang, and of
> course a second kill just deadlocks on the original sync. (To deal with
> that, I ended up recognising a 2nd Ctrl-C interrupt as a trigger for
> calling System.halt(), which bails out the JVM without trying to invoke
> those hooks.)
>
>
> - Dan
>
> On Wed, Mar 16, 2016 at 4:04 PM, Dan Burkert  wrote:
>
>> Thanks for the replies, responses inline:
>>
>> On Wed, Mar 16, 2016 at 3:36 PM, Reynold Xin  wrote:
>>
>>> There is no way to really know that, because users might run queries at
>>> any given point.
>>>
>>> BTW why can't your threads be just daemon threads?
>>>
>>
>> The bigger issue is that we require the Kudu client to be manually closed
>> so that it can do necessary cleanup tasks.  During shutdown the client
>> closes the non-daemon threads, but more importantly, it flushes any
>> outstanding batched writes to the server.
>>
>> On Wed, Mar 16, 2016 at 3:35 PM, Hamel Kothari 
>>  wrote:
>>
>>> Dan,
>>>
>>> You could probably just register a JVM shutdown hook yourself:
>>> https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread)
>>>
>>>
>>> This at least would let you close the connections when the application
>>> as a whole has completed (in standalone) or when your executors have been
>>> killed (in YARN). I think that's as close as you'll get to knowing when an
>>> executor will no longer have any tasks in the current state of the world.
>>>
>>
>> The Spark shell will not run shutdown hooks after a Ctrl-D if there are
>> non-daemon threads running.  You can test this with the following input to
>> the shell:
>>
>> new Thread(new Runnable { override def run() = { while (true) {
>> println("running"); Thread.sleep(1) } } }).start()
>> Runtime.getRuntime.addShutdownHook(new Thread(new Runnable { override def
>> run() = println("shutdown fired") }))
>>
>> - Dan
>>
>>
>>
>>>
>>> On Wed, Mar 16, 2016 at 3:29 PM, Dan Burkert  wrote:
>>>
 Hi Reynold,

 Is there any way to know when an executor will no longer have any
 tasks?  It seems to me there is no timeout which is appropriate that is
 long enough to ensure that no more tasks will be scheduled on the executor,
 and short enough to be appropriate to wait on during an interactive shell
 shutdown.

 - Dan

 On Wed, Mar 16, 2016 at 2:40 PM, Reynold Xin 
 wrote:

> Maybe just add a watch dog thread and close the connection upon some
> timeout?
>
>
> On Wednesday, March 16, 2016, Dan Burkert  wrote:
>
>> Hi all,
>>
>> I'm working on the Spark connector for Apache Kudu, and I've run into
>> an issue that is a bit beyond my Spark knowledge. The Kudu connector
>> internally holds an open connection to the Kudu cluster, which
>> internally holds a Netty context with non-daemon threads. When using the
>> Spark shell with the Kudu connector, exiting the shell via Ctrl-D causes
>> the shell to hang, and a thread dump reveals it's waiting for these
>> non-daemon threads.  Registering a JVM shutdown hook to close the Kudu
>> client does not do 

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Cody Koeninger
There's a difference between "without discussion" and "without as much
discussion as I would have liked to have a chance to notice it".
There are plenty of PRs that got merged before I noticed them that I
would rather have not gotten merged.

As far as group / artifact name compatibility, at least in the case of
Kafka we need different artifact names anyway, and people are going to
have to make changes to their build files for spark 2.0 anyway.   As
far as keeping the actual classes in org.apache.spark to not break
code despite the group name being different, I don't know whether that
would be enforced by maven central, just looked at as poor taste, or
ASF suing for trademark violation :)

For people who would rather the problem be solved with official asf
subprojects, which committers are volunteering to help do that work?
Reynold already said he doesn't want to mess with that overhead.

I'm fine with continuing to help work on the Kafka integration
wherever it ends up, I'd just like the color of the bikeshed to get
decided so we can build a decent bike...


On Thu, Mar 17, 2016 at 3:51 PM, Hari Shreedharan
 wrote:
> I have worked with various ASF projects for 4+ years now. Sure, ASF projects
> can delete code as they feel fit. But this is the first time I have really
> seen code being "moved out" of a project without discussion. I am sure you
> can do this without violating ASF policy, but the explanation for that would
> be convoluted (someone decided to make a copy and then the ASF project
> deleted it?).
>
> Also, moving the code out would break compatibility. AFAIK, there is no way
> to push org.apache.* artifacts directly to maven central. That happens via
> mirroring from the ASF maven repos. Even if you could somehow directly
> push the artifacts to mvn, you really can push to org.apache.* groups only
> if you are part of the repo and acting as an agent of that project (which in
> this case would be Apache Spark). Once you move the code out, even a
> committer/PMC member would not be representing the ASF when pushing the
> code. I am not sure if there is a way to fix this issue.
>
>
> Thanks,
> Hari
>
> On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan 
> wrote:
>>
>> I am not referring to code edits - but to migrating submodules and
>> code currently in Apache Spark to 'outside' of it.
>> If I understand correctly, assets from Apache Spark are being moved
>> out of it into thirdparty external repositories - not owned by Apache.
>>
>> At a minimum, dev@ discussion (like this one) should be initiated.
>> As PMC is responsible for the project assets (including code), signoff
>> is required for it IMO.
>>
>> More experienced Apache members might opine better in case I got it
>> wrong!
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger 
>> wrote:
>> > Why would a PMC vote be necessary on every code deletion?
>> >
>> > There was a Jira and pull request discussion about the submodules that
>> > have been removed so far.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-13843
>> >
>> > There's another ongoing one about Kafka specifically
>> >
>> > https://issues.apache.org/jira/browse/SPARK-13877
>> >
>> >
>> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan 
>> > wrote:
>> >>
>> >> I was not aware of a discussion in Dev list about this - agree with
>> >> most of
>> >> the observations.
>> >> In addition, I did not see PMC signoff on moving (sub-)modules out.
>> >>
>> >> Regards
>> >> Mridul
>> >>
>> >>
>> >>
>> >> On Thursday, March 17, 2016, Marcelo Vanzin 
>> >> wrote:
>> >>>
>> >>> Hello all,
>> >>>
>> >>> Recently a lot of the streaming backends were moved to a separate
>> >>> project on github and removed from the main Spark repo.
>> >>>
>> >>> While I think the idea is great, I'm a little worried about the
>> >>> execution. Some concerns were already raised on the bug mentioned
>> >>> above, but I'd like to have a more explicit discussion about this so
>> >>> things don't fall through the cracks.
>> >>>
>> >>> Mainly I have three concerns.
>> >>>
>> >>> i. Ownership
>> >>>
>> >>> That code used to be run by the ASF, but now it's hosted in a github
>> >>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>> >>> problematic.
>> >>>
>> >>> ii. Governance
>> >>>
>> >>> Similar to the above; who has commit access to the above repos? Will
>> >>> all the Spark committers, present and future, have commit access to
>> >>> all of those repos? Are they still going to be considered part of
>> >>> Spark and have release management done through the Spark community?
>> >>>
>> >>>
>> >>> For both of the questions above, why are they not turned into
>> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>> >>> a mechanism to do that, without the need to keep the code in the main
>> >>> Spark repo, right?
>> >>>
>> >>> 

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
On Thu, Mar 17, 2016 at 12:01 PM, Cody Koeninger  wrote:
> i.  An ASF project can clearly decide that some of its code is no
> longer worth maintaining and delete it.  This isn't really any
> different. It's still apache licensed so ultimately whoever wants the
> code can get it.

Absolutely. But I don't remember this being discussed either way. Was
the intention, as you mention later, just to decouple the release of
those components from the main Spark release, or to completely disown
that code?

If the latter, is the ASF ok with it still retaining the current
package and artifact names? Changing those would break backwards
compatibility. Which is why I believe that keeping them as a
sub-project, even if their release cadence is much slower, would be a
better solution for both developers and users.

> ii.  I think part of the rationale is to not tie release management to
> Spark, so it can proceed on a schedule that makes sense.  I'm fine
> with helping out with release management for the Kafka subproject, for
> instance.  I agree that practical governance questions need to be
> worked out.
>
> iii.  How is this any different from how python users get access to
> any other third party Spark package?

True, but that requires the modules to be published somewhere, not
just to live as a bunch of .py files in a github repo. Basically, I'm
worried that there's work to be done to keep those modules working in
this new environment - how to build, test, and publish things, remove
potential uses of internal Spark APIs, just to cite a couple of
things.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
On Fri, Mar 18, 2016 at 10:09 AM, Jean-Baptiste Onofré  
wrote:
> a project can have multiple repos: it's what we have in ServiceMix, in
> Karaf.
> For the *-extra on github, if the code has been in the ASF, the PMC members
> have to vote to move the code on *-extra.

That's good to know. To me that sounds like the best solution.

I've heard that top-level projects have some requirements with regard
to having active development, and these components probably will not see
that much activity. And top-level does sound like too much bureaucracy
for this.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: graceful shutdown in external data sources

2016-03-19 Thread Reynold Xin
Maybe just add a watch dog thread and close the connection upon some
timeout?
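
For what it's worth, a rough sketch of that watchdog idea: a daemon thread that
closes the connection once it has been idle past a timeout. The lastUsedNanos and
close callbacks are hypothetical stand-ins for whatever state the connector keeps.

import java.util.concurrent.TimeUnit

def startWatchdog(idleTimeout: Long, unit: TimeUnit)
                 (lastUsedNanos: () => Long, close: () => Unit): Thread = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      val timeoutNanos = unit.toNanos(idleTimeout)
      while (System.nanoTime() - lastUsedNanos() < timeoutNanos) {
        Thread.sleep(1000) // poll once a second
      }
      close() // release the non-daemon resources so the JVM can exit
    }
  })
  t.setDaemon(true) // the watchdog itself must not keep the JVM alive
  t.start()
  t
}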

On Wednesday, March 16, 2016, Dan Burkert  wrote:

> Hi all,
>
> I'm working on the Spark connector for Apache Kudu, and I've run into an
> issue that is a bit beyond my Spark knowledge. The Kudu connector
> internally holds an open connection to the Kudu cluster, which
> internally holds a Netty context with non-daemon threads. When using the
> Spark shell with the Kudu connector, exiting the shell via Ctrl-D causes
> the shell to hang, and a thread dump reveals it's waiting for these
> non-daemon threads.  Registering a JVM shutdown hook to close the Kudu
> client does not do the trick, as it seems that the shutdown hooks are not
> fired on Ctrl-D.
>
> I see that there is an internal Spark API for handling shutdown;
> is there something similar available for cleaning up external data sources?
>
> - Dan
>


SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
Hello all,

Recently a lot of the streaming backends were moved to a separate
project on github and removed from the main Spark repo.

While I think the idea is great, I'm a little worried about the
execution. Some concerns were already raised on the bug mentioned
above, but I'd like to have a more explicit discussion about this so
things don't fall through the cracks.

Mainly I have three concerns.

i. Ownership

That code used to be run by the ASF, but now it's hosted in a github
repo owned not by the ASF. That sounds a little sub-optimal, if not
problematic.

ii. Governance

Similar to the above; who has commit access to the above repos? Will
all the Spark committers, present and future, have commit access to
all of those repos? Are they still going to be considered part of
Spark and have release management done through the Spark community?


For both of the questions above, why are they not turned into
sub-projects of Spark and hosted on the ASF repos? I believe there is
a mechanism to do that, without the need to keep the code in the main
Spark repo, right?

iii. Usability

This is another thing I don't see discussed. For Scala-based code
things don't change much, I guess, if the artifact names don't change
(another reason to keep things in the ASF?), but what about python?
How are pyspark users expected to get that code going forward, since
it's not in Spark's pyspark.zip anymore?


Is there an easy way of keeping these things within the ASF Spark
project? I think that would be better for everybody.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
No, I didn't yet - feel free to create a JIRA.



On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann 
wrote:

> Hi Nick,
>
> Thanks again for your help with this. Did you create a ticket in JIRA for
> investigating sparse models in LR and / or multivariate summariser? If so,
> can you give me the issue key(s)? If not, would you like me to create these
> tickets?
>
> I'm going to look into this some more and see if I can figure out how to
> implement these fixes.
>
> ~Daniel Siegmann
>
> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath 
> wrote:
>
>> Also adding dev list in case anyone else has ideas / views.
>>
>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath 
>> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> I think Spark can certainly meet your use case when your data size
>>> scales up, as the actual model dimension is very small - you will need to
>>> use those indexers or some other mapping mechanism.
>>>
>>> There is ongoing work for Spark 2.0 to make it easier to use models
>>> outside of Spark - also see PMML export (I think mllib logistic regression
>>> is supported but I have to check that). That will help use spark models in
>>> serving environments.
>>>
>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>>> also a ticket for multivariate summariser (though I don't think in practice
>>> there will be much to gain).
>>>
>>>
>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>>> daniel.siegm...@teamaol.com> wrote:
>>>
 Thanks for the pointer to those indexers, those are some good examples.
 A good way to go for the trainer and any scoring done in Spark. I will
 definitely have to deal with scoring in non-Spark systems though.

 I think I will need to scale up beyond what single-node liblinear can
 practically provide. The system will need to handle much larger sub-samples
 of this data (and other projects might be larger still). Additionally, the
 system needs to train many models in parallel (hyper-parameter optimization
 with n-fold cross-validation, multiple algorithms, different sets of
 features).

 Still, I suppose we'll have to consider whether Spark is the best
 system for this. For now though, my job is to see what can be achieved with
 Spark.



 On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> Ok, I think I understand things better now.
>
> For Spark's current implementation, you would need to map those
> features as you mention. You could also use say StringIndexer ->
> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
> the mapping and training (e.g.
> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
> Pipeline supports persistence.
>
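
(Not from the original thread: a minimal sketch of the indexer -> encoder ->
logistic regression pipeline being described, assuming a DataFrame df with
hypothetical columns "thingId" (categorical) and "label".)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")
val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
// val model = pipeline.fit(df)  // df: the training DataFrame with the columns above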
> But it depends on your scoring use case too - a Spark pipeline can be
> saved and then reloaded, but you need all of Spark dependencies in your
> serving app which is often not ideal. If you're doing bulk scoring 
> offline,
> then it may suit.
>
> Honestly though, for that data size I'd certainly go with something
> like Liblinear :) Spark will ultimately scale better with # training
> examples for very large scale problems. However there are definitely
> limitations on model dimension and sparse weight vectors currently. There
> are potential solutions to these but they haven't been implemented as yet.
>
> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
> daniel.siegm...@teamaol.com> wrote:
>
>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Would you mind letting us know the # training examples in the
>>> datasets? Also, what do your features look like? Are they text, 
>>> categorical
>>> etc? You mention that most rows only have a few features, and all rows
>>> together have a few 10,000s features, yet your max feature value is 20
>>> million. How are your constructing your feature vectors to get a 20 
>>> million
>>> size? The only realistic way I can see this situation occurring in 
>>> practice
>>> is with feature hashing (HashingTF).
>>>
>>
>> The sub-sample I'm currently training on is about 50K rows, so ...
>> small.
>>
>> The features causing this issue are numeric (int) IDs for ... lets
>> call it "Thing". For each Thing in the record, we set the feature
>> Thing.id to a value of 1.0 in our vector (which is of course a
>> SparseVector). I'm not sure how IDs are generated for Things, but
>> they can be large numbers.
>>
>> The largest Thing ID is around 20 million, so that ends up being the
>> size of the vector. But in fact there are fewer than 10,000 unique Thing
>> IDs in this data. The mean number of features per record in what I'm

Re: pull request template

2016-03-19 Thread Bryan Cutler
+1 on Marcelo's comments.  It would be nice not to pollute commit messages
with the instructions, because some people might forget to remove them.
Nobody has suggested removing the template.

On Tue, Mar 15, 2016 at 3:59 PM, Joseph Bradley 
wrote:
> +1 for keeping the template
>
> I figure any template will require conscientiousness & enforcement.
>
> On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen  wrote:
>>
>> The template is a great thing as it gets instructions even more right
>> in front of people.
>>
>> Another idea is to just write a checklist of items, like "did you
>> describe your changes? did you test? etc." with instructions to delete
>> the text and replace with a description. This keeps the boilerplate
>> titles out of the commit message.
>>
>> The special character and post processing just takes that a step further.
>>
>> On Sat, Mar 12, 2016 at 1:31 AM, Marcelo Vanzin 
>> wrote:
>> > Hey all,
>> >
>> > Just wanted to ask: how do people like this new template?
>> >
>> > While I think it's great to have instructions for people to write
>> > proper commit messages, I think the current template has a few
>> > downsides.
>> >
>> > - I tend to write verbose commit messages already when I'm preparing a
>> > PR. Now when I open the PR I have to edit the summary field to remove
>> > all the boilerplate.
>> > - The template ends up in the commit messages, and sometimes people
>> > forget to remove even the instructions.
>> >
>> > Instead, what about changing the template a bit so that it just has
>> > instructions prepended with some character, and have those lines
>> > removed by the merge_spark_pr.py script? We could then even throw in a
>> > link to the wiki as Sean suggested since it won't end up in the final
>> > commit messages.
>> >
>> >
>> > On Fri, Feb 19, 2016 at 11:53 AM, Reynold Xin 
>> > wrote:
>> >> We can add that too - just need to figure out a good way so people
>> >> don't
>> >> leave a lot of the unnecessary "guideline" messages in the template.
>> >>
>> >> The contributing guide is great, but unfortunately it is not as
>> >> noticeable
>> >> and is often ignored. It's good to have this full-fledged contributing
>> >> guide, and then have a very lightweight version of that in the form of
>> >> templates to force contributors to think about all the important
>> >> aspects
>> >> outlined in the contributing guide.
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Feb 19, 2016 at 2:36 AM, Sean Owen  wrote:
>> >>>
>> >>> All that seems fine. All of this is covered in the contributing wiki,
>> >>> which is linked from CONTRIBUTING.md (and should be from the
>> >>> template), but people don't seem to bother reading it. I don't mind
>> >>> duplicating some key points, and even a more explicit exhortation to
>> >>> read the whole wiki, before considering opening a PR. We spend way too
>> >>> much time asking people to fix things they should have taken 60
>> >>> seconds to do correctly in the first place.
>> >>>
>> >>> On Fri, Feb 19, 2016 at 10:33 AM, Iulian Dragoș
>> >>>  wrote:
>> >>> > It's a good idea. I would add in there the spec for the PR title. I
>> >>> > always
>> >>> > get wrong the order between Jira and component.
>> >>> >
>> >>> > Moreover, CONTRIBUTING.md is also lacking them. Any reason not to
>> >>> > add it
>> >>> > there? I can open PRs for both, but maybe you want to keep that
info
>> >>> > on
>> >>> > the
>> >>> > wiki instead.
>> >>> >
>> >>> > iulian
>> >>> >
>> >>> > On Thu, Feb 18, 2016 at 4:18 AM, Reynold Xin 
>> >>> > wrote:
>> >>> >>
>> >>> >> Github introduced a new feature today that allows projects to define
>> >>> >> templates for pull requests. I pushed a very simple template to the
>> >>> >> repository:
>> >>> >>
>> >>> >> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>> >>> >>
>> >>> >>
>> >>> >> Over time I think we can see how this works and perhaps add a small
>> >>> >> checklist to the pull request template so contributors are reminded
>> >>> >> every time they submit a pull request the important things to do in
>> >>> >> a pull request (e.g. having proper tests).
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> ## What changes were proposed in this pull request?
>> >>> >>
>> >>> >> (Please fill in changes proposed in this fix)
>> >>> >>
>> >>> >>
>> >>> >> ## How was this patch tested?
>> >>> >>
>> >>> >> (Please explain how this patch was tested. E.g. unit tests,
>> >>> >> integration
>> >>> >> tests, manual tests)
>> >>> >>
>> >>> >>
>> >>> >> (If this patch involves UI changes, please attach a screenshot;
>> >>> >> otherwise,
>> >>> >> remove this)
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> >
>> >>> > --
>> >>> > Iulian Dragos
>> >>> >
>> >>> > --
>> >>> > Reactive Apps on the JVM
>> >>> > 

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Mridul Muralidharan
I was not aware of a discussion in Dev list about this - agree with most of
the observations.
In addition, I did not see PMC signoff on moving (sub-)modules out.

Regards
Mridul


On Thursday, March 17, 2016, Marcelo Vanzin  wrote:

> Hello all,
>
> Recently a lot of the streaming backends were moved to a separate
> project on github and removed from the main Spark repo.
>
> While I think the idea is great, I'm a little worried about the
> execution. Some concerns were already raised on the bug mentioned
> above, but I'd like to have a more explicit discussion about this so
> things don't fall through the cracks.
>
> Mainly I have three concerns.
>
> i. Ownership
>
> That code used to be run by the ASF, but now it's hosted in a github
> repo owned not by the ASF. That sounds a little sub-optimal, if not
> problematic.
>
> ii. Governance
>
> Similar to the above; who has commit access to the above repos? Will
> all the Spark committers, present and future, have commit access to
> all of those repos? Are they still going to be considered part of
> Spark and have release management done through the Spark community?
>
>
> For both of the questions above, why are they not turned into
> sub-projects of Spark and hosted on the ASF repos? I believe there is
> a mechanism to do that, without the need to keep the code in the main
> Spark repo, right?
>
> iii. Usability
>
> This is another thing I don't see discussed. For Scala-based code
> things don't change much, I guess, if the artifact names don't change
> (another reason to keep things in the ASF?), but what about python?
> How are pyspark users expected to get that code going forward, since
> it's not in Spark's pyspark.zip anymore?
>
>
> Is there an easy way of keeping these things within the ASF Spark
> project? I think that would be better for everybody.
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>


PySpark API divergence + improving pandas interoperability

2016-03-19 Thread Wes McKinney
hi everyone,

I've recently gotten moving on solving some of the low-level data
interoperability problems between Python's NumPy-focused scientific
computing and data libraries like pandas and the rest of the big data
ecosystem, Spark being a very important part of that.

One of the major efforts here is creating a unified data access layer
for pandas users using Apache Arrow as the structured data exchange
medium (read more here:
http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
"thunderbolt port"  (to make an analogy) to Spark for moving data from
Spark SQL to pandas much more efficiently than the current
serialization scheme. If anyone wants to be a partner in crime on
this, feel free to reach out! I'll be dropping the Arrow
memory<->pandas conversion code in the next couple weeks.

As I'm looking more at the implementation details and API design of
PySpark, I note that it has been intended to have near 1-1 parity with
the Scala API, enabling developers to jump between APIs without a lot
of cognitive dissonance (you lose type information in Python, but
c'est la vie). Much of PySpark appears to be wrapping Scala / Java API
calls with py4j (much as many Python libraries wrap C/C++ libraries in
an analogous fashion).

In the long run, I'm concerned this may become problematic as users'
expectations about the semantics of interacting with the data may not
be compatible with the behavior of the Spark Scala API (particularly
the API design and semantics of Spark SQL and Datasets). As the Spark
user base grows, so, too, will the user needs, particularly in the
more accessible APIs (Python / R). I expect the Scala users tend to be
a more sophisticated audience with a more software engineering /
computer science tilt.

With a "big picture" goal of bringing about a semantic convergence
between big data and small data in a certain subset of scalable
computations, I am curious what is the Spark development community's
attitude towards efforts to achieve 1-1 PySpark API parity (with a
slight API lag as new features show up strictly in Scala before in
Python), particularly in the strictly semantic realm of data
interactions (at the end of the day, code has to move around bits
someplace). Here is an illustrative, albeit somewhat trivial example
of what I'm talking about:

https://issues.apache.org/jira/browse/SPARK-13943

If closer semantic compatibility with existing software in R and
Python is not a high priority, that is a completely reasonable answer.

Another thought is treating PySpark as the place where the "rubber
meets the road" -- the point of contact for any Python developers
building applications with Spark. This would leave library developers
aiming to create higher level user experiences (e.g. emulating pandas
more closely) and thus use PySpark as an implementation tool that
users otherwise do not directly interact with. But this is seemingly
at odds with the efforts to make Spark DataFrames behave in an
pandas/R-like fashion.

The nearest analogue to this I would give is the relationship between
pandas and NumPy in the earlier days of pandas (version 0.7 and
earlier). pandas relies on NumPy data structures and many of its array
algorithms. Early on I was lightly criticized in the community for
creating pandas as a separate project rather than contributing patches
to NumPy, but over time it has proven to have been the right decision,
as domain specific needs can evolve in a decoupled way without onerous
API design compromises.

very best,
Wes

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
I tried again this morning :

$ wget
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
--2016-03-18 07:55:30--
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
Resolving s3.amazonaws.com... 54.231.19.163
...
$ tar zxf spark-1.6.1-bin-hadoop2.6.tgz

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust 
wrote:

> Patrick reuploaded the artifacts, so it should be fixed now.
> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" 
> wrote:
>
>> Looks like the other packages may also be corrupt. I’m getting the same
>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>
>> Nick
>> ​
>>
>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu  wrote:
>>
>>> On Linux, I got:
>>>
>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> gzip: stdin: unexpected end of file
>>> tar: Unexpected EOF in archive
>>> tar: Unexpected EOF in archive
>>> tar: Error is not recoverable: exiting now
>>>
>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>

 https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz

 Does anyone else have trouble unzipping this? How did this happen?

 What I get is:

 $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
 gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
 gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed

 Seems like a strange type of problem to come across.

 Nick
 ​

>>>
>>>


Re: Can we remove private[spark] from Metrics Source and SInk traits?

2016-03-19 Thread Pete Robbins
There are several open Jiras to add new Sinks

OpenTSDB https://issues.apache.org/jira/browse/SPARK-12194
StatsD https://issues.apache.org/jira/browse/SPARK-11574
Kafka https://issues.apache.org/jira/browse/SPARK-13392

Some have PRs from 2015, so I'm assuming there is no desire to
integrate these into core Spark. Opening up the Sink/Source interfaces
would at least allow these to exist somewhere such as spark-packages
without having to pollute the o.a.s namespace.
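
For reference, a sketch of what such a third-party sink currently has to look
like; it assumes the private[spark] Sink trait and the (Properties,
MetricRegistry, SecurityManager) constructor that Spark's MetricsSystem invokes
reflectively, as far as I can tell from the current codebase. The class and
property names are made up, and details may differ between Spark versions.

package org.apache.spark.metrics.sink  // forced into o.a.s because Sink is private[spark]

import java.util.Properties
import com.codahale.metrics.MetricRegistry
import org.apache.spark.SecurityManager

private[spark] class ElasticsearchSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  private val host = property.getProperty("host", "localhost")

  override def start(): Unit = { /* create and start a reporter pushing to `host` */ }
  override def stop(): Unit = { /* stop and flush the reporter */ }
  override def report(): Unit = { /* force a one-off report */ }
}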


On Sat, 19 Mar 2016 at 13:05 Gerard Maas  wrote:

> +1
> On Mar 19, 2016 08:33, "Pete Robbins"  wrote:
>
>> This seems to me to be unnecessarily restrictive. These are very useful
>> extension points for adding 3rd party sources and sinks.
>>
>> I intend to make an Elasticsearch sink available on spark-packages but
>> this will require a single class, the sink, to be in the org.apache.spark
>> package tree. I could submit the package as a PR to the Spark codebase, and
>> I'd be happy to do that but it could be a completely separate add-on.
>>
>> There are similar issues with writing a 3rd party metrics source which
>> may not be of interest to the community at large so would probably not
>> warrant inclusion in the Spark codebase.
>>
>> Any thoughts?
>>
>


Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz

Does anyone else have trouble unzipping this? How did this happen?

What I get is:

$ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed

Seems like a strange type of problem to come across.

Nick
​


SparkContext.stop() takes too long to complete

2016-03-19 Thread Nezih Yigitbasi
Hi Spark experts,
I am using Spark 1.5.2 on YARN with dynamic allocation enabled. I see in
the driver/application master logs that the app is marked as SUCCEEDED and
then SparkContext stop is called. However, this stop sequence takes > 10
minutes to complete, and the YARN resource manager, not having received a
heartbeat within the last 10 minutes, then kills the application master.
Any ideas about what may be going on?

Here are the relevant logs:

*16/03/18 21:26:58 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/03/18 21:26:58 INFO spark.SparkContext: Invoking stop() from shutdown hook*
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/03/18 21:26:58 INFO ui.SparkUI: Stopped Spark web UI at http://10.143.240.240:52706
16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 1135
16/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 208 executor(s).
16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1135.
16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1135 because it has been idle for 60 seconds (new desired total will be 208)
16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 1123
16/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 207 executor(s).
16/03/18 

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Adam Kocoloski

> On Mar 19, 2016, at 8:32 AM, Steve Loughran  wrote:
> 
> 
>> On 18 Mar 2016, at 17:07, Marcelo Vanzin  wrote:
>> 
>> Hi Steve, thanks for the write up.
>> 
>> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran  
>> wrote:
>>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs 
>>> to go through incubation. While normally its the incubator PMC which 
>>> sponsors/oversees the incubating project, it doesn't have to be the case: 
>>> the spark project can do it.
>>> 
>>> Also Apache Arrow managed to make it straight to toplevel without that 
>>> process. Given that the spark extras are already ASF source files, you 
>>> could try the same thing, add all the existing committers, then look for 
>>> volunteers to keep things.
>> 
>> Am I to understand from your reply that it's not possible for a single
>> project to have multiple repos?
>> 
> 
> 
> I don't know. there's generally a 1 project -> 1x issue, 1x JIRA.
> 
> but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to 
> that repo, with the special exception of branches (encryption, ipv6) that 
> have their own committers.
> 
> oh, and I know that hadoop site is on SVN, as are other projects, just to 
> integrate with asf site publishing, so you can certainly have 1x git + 1 x svn
> 
> ASF won't normally let you have 1 repo with different bits of the tree having 
> different access rights, so you couldn't open up spark-extras to people with 
> less permissions/rights than others.
> 
> A separate repo will, separate issue tracking helps you isolate stuff

Multiple repositories per project are certainly allowed without incurring the 
overhead of a subproject; Cordova and CouchDB are two projects that have taken 
this approach:

https://github.com/apache?utf8=✓=cordova-
https://github.com/apache?utf8=✓=couchdb-

I believe Cordova also generates independent release artifacts in different 
cycles (e.g. cordova-ios releases independently from cordova-android).

If the goal is to enable a divergent set of committers to spark-extras then an 
independent project makes sense. If you’re just looking to streamline the main 
repo and decouple some of these other streaming “backends” from the normal 
release cycle then there are low impact ways to accomplish this inside a single 
Apache Spark project.

Cheers,

Adam


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Shane Curcuru
Marcelo Vanzin wrote earlier:

> Recently a lot of the streaming backends were moved to a separate
> project on github and removed from the main Spark repo.

Question: why was the code removed from the Spark repo?  What's the harm
in keeping it available here?

The ASF is perfectly happy if anyone wants to fork our code - that's one
of the core tenets of the Apache license.  You just can't take the name
or trademarks, so you may need to change some package names or the like.

So it's fine if some people want to work on the code outside the
project.  But it's puzzling as to why the Spark PMC shouldn't keep the
code in the project as well, even if it might not have the same release
cycles or whatnot.

- Shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Jean-Baptiste Onofré

Hi Marcelo,

a project can have multiple repos: it's what we have in ServiceMix, in 
Karaf.


For the *-extra repos on GitHub, if the code has been in the ASF, the PMC 
members have to vote to move the code to *-extra.


Regards
JB

On 03/18/2016 06:07 PM, Marcelo Vanzin wrote:

Hi Steve, thanks for the write up.

On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran  wrote:

If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to 
go through incubation. While normally its the incubator PMC which 
sponsors/oversees the incubating project, it doesn't have to be the case: the 
spark project can do it.

Also Apache Arrow managed to make it straight to toplevel without that process. 
Given that the spark extras are already ASF source files, you could try the 
same thing, add all the existing committers, then look for volunteers to keep 
things.


Am I to understand from your reply that it's not possible for a single
project to have multiple repos?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-19 Thread Gerard Maas
+1
On Mar 19, 2016 08:33, "Pete Robbins"  wrote:

> This seems to me to be unnecessarily restrictive. These are very useful
> extension points for adding 3rd party sources and sinks.
>
> I intend to make an Elasticsearch sink available on spark-packages but
> this will require a single class, the sink, to be in the org.apache.spark
> package tree. I could submit the package as a PR to the Spark codebase, and
> I'd be happy to do that but it could be a completely separate add-on.
>
> There are similar issues with writing a 3rd party metrics source which may
> not be of interest to the community at large so would probably not warrant
> inclusion in the Spark codebase.
>
> Any thoughts?
>


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran

> On 18 Mar 2016, at 22:24, Marcelo Vanzin  wrote:
> 
> On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann  wrote:
>> So, my comment here is that any code *cannot* be removed from an Apache
>> project if there is a VETO issued which so far I haven't seen, though maybe
>> Marcelo can clarify that.
> 
> No, my intention was not to veto the change. I'm actually for the
> removal of components if the community thinks they don't add much to
> the project. (I'm also not sure I can even veto things, not being a
> PMC member.)
> 
> I mainly wanted to know what was the path forward for those components
> because, with Cloudera's hat on, we care about one of them (streaming
> integration with flume), and we'd prefer if that code remained under
> the ASF umbrella in some way.
> 

I'd be supportive of a spark-extras project; it'd actually be  place to keep 
stuff I've worked on 
 -the yarn ATS 1/1.5 integration
 -that mutant hive JAR which has the consistent kryo dependency and different 
shadings

... etc

There's also the fact that the Twitter streaming is a common example to play 
with; Flume is popular in places too.

If you want to set up a new incubator with a goal of graduating fast, I'd help. 
As a key metric of getting out of incubator is active development, you just 
need to "recruit" contributors and keep them engaged.




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran

> On 18 Mar 2016, at 17:07, Marcelo Vanzin  wrote:
> 
> Hi Steve, thanks for the write up.
> 
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran  
> wrote:
>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs 
>> to go through incubation. While normally its the incubator PMC which 
>> sponsors/oversees the incubating project, it doesn't have to be the case: 
>> the spark project can do it.
>> 
>> Also Apache Arrow managed to make it straight to toplevel without that 
>> process. Given that the spark extras are already ASF source files, you could 
>> try the same thing, add all the existing committers, then look for 
>> volunteers to keep things.
> 
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?
> 


I don't know. there's generally a 1 project -> 1x issue, 1x JIRA.

but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to that 
repo, with the special exception of branches (encryption, ipv6) that have their 
own committers.

oh, and I know that hadoop site is on SVN, as are other projects, just to 
integrate with asf site publishing, so you can certainly have 1x git + 1 x svn

ASF won't normally let you have 1 repo with different bits of the tree having 
different access rights, so you couldn't open up spark-extras to people with 
less permissions/rights than others.

A separate repo will, separate issue tracking helps you isolate stuff

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Steve Loughran

> On 17 Mar 2016, at 21:33, Marcelo Vanzin  wrote:
> 
> Hi Reynold, thanks for the info.
> 
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin  wrote:
>> If one really feels strongly that we should go through all the overhead to
>> setup an ASF subproject for these modules that won't work with the new
>> structured streaming, and want to spearhead to setup separate repos
>> (preferably one subproject per connector), CI, separate JIRA, governance,
>> READMEs, voting, we can discuss that. Until then, I'd keep the github option
>> open because IMHO it is what works the best for end users (including
>> discoverability, issue tracking, release publishing, ...).
> 
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
> 
> From my naive point of view, when I say "sub project" I assume that
> it's a simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the project we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.
> 


If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to 
go through incubation. While normally its the incubator PMC which 
sponsors/oversees the incubating project, it doesn't have to be the case: the 
spark project can do it.


Also Apache Arrow managed to make it straight to toplevel without that process. 
Given that the spark extras are already ASF source files, you could try the 
same thing, add all the existing committers, then look for volunteers to keep 
things.


You'd get
 -a JIRA entry of your own, easy to reassign bugs from SPARK to SPARK-EXTRAS
 -a bit of git
 -ability to set up builds on ASF Jenkins. Regression testing against spark 
nightlies would be invaluable here.
 -the ability to stage and publish through ASF Nexus


-Steve

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
On Linux, I got:

$ tar zxf spark-1.6.1-bin-hadoop2.6.tgz

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>
> Does anyone else have trouble unzipping this? How did this happen?
>
> What I get is:
>
> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>
> Seems like a strange type of problem to come across.
>
> Nick
> ​
>


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Sean Owen
Code can be removed from an ASF project.
That code can live on elsewhere (in accordance with the license)

It can't be presented as part of the official ASF project, like any
other 3rd party project
The package name certainly must change from org.apache.spark

I don't know of a protocol, but common sense dictates a good-faith
effort to offer equivalent access to the code (e.g. interested
committers should probably be repo owners too.)

This differs from "any other code deletion" in that there's an intent
to keep working on the code but outside the project.
More discussion -- like this one -- would have been useful beforehand
but nothing's undoable

Backwards-compatibility is not a good reason for things, because we're
talking about Spark 2.x, and we're already talking about distributing
the code differently.

Is the reason for this change decoupling releases? or changing governance?
Seems like the former, but we don't actually need the latter to achieve that.
There's an argument for a new repo, but this is not an argument for
moving X out of the project per se

I'm sure doing this in the ASF is more overhead, but if changing
governance is a non-goal, there's no choice.
Convenience can't trump that.

Kafka integration is clearly more important than the others.
It seems to need to stay within the project.
However this still leaves a packaging problem to solve, that might
need a new repo. This is orthgonal.


Here's what I think:

1. Leave the moved modules outside the project entirely
  (why not Kinesis though? that one was not made clear)
2. Change package names and make sure it's clearly presented as external
3. Add any committers that want to be repo owners as owners
4. Keep Kafka within the project
5. Add some subproject within the current project as needed to
accomplish distribution goals

On Thu, Mar 17, 2016 at 6:14 PM, Marcelo Vanzin  wrote:
> Hello all,
>
> Recently a lot of the streaming backends were moved to a separate
> project on github and removed from the main Spark repo.
>
> While I think the idea is great, I'm a little worried about the
> execution. Some concerns were already raised on the bug mentioned
> above, but I'd like to have a more explicit discussion about this so
> things don't fall through the cracks.
>
> Mainly I have three concerns.
>
> i. Ownership
>
> That code used to be run by the ASF, but now it's hosted in a github
> repo owned not by the ASF. That sounds a little sub-optimal, if not
> problematic.
>
> ii. Governance
>
> Similar to the above; who has commit access to the above repos? Will
> all the Spark committers, present and future, have commit access to
> all of those repos? Are they still going to be considered part of
> Spark and have release management done through the Spark community?
>
>
> For both of the questions above, why are they not turned into
> sub-projects of Spark and hosted on the ASF repos? I believe there is
> a mechanism to do that, without the need to keep the code in the main
> Spark repo, right?
>
> iii. Usability
>
> This is another thing I don't see discussed. For Scala-based code
> things don't change much, I guess, if the artifact names don't change
> (another reason to keep things in the ASF?), but what about python?
> How are pyspark users expected to get that code going forward, since
> it's not in Spark's pyspark.zip anymore?
>
>
> Is there an easy way of keeping these things within the ASF Spark
> project? I think that would be better for everybody.
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Thanks for the replies, responses inline:

On Wed, Mar 16, 2016 at 3:36 PM, Reynold Xin  wrote:

> There is no way to really know that, because users might run queries at
> any given point.
>
> BTW why can't your threads be just daemon threads?
>

The bigger issue is that we require the Kudu client to be manually closed
so that it can do necessary cleanup tasks.  During shutdown the client
closes the non-daemon threads, but more importantly, it flushes any
outstanding batched writes to the server.

On Wed, Mar 16, 2016 at 3:35 PM, Hamel Kothari 
 wrote:

> Dan,
>
> You could probably just register a JVM shutdown hook yourself:
> https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread
> )
>
> This at least would let you close the connections when the application as
> a whole has completed (in standalone) or when your executors have been
> killed (in YARN). I think that's as close as you'll get to knowing when an
> executor will no longer have any tasks in the current state of the world.
>

The Spark shell will not run shutdown hooks after a Ctrl-D if there are
non-daemon threads running.  You can test this with the following input to
the shell:

new Thread(new Runnable { override def run() = { while (true) {
println("running"); Thread.sleep(1) } } }).start()
Runtime.getRuntime.addShutdownHook(new Thread(new Runnable { override def
run() = println("shutdown fired") }))

- Dan
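
For anyone following along, here is a minimal sketch of the watchdog idea Reynold
suggested earlier in the thread: a daemon thread that closes a client once it has
gone unused for some idle timeout. The closeClient callback, the timeout value and
the touch() convention are illustrative assumptions, and the difficulty of picking
a safe timeout described above still applies.

import java.util.concurrent.atomic.AtomicLong

// Sketch only: wraps an arbitrary close() callback behind an idle-timeout watchdog.
class IdleWatchdog(closeClient: () => Unit, idleTimeoutMs: Long) {
  private val lastUsed = new AtomicLong(System.currentTimeMillis())

  /** Tasks call this whenever they use the underlying client. */
  def touch(): Unit = lastUsed.set(System.currentTimeMillis())

  private val watchdog = new Thread(new Runnable {
    override def run(): Unit = {
      while (System.currentTimeMillis() - lastUsed.get() < idleTimeoutMs) {
        Thread.sleep(idleTimeoutMs / 4)
      }
      closeClient() // flush pending writes and release non-daemon threads
    }
  })
  watchdog.setDaemon(true) // the watchdog itself must not keep the JVM alive
  watchdog.start()
}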



>
> On Wed, Mar 16, 2016 at 3:29 PM, Dan Burkert  wrote:
>
>> Hi Reynold,
>>
>> Is there any way to know when an executor will no longer have any tasks?
>> It seems to me there is no timeout which is appropriate that is long enough
>> to ensure that no more tasks will be scheduled on the executor, and short
>> enough to be appropriate to wait on during an interactive shell shutdown.
>>
>> - Dan
>>
>> On Wed, Mar 16, 2016 at 2:40 PM, Reynold Xin  wrote:
>>
>>> Maybe just add a watch dog thread and closed the connection upon some
>>> timeout?
>>>
>>>
>>> On Wednesday, March 16, 2016, Dan Burkert  wrote:
>>>
 Hi all,

 I'm working on the Spark connector for Apache Kudu, and I've run into
 an issue that is a bit beyond my Spark knowledge. The Kudu connector
 internally holds an open connection to the Kudu cluster, which
 internally holds a Netty context with non-daemon threads. When using the
 Spark shell with the Kudu connector, exiting the shell via Ctrl-D causes
 the shell to hang, and a thread dump reveals it's waiting for these
 non-daemon threads.  Registering a JVM shutdown hook to close the Kudu
 client does not do the trick, as it seems that the shutdown hooks are not
 fired on Ctrl-D.

 I see that there is an internal Spark API for handling shutdown,
 is there something similar available for cleaning up external data sources?

 - Dan

>>>
>>
>


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
Note the non-kafka bug was filed right before the change was pushed.
So there really wasn't any discussion before the decision was made to
remove that code.

I'm just trying to merge both discussions here in the list where it's
a little bit more dynamic than bug updates that end up getting lost in
the noise.

On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger  wrote:
> Why would a PMC vote be necessary on every code deletion?
>
> There was a Jira and pull request discussion about the submodules that
> have been removed so far.
>
> https://issues.apache.org/jira/browse/SPARK-13843
>
> There's another ongoing one about Kafka specifically
>
> https://issues.apache.org/jira/browse/SPARK-13877
>
>
> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan  wrote:
>>
>> I was not aware of a discussion in Dev list about this - agree with most of
>> the observations.
>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>
>> Regards
>> Mridul
>>
>>
>>
>> On Thursday, March 17, 2016, Marcelo Vanzin  wrote:
>>>
>>> Hello all,
>>>
>>> Recently a lot of the streaming backends were moved to a separate
>>> project on github and removed from the main Spark repo.
>>>
>>> While I think the idea is great, I'm a little worried about the
>>> execution. Some concerns were already raised on the bug mentioned
>>> above, but I'd like to have a more explicit discussion about this so
>>> things don't fall through the cracks.
>>>
>>> Mainly I have three concerns.
>>>
>>> i. Ownership
>>>
>>> That code used to be run by the ASF, but now it's hosted in a github
>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>> problematic.
>>>
>>> ii. Governance
>>>
>>> Similar to the above; who has commit access to the above repos? Will
>>> all the Spark committers, present and future, have commit access to
>>> all of those repos? Are they still going to be considered part of
>>> Spark and have release management done through the Spark community?
>>>
>>>
>>> For both of the questions above, why are they not turned into
>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>> a mechanism to do that, without the need to keep the code in the main
>>> Spark repo, right?
>>>
>>> iii. Usability
>>>
>>> This is another thing I don't see discussed. For Scala-based code
>>> things don't change much, I guess, if the artifact names don't change
>>> (another reason to keep things in the ASF?), but what about python?
>>> How are pyspark users expected to get that code going forward, since
>>> it's not in Spark's pyspark.zip anymore?
>>>
>>>
>>> Is there an easy way of keeping these things within the ASF Spark
>>> project? I think that would be better for everybody.
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Fwd: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread satyajit vegesna
Hi,

Scala version: 2.11.7 (had to upgrade the Scala version to enable case
classes to accept more than 22 parameters.)

Spark version: 1.6.1.

PFB pom.xml.

Getting the below error when trying to set up Spark in the IntelliJ IDE:

16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
    at org.apache.spark.util.TimeStampedWeakValueHashMap.<init>(TimeStampedWeakValueHashMap.scala:42)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:298)
    at com.examples.testSparkPost$.main(testSparkPost.scala:27)
    at com.examples.testSparkPost.main(testSparkPost.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 9 more

pom.xml:

http://maven.apache.org/POM/4.0.0;
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd;>
4.0.0
StreamProcess
StreamProcess
0.0.1-SNAPSHOT
${project.artifactId}
This is a boilerplate maven project to start using
Spark in Scala
2010


1.6
1.6
UTF-8
2.10

2.11.7





cloudera-repo-releases
https://repository.cloudera.com/artifactory/repo/




src/main/scala
src/test/scala



maven-assembly-plugin


package

single





jar-with-dependencies





net.alchim31.maven
scala-maven-plugin
3.2.2



compile
testCompile




-dependencyfile

${project.build.directory}/.scala_dependencies








maven-assembly-plugin
2.4.1


jar-with-dependencies




make-assembly
package

single








org.scala-lang
scala-library
${scala.version}


org.mongodb.mongo-hadoop
mongo-hadoop-core
1.4.2


javax.servlet
servlet-api




org.mongodb
mongodb-driver
3.2.2


javax.servlet
servlet-api




org.mongodb
mongodb-driver
3.2.2


javax.servlet
servlet-api




org.apache.spark
spark-streaming_2.10
1.6.1


org.apache.spark
spark-core_2.10
1.6.1


org.apache.spark
spark-sql_2.10
1.6.1


org.apache.hadoop
hadoop-hdfs
2.6.0


org.apache.hadoop
hadoop-auth
2.6.0


org.apache.hadoop

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna 
wrote:

>
>
> Hi,
>
> Scala version: 2.11.7 (had to upgrade the Scala version to enable case
> classes to accept more than 22 parameters.)
>
> Spark version: 1.6.1.
>
> PFB pom.xml.
>
> Getting the below error when trying to set up Spark in the IntelliJ IDE:
>
> 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
> Exception in thread "main" java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class at
> org.apache.spark.util.TimeStampedWeakValueHashMap.(TimeStampedWeakValueHashMap.scala:42)
> at org.apache.spark.SparkContext.(SparkContext.scala:298) at
> com.examples.testSparkPost$.main(testSparkPost.scala:27) at
> com.examples.testSparkPost.main(testSparkPost.scala) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606) at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) Caused
> by: java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class at
> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
> java.security.AccessController.doPrivileged(Native Method) at
> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
> java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more
>
> pom.xml:
>
> http://maven.apache.org/POM/4.0.0; 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/maven-v4_0_0.xsd;>
> 4.0.0
> StreamProcess
> StreamProcess
> 0.0.1-SNAPSHOT
> ${project.artifactId}
> This is a boilerplate maven project to start using Spark in 
> Scala
> 2010
>
> 
> 1.6
> 1.6
> UTF-8
> 2.10
> 
> 2.11.7
> 
>
> 
> 
> 
> cloudera-repo-releases
> https://repository.cloudera.com/artifactory/repo/
> 
> 
>
> 
> src/main/scala
> src/test/scala
> 
> 
> 
> maven-assembly-plugin
> 
> 
> package
> 
> single
> 
> 
> 
> 
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> 
> net.alchim31.maven
> scala-maven-plugin
> 3.2.2
> 
> 
> 
> compile
> testCompile
> 
> 
> 
> 
> -dependencyfile
> 
> ${project.build.directory}/.scala_dependencies
> 
> 
> 
> 
> 
>
> 
> 
> maven-assembly-plugin
> 2.4.1
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> make-assembly
> package
> 
> single
> 
> 
> 
> 
> 
> 
> 
> 
> org.scala-lang
> scala-library
> ${scala.version}
> 
> 
> org.mongodb.mongo-hadoop
> mongo-hadoop-core
> 1.4.2
> 
> 
> javax.servlet
> servlet-api
> 
> 
> 
> 
> org.mongodb
> mongodb-driver
> 3.2.2
> 
> 
> javax.servlet
> servlet-api
> 
> 
> 
> 
> org.mongodb
> mongodb-driver
> 3.2.2
> 
> 
> javax.servlet
>

graceful shutdown in external data sources

2016-03-19 Thread Dan Burkert
Hi all,

I'm working on the Spark connector for Apache Kudu, and I've run into an
issue that is a bit beyond my Spark knowledge. The Kudu connector
internally holds an open connection to the Kudu cluster, which
internally holds a Netty context with non-daemon threads. When using the
Spark shell with the Kudu connector, exiting the shell via Ctrl-D causes
the shell to hang, and a thread dump reveals it's waiting for these
non-daemon threads.  Registering a JVM shutdown hook to close the Kudu
client does not do the trick, as it seems that the shutdown hooks are not
fired on Ctrl-D.

I see that there is an internal Spark API for handling shutdown,
is there something similar available for cleaning up external data sources?

- Dan


Request for comments: Tensorframes, an integration library between TensorFlow and Spark DataFrames

2016-03-19 Thread Tim Hunter
Hello all,

I would like to bring your attention to a small project to integrate
TensorFlow with Apache Spark, called TensorFrames. With this library, you
can map, reduce or aggregate numerical data stored in Spark dataframes
using TensorFlow computation graphs. It is published as a Spark package and
available in this github repository:

https://github.com/tjhunter/tensorframes

More detailed examples can be found in the user guide:

https://github.com/tjhunter/tensorframes/wiki/TensorFrames-user-guide

This is a technical preview at this point. I am looking forward to some
feedback about the current python API if some adventurous users want to try
it out. Of course, contributions are most welcome, for example to fix bugs
or to add support for platforms other than linux-x86_64. It should support
all the most common inputs in dataframes (dense tensors of rank 0, 1, 2 of
ints, longs, floats and doubles).

Please note that this is not an endorsement by Databricks of TensorFlow, or
any other deep learning framework for that matter. If users want to use
deep learning in production, some other more robust solutions are
available: SparkNet, CaffeOnSpark, DeepLearning4J.

Best regards


Tim Hunter


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Marcelo Vanzin
Also, just wanted to point out something:

On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin  wrote:
> Thanks for initiating this discussion. I merged the pull request because it
> was unblocking another major piece of work for Spark 2.0: not requiring
> assembly jars

While I do agree that's more important, the streaming assemblies
weren't really blocking that work. The fact that there are still
streaming assemblies in the build kinda proves that point. :-)

I even filed a task to look at getting rid of the streaming assemblies
(SPARK-13575; just the assemblies though, not the code) but while
working on it found it would be more complicated than expected, and
decided against it given that it didn't really affect work on the
other assemblies.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself,
so you'll have to change your deps to use the 2.11 versions of Spark.
e.g. spark-streaming_2.10 -> spark-streaming_2.11.
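
To make the fix concrete: once the project itself is on Scala 2.11.7, every Spark
artifact needs the _2.11 suffix (spark-core_2.11, spark-sql_2.11,
spark-streaming_2.11, all at version 1.6.1) instead of _2.10. As an illustration
only (the original pom is Maven, not sbt), here is the same constraint written as
an sbt build, where %% appends the Scala binary-version suffix automatically:

// build.sbt sketch; with Maven, the equivalent change is renaming each
// artifactId from *_2.10 to *_2.11.
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  // "%%" resolves these to spark-core_2.11, spark-sql_2.11, spark-streaming_2.11
  "org.apache.spark" %% "spark-core"      % "1.6.1",
  "org.apache.spark" %% "spark-sql"       % "1.6.1",
  "org.apache.spark" %% "spark-streaming" % "1.6.1"
)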

On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen  wrote:

> See the instructions in the Spark documentation:
> https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
>
> On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna <
> satyajit.apas...@gmail.com> wrote:
>
>>
>>
>> Hi,
>>
>> Scala version: 2.11.7 (had to upgrade the Scala version to enable case
>> classes to accept more than 22 parameters.)
>>
>> Spark version: 1.6.1.
>>
>> PFB pom.xml.
>>
>> Getting the below error when trying to set up Spark in the IntelliJ IDE:
>>
>> 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class at
>> org.apache.spark.util.TimeStampedWeakValueHashMap.(TimeStampedWeakValueHashMap.scala:42)
>> at org.apache.spark.SparkContext.(SparkContext.scala:298) at
>> com.examples.testSparkPost$.main(testSparkPost.scala:27) at
>> com.examples.testSparkPost.main(testSparkPost.scala) at
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606) at
>> com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) Caused
>> by: java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class at
>> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
>> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more
>>
>> pom.xml:
>>
>> http://maven.apache.org/POM/4.0.0; 
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>> http://maven.apache.org/maven-v4_0_0.xsd;>
>> 4.0.0
>> StreamProcess
>> StreamProcess
>> 0.0.1-SNAPSHOT
>> ${project.artifactId}
>> This is a boilerplate maven project to start using Spark in 
>> Scala
>> 2010
>>
>> 
>> 1.6
>> 1.6
>> UTF-8
>> 2.10
>> 
>> 2.11.7
>> 
>>
>> 
>> 
>> 
>> cloudera-repo-releases
>> https://repository.cloudera.com/artifactory/repo/
>> 
>> 
>>
>> 
>> src/main/scala
>> src/test/scala
>> 
>> 
>> 
>> maven-assembly-plugin
>> 
>> 
>> package
>> 
>> single
>> 
>> 
>> 
>> 
>> 
>> 
>> jar-with-dependencies
>> 
>> 
>> 
>> 
>> 
>> net.alchim31.maven
>> scala-maven-plugin
>> 3.2.2
>> 
>> 
>> 
>> compile
>> testCompile
>> 
>> 
>> 
>> 
>> -dependencyfile
>> 
>> ${project.build.directory}/.scala_dependencies
>> 
>> 
>> 
>> 
>> 
>>
>> 
>> 
>> maven-assembly-plugin
>> 2.4.1
>> 
>> 
>> jar-with-dependencies
>> 
>> 
>> 
>> 
>> make-assembly
>> package
>> 
>> single
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> org.scala-lang
>> scala-library
>> ${scala.version}
>> 
>> 
>> org.mongodb.mongo-hadoop
>> mongo-hadoop-core
>> 1.4.2
>> 
>> 
>> javax.servlet
>> servlet-api
>> 
>> 
>> 

Re: pull request template

2016-03-19 Thread Reynold Xin
I think it'd make sense to have the merge script automatically remove some
parts of the template, if they were not removed by the contributor. That
seems trivial to do.


On Tue, Mar 15, 2016 at 3:59 PM, Joseph Bradley 
wrote:

> +1 for keeping the template
>
> I figure any template will require conscientiousness & enforcement.
>
> On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen  wrote:
>
>> The template is a great thing as it gets instructions even more right
>> in front of people.
>>
>> Another idea is to just write a checklist of items, like "did you
>> describe your changes? did you test? etc." with instructions to delete
>> the text and replace with a description. This keeps the boilerplate
>> titles out of the commit message.
>>
>> The special character and post processing just takes that a step further.
>>
>> On Sat, Mar 12, 2016 at 1:31 AM, Marcelo Vanzin 
>> wrote:
>> > Hey all,
>> >
>> > Just wanted to ask: how do people like this new template?
>> >
>> > While I think it's great to have instructions for people to write
>> > proper commit messages, I think the current template has a few
>> > downsides.
>> >
>> > - I tend to write verbose commit messages already when I'm preparing a
>> > PR. Now when I open the PR I have to edit the summary field to remove
>> > all the boilerplate.
>> > - The template ends up in the commit messages, and sometimes people
>> > forget to remove even the instructions.
>> >
>> > Instead, what about changing the template a bit so that it just has
>> > instructions prepended with some character, and have those lines
>> > removed by the merge_spark_pr.py script? We could then even throw in a
>> > link to the wiki as Sean suggested since it won't end up in the final
>> > commit messages.
>> >
>> >
>> > On Fri, Feb 19, 2016 at 11:53 AM, Reynold Xin 
>> wrote:
>> >> We can add that too - just need to figure out a good way so people
>> don't
>> >> leave a lot of the unnecessary "guideline" messages in the template.
>> >>
>> >> The contributing guide is great, but unfortunately it is not as
>> noticeable
>> >> and is often ignored. It's good to have this full-fledged contributing
>> >> guide, and then have a very lightweight version of that in the form of
>> >> templates to force contributors to think about all the important
>> aspects
>> >> outlined in the contributing guide.
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Feb 19, 2016 at 2:36 AM, Sean Owen  wrote:
>> >>>
>> >>> All that seems fine. All of this is covered in the contributing wiki,
>> >>> which is linked from CONTRIBUTING.md (and should be from the
>> >>> template), but people don't seem to bother reading it. I don't mind
>> >>> duplicating some key points, and even a more explicit exhortation to
>> >>> read the whole wiki, before considering opening a PR. We spend way too
>> >>> much time asking people to fix things they should have taken 60
>> >>> seconds to do correctly in the first place.
>> >>>
>> >>> On Fri, Feb 19, 2016 at 10:33 AM, Iulian Dragoș
>> >>>  wrote:
>> >>> > It's a good idea. I would add in there the spec for the PR title. I
>> >>> > always
>> >>> > get wrong the order between Jira and component.
>> >>> >
>> >>> > Moreover, CONTRIBUTING.md is also lacking them. Any reason not to
>> add it
>> >>> > there? I can open PRs for both, but maybe you want to keep that
>> info on
>> >>> > the
>> >>> > wiki instead.
>> >>> >
>> >>> > iulian
>> >>> >
>> >>> > On Thu, Feb 18, 2016 at 4:18 AM, Reynold Xin 
>> >>> > wrote:
>> >>> >>
>> >>> >> Github introduced a new feature today that allows projects to
>> define
>> >>> >> templates for pull requests. I pushed a very simple template to the
>> >>> >> repository:
>> >>> >>
>> >>> >>
>> >>> >>
>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>> >>> >>
>> >>> >>
>> >>> >> Over time I think we can see how this works and perhaps add a small
>> >>> >> checklist to the pull request template so contributors are reminded
>> >>> >> every
>> >>> >> time they submit a pull request the important things to do in a
>> pull
>> >>> >> request
>> >>> >> (e.g. having proper tests).
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> ## What changes were proposed in this pull request?
>> >>> >>
>> >>> >> (Please fill in changes proposed in this fix)
>> >>> >>
>> >>> >>
>> >>> >> ## How was the this patch tested?
>> >>> >>
>> >>> >> (Please explain how this patch was tested. E.g. unit tests,
>> integration
>> >>> >> tests, manual tests)
>> >>> >>
>> >>> >>
>> >>> >> (If this patch involves UI changes, please attach a screenshot;
>> >>> >> otherwise,
>> >>> >> remove this)
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> >
>> >>> > --
>> >>> > Iulian Dragos
>> >>> >
>> >>> > --
>> >>> > Reactive Apps on the JVM
>> >>> > www.typesafe.com
>> >>> >
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Marcelo
>> >
>> > 

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
OK cool. I'll test the hadoop-2.6 package and check back here if it's still
broken.

Just curious: How did those packages all get corrupted (if we know)? Seems
like a strange thing to happen.
On Thu, Mar 17, 2016 at 11:57 AM, Michael Armbrust wrote:

> Patrick reuploaded the artifacts, so it should be fixed now.
> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" 
> wrote:
>
>> Looks like the other packages may also be corrupt. I’m getting the same
>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>
>> Nick
>> ​
>>
>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu  wrote:
>>
>>> On Linux, I got:
>>>
>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> gzip: stdin: unexpected end of file
>>> tar: Unexpected EOF in archive
>>> tar: Unexpected EOF in archive
>>> tar: Error is not recoverable: exiting now
>>>
>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>

 https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz

 Does anyone else have trouble unzipping this? How did this happen?

 What I get is:

 $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
 gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
 gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed

 Seems like a strange type of problem to come across.

 Nick
 ​

>>>
>>>


Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Mridul Muralidharan
We use it in executors to get to:
a) the Spark conf (for getting to the Hadoop config in a map task that does
custom writing of side-files)
b) the shuffle manager (to get a shuffle reader)

Not sure if there are alternative ways to get to these.

Regards,
Mridul
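
For context, a rough sketch of the executor-side pattern described above, assuming
SparkEnv.get remains reachable from task code; the helper name and the
spark.hadoop.* prefix handling are illustrative only:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkEnv

// Called from inside a task, e.g. within rdd.mapPartitions { ... }, on an executor.
def executorSideHadoopConf(): Configuration = {
  val sparkConf = SparkEnv.get.conf              // (a) executor-side SparkConf
  val hadoopConf = new Configuration()
  // Re-apply any "spark.hadoop.*" overrides that were set on the SparkConf.
  sparkConf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
  }
  hadoopConf
}

// (b) the shuffle manager is likewise reached via SparkEnv.get.shuffleManager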

On Wed, Mar 16, 2016 at 2:52 PM, Reynold Xin  wrote:
> Any objections? Please articulate your use case. SparkEnv is a weird one
> because it was documented as "private" but not marked as so in class
> visibility.
>
>  * NOTE: This is not intended for external use. This is exposed for Shark
> and may be made private
>  *   in a future release.
>
>
> I do see Hive using it to get the config variable. That can probably be
> propagated through other means.
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Reynold Xin
Thanks for initiating this discussion. I merged the pull request because it
was unblocking another major piece of work for Spark 2.0: not requiring
assembly jars, which is arguably a lot more important than sources that are
less frequently used. I take full responsibility for that.

I think it's inaccurate to call them "backend" because it makes these
things sound a lot more serious, when in reality they are a bunch of
connectors to less frequently used streaming data sources (e.g. mqtt,
flume). But that's not that important here.

Another important factor is that over time, with the development of
structure streaming, we'd provide a new API for streaming sources that
unifies the way to connect arbitrary sources, and as a result all of these
sources need to be rewritten anyway. This is similar to the RDD ->
DataFrame transition for data sources, although it was initially painful,
but in the long run provides much better experience for end-users because
they only need to learn a single API for all sources, and it becomes
trivial to transition from one source to another, without actually
impacting business logic.

So the truth is that in the long run, the existing connectors will be
replaced by new ones, and they have been causing minor issues here and
there in the code base. Now issues like these are never black and white. By
moving them out, we'd require users to at least change the maven coordinate
in their build file (although things can still be made binary and source
compatible). So I made the call and asked the contributor to keep Kafka and
Kinesis in, because those are the most widely used (and could be more
contentious), and move everything else out.

I have personally done enough data sources or 3rd party packages for Spark
on github that I can setup a github repo with CI and maven publishing in
just under an hour. I do not expect a lot of changes to these packages
because the APIs have been fairly stable. So the thing I was optimizing for
was to minimize the time we need to spent on these packages given the
(expected) low activity and the shift to focus on structured streaming, and
also minimize the chance to break user apps to provide the best user
experience.

Github repo seems the simplest choice to me. I also made another decision
to provide separate repos (and thus issue trackers) on github for these
packages. The reason is that these connectors have very disjoint
communities. For example, the community that care about mqtt is likely very
different from the community that care about akka. It is much easier to
track all of these.

Logistics wise -- things are still in flux. I think it'd make a lot of
sense to give existing Spark committers (or at least the ones that have
contributed to streaming) write access to the github repos. IMHO, it is not
in any of the major Spark contributing organizations' strategic interest to
"own" these projects, especially considering most of the activities will
switch to structured streaming.

If one really feels strongly that we should go through all the overhead to
setup an ASF subproject for these modules that won't work with the new
structured streaming, and want to spearhead to setup separate repos
(preferably one subproject per connector), CI, separate JIRA, governance,
READMEs, voting, we can discuss that. Until then, I'd keep the github
option open because IMHO it is what works the best for end users (including
discoverability, issue tracking, release publishing, ...).






On Thu, Mar 17, 2016 at 1:50 PM, Cody Koeninger  wrote:

> Anyone can fork apache licensed code.  Committers can approve pull
> requests that delete code from asf repos.  Because those two things
> happen near each other in time, it's somehow a process violation?
>
> I think the discussion would be better served by concentrating on how
> we're going to solve the problem and move forward.
>
> On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan 
> wrote:
> > I am not referring to code edits - but to migrating submodules and
> > code currently in Apache Spark to 'outside' of it.
> > If I understand correctly, assets from Apache Spark are being moved
> > out of it into thirdparty external repositories - not owned by Apache.
> >
> > At a minimum, dev@ discussion (like this one) should be initiated.
> > As PMC is responsible for the project assets (including code), signoff
> > is required for it IMO.
> >
> > More experienced Apache members might be opine better in case I got it
> wrong !
> >
> >
> > Regards,
> > Mridul
> >
> >
> > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger 
> wrote:
> >> Why would a PMC vote be necessary on every code deletion?
> >>
> >> There was a Jira and pull request discussion about the submodules that
> >> have been removed so far.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13843
> >>
> >> There's another ongoing one about Kafka specifically
> >>
> >> 

Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-19 Thread Pete Robbins
This seems to me to be unnecessarily restrictive. These are very useful
extension points for adding 3rd party sources and sinks.

I intend to make an Elasticsearch sink available on spark-packages but this
will require a single class, the sink, to be in the org.apache.spark
package tree. I could submit the package as a PR to the Spark codebase, and
I'd be happy to do that but it could be a completely separate add-on.

There are similar issues with writing a 3rd party metrics source which may
not be of interest to the community at large so would probably not warrant
inclusion in the Spark codebase.

Any thoughts?
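
For reference, this is roughly what a third-party sink has to look like today,
assuming the Sink trait keeps its current shape (start/stop/report) and the
reflective three-argument constructor that MetricsSystem expects. Because the
trait is private[spark], the class has to sit inside the org.apache.spark package
tree, which is exactly the restriction being questioned. A console reporter stands
in here for a real Elasticsearch reporter:

// Has to live under org.apache.spark.* while Sink stays private[spark].
package org.apache.spark.metrics.sink

import java.util.Properties
import java.util.concurrent.TimeUnit

import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

import org.apache.spark.SecurityManager

// Minimal 3rd-party sink sketch; MetricsSystem instantiates sinks reflectively
// with a (Properties, MetricRegistry, SecurityManager) constructor.
class DemoConsoleSink(
    val properties: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager)
  extends Sink {

  private val pollPeriodSeconds =
    Option(properties.getProperty("period")).map(_.toInt).getOrElse(10)

  private val reporter = ConsoleReporter.forRegistry(registry).build()

  override def start(): Unit = reporter.start(pollPeriodSeconds, TimeUnit.SECONDS)
  override def stop(): Unit = reporter.stop()
  override def report(): Unit = reporter.report()
}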


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Mridul Muralidharan
I am not referring to code edits - but to migrating submodules and
code currently in Apache Spark to 'outside' of it.
If I understand correctly, assets from Apache Spark are being moved
out of it into thirdparty external repositories - not owned by Apache.

At a minimum, dev@ discussion (like this one) should be initiated.
As PMC is responsible for the project assets (including code), signoff
is required for it IMO.

More experienced Apache members might be opine better in case I got it wrong !


Regards,
Mridul


On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger  wrote:
> Why would a PMC vote be necessary on every code deletion?
>
> There was a Jira and pull request discussion about the submodules that
> have been removed so far.
>
> https://issues.apache.org/jira/browse/SPARK-13843
>
> There's another ongoing one about Kafka specifically
>
> https://issues.apache.org/jira/browse/SPARK-13877
>
>
> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan  wrote:
>>
>> I was not aware of a discussion in Dev list about this - agree with most of
>> the observations.
>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>
>> Regards
>> Mridul
>>
>>
>>
>> On Thursday, March 17, 2016, Marcelo Vanzin  wrote:
>>>
>>> Hello all,
>>>
>>> Recently a lot of the streaming backends were moved to a separate
>>> project on github and removed from the main Spark repo.
>>>
>>> While I think the idea is great, I'm a little worried about the
>>> execution. Some concerns were already raised on the bug mentioned
>>> above, but I'd like to have a more explicit discussion about this so
>>> things don't fall through the cracks.
>>>
>>> Mainly I have three concerns.
>>>
>>> i. Ownership
>>>
>>> That code used to be run by the ASF, but now it's hosted in a github
>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>> problematic.
>>>
>>> ii. Governance
>>>
>>> Similar to the above; who has commit access to the above repos? Will
>>> all the Spark committers, present and future, have commit access to
>>> all of those repos? Are they still going to be considered part of
>>> Spark and have release management done through the Spark community?
>>>
>>>
>>> For both of the questions above, why are they not turned into
>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>> a mechanism to do that, without the need to keep the code in the main
>>> Spark repo, right?
>>>
>>> iii. Usability
>>>
>>> This is another thing I don't see discussed. For Scala-based code
>>> things don't change much, I guess, if the artifact names don't change
>>> (another reason to keep things in the ASF?), but what about python?
>>> How are pyspark users expected to get that code going forward, since
>>> it's not in Spark's pyspark.zip anymore?
>>>
>>>
>>> Is there an easy way of keeping these things within the ASF Spark
>>> project? I think that would be better for everybody.
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
Looks like the other packages may also be corrupt. I’m getting the same
error for the Spark 1.6.1 / Hadoop 2.4 package.

https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz

Nick
​

On Wed, Mar 16, 2016 at 8:28 PM Ted Yu  wrote:

> On Linux, I got:
>
> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>
> gzip: stdin: unexpected end of file
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now
>
> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>
>> Does anyone else have trouble unzipping this? How did this happen?
>>
>> What I get is:
>>
>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>
>> Seems like a strange type of problem to come across.
>>
>> Nick
>> ​
>>
>
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Ted Yu
Same with hadoop 2.3 tar ball:

$ tar zxf spark-1.6.1-bin-hadoop2.3.tgz

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

On Wed, Mar 16, 2016 at 5:47 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Looks like the other packages may also be corrupt. I’m getting the same
> error for the Spark 1.6.1 / Hadoop 2.4 package.
>
>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>
> Nick
> ​
>
> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu  wrote:
>
>> On Linux, I got:
>>
>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>
>> gzip: stdin: unexpected end of file
>> tar: Unexpected EOF in archive
>> tar: Unexpected EOF in archive
>> tar: Error is not recoverable: exiting now
>>
>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> Does anyone else have trouble unzipping this? How did this happen?
>>>
>>> What I get is:
>>>
>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>
>>> Seems like a strange type of problem to come across.
>>>
>>> Nick
>>> ​
>>>
>>
>>


Re: graceful shutdown in external data sources

2016-03-19 Thread Steve Loughran

On 17 Mar 2016, at 17:46, Dan Burkert wrote:

 Looks like it uses a Hadoop equivalent internally, though, so I'll look into 
using that.  Good tip about timeouts, thanks.


Don't think that's actually tagged as @Public, but it would upset too many 
people if it broke, myself included.
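
The Hadoop equivalent referred to here is org.apache.hadoop.util.ShutdownHookManager.
Below is a hedged sketch of registering a cleanup callback with it; the priority
value and the closeClient callback are assumptions, and note that hooks only run
once the JVM actually begins shutting down, so this does not by itself resolve a
hang caused by lingering non-daemon threads:

import org.apache.hadoop.util.ShutdownHookManager

// Illustrative registration of a cleanup hook; higher priorities run earlier,
// so the client can flush before lower-priority hooks (e.g. filesystem close) run.
def registerCleanupHook(closeClient: () => Unit): Unit = {
  val priority = 50 // assumed value for illustration
  ShutdownHookManager.get().addShutdownHook(new Runnable {
    override def run(): Unit = closeClient()
  }, priority)
}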


Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Daniel Siegmann
Hi Nick,

Thanks again for your help with this. Did you create a ticket in JIRA for
investigating sparse models in LR and / or multivariate summariser? If so,
can you give me the issue key(s)? If not, would you like me to create these
tickets?

I'm going to look into this some more and see if I can figure out how to
implement these fixes.

~Daniel Siegmann
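
As background for the index-mapping discussion in the quoted thread below, here is
a minimal sketch of the ID-compaction idea (collect the distinct IDs, zipWithIndex
them, then build sparse vectors over the compact index). The Record case class and
its field names are hypothetical:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Hypothetical record: the numeric "Thing" IDs present in each row, plus a label.
case class Record(label: Double, thingIds: Seq[Int])

def toCompactVectors(records: RDD[Record]) = {
  // Index the (fewer than 10k) distinct IDs instead of using raw IDs
  // (which run up to ~20 million) as vector positions.
  val idIndex: Map[Int, Int] = records
    .flatMap(_.thingIds)
    .distinct()
    .zipWithIndex()
    .mapValues(_.toInt)
    .collectAsMap()
    .toMap
  val numFeatures = idIndex.size

  records.map { r =>
    val indices = r.thingIds.map(idIndex).distinct.sorted.toArray
    val values = Array.fill(indices.length)(1.0)
    (r.label, Vectors.sparse(numFeatures, indices, values))
  }
}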

On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath 
wrote:

> Also adding dev list in case anyone else has ideas / views.
>
> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath 
> wrote:
>
>> Thanks for the feedback.
>>
>> I think Spark can certainly meet your use case when your data size scales
>> up, as the actual model dimension is very small - you will need to use
>> those indexers or some other mapping mechanism.
>>
>> There is ongoing work for Spark 2.0 to make it easier to use models
>> outside of Spark - also see PMML export (I think mllib logistic regression
>> is supported but I have to check that). That will help use spark models in
>> serving environments.
>>
>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>> also a ticket for multivariate summariser (though I don't think in practice
>> there will be much to gain).
>>
>>
>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>> daniel.siegm...@teamaol.com> wrote:
>>
>>> Thanks for the pointer to those indexers, those are some good examples.
>>> A good way to go for the trainer and any scoring done in Spark. I will
>>> definitely have to deal with scoring in non-Spark systems though.
>>>
>>> I think I will need to scale up beyond what single-node liblinear can
>>> practically provide. The system will need to handle much larger sub-samples
>>> of this data (and other projects might be larger still). Additionally, the
>>> system needs to train many models in parallel (hyper-parameter optimization
>>> with n-fold cross-validation, multiple algorithms, different sets of
>>> features).
>>>
>>> Still, I suppose we'll have to consider whether Spark is the best system
>>> for this. For now though, my job is to see what can be achieved with Spark.
>>>
>>>
>>>
>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 Ok, I think I understand things better now.

 For Spark's current implementation, you would need to map those
 features as you mention. You could also use say StringIndexer ->
 OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
 the mapping and training (e.g.
 http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
 Pipeline supports persistence.

 But it depends on your scoring use case too - a Spark pipeline can be
 saved and then reloaded, but you need all of Spark dependencies in your
 serving app which is often not ideal. If you're doing bulk scoring offline,
 then it may suit.

 Honestly though, for that data size I'd certainly go with something
 like Liblinear :) Spark will ultimately scale better with # training
 examples for very large scale problems. However there are definitely
 limitations on model dimension and sparse weight vectors currently. There
 are potential solutions to these but they haven't been implemented as yet.

 On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
 daniel.siegm...@teamaol.com> wrote:

> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
> nick.pentre...@gmail.com> wrote:
>
>> Would you mind letting us know the # training examples in the
>> datasets? Also, what do your features look like? Are they text, 
>> categorical
>> etc? You mention that most rows only have a few features, and all rows
>> together have a few 10,000s features, yet your max feature value is 20
>> million. How are your constructing your feature vectors to get a 20 
>> million
>> size? The only realistic way I can see this situation occurring in 
>> practice
>> is with feature hashing (HashingTF).
>>
>
> The sub-sample I'm currently training on is about 50K rows, so ...
> small.
>
> The features causing this issue are numeric (int) IDs for ... let's
> call it "Thing". For each Thing in the record, we set the feature
> Thing.id to a value of 1.0 in our vector (which is of course a
> SparseVector). I'm not sure how IDs are generated for Things, but
> they can be large numbers.
>
> The largest Thing ID is around 20 million, so that ends up being the
> size of the vector. But in fact there are fewer than 10,000 unique Thing
> IDs in this data. The mean number of features per record in what I'm
> currently training against is 41, while the maximum for any given record
> was 1754.
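
To make that concrete, one record's vector is built roughly like this (the
IDs below are made up; only the shape matters):

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical Thing IDs present in a single record.
val thingIds = Array(17, 93021, 19876543)

// The vector size is driven by the largest possible ID (~20 million),
// even though only ~10,000 distinct IDs ever appear in the data.
val vectorSize = 20000000
val features = Vectors.sparse(vectorSize, thingIds.map(id => (id, 1.0)))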
>
> It is possible to map the features into a small set (just need to
> zipWithIndex), but this is undesirable because of the added complexity 
> (not
> 
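
For what it's worth, the zipWithIndex remapping mentioned above would look
roughly like the sketch below (names and types are illustrative; it assumes
each record carries the array of distinct Thing IDs present in it):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// records: one Array[Int] of distinct Thing IDs per training example.
def remapToDenseIndex(records: RDD[Array[Int]]) = {
  // Build a compact index over the ~10K distinct Thing IDs actually observed.
  val idToIndex = records.flatMap(ids => ids).distinct().zipWithIndex()
    .mapValues(_.toInt).collectAsMap()
  val numFeatures = idToIndex.size
  val broadcastIndex = records.sparkContext.broadcast(idToIndex)

  // Vectors are now sized by the number of distinct IDs, not by the largest raw ID.
  records.map { ids =>
    Vectors.sparse(numFeatures, ids.map(id => (broadcastIndex.value(id), 1.0)))
  }
}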

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread chrismattmann
So, my comment here is that code *cannot* be removed from an Apache project
if a VETO has been issued, which so far I haven't seen, though maybe Marcelo
can clarify that.

However, if a VETO was issued, then the code cannot be removed and must be
put back. Anyone can fork anything; our license allows that. But the
community itself must steward the code, and part of that is hearing
everyone's voice within that community before acting.

Cheers,
Chris






Re: Spark build with scala-2.10 fails ?

2016-03-19 Thread Yin Yang
The build was broken as of this morning.

Created PR:
https://github.com/apache/spark/pull/11787

On Wed, Mar 16, 2016 at 11:46 PM, Jeff Zhang  wrote:

> Anyone can pass the spark build with scala-2.10 ?
>
>
> [info] Compiling 475 Scala sources and 78 Java sources to
> /Users/jzhang/github/spark/core/target/scala-2.10/classes...
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:30:
> object ShuffleServiceHeartbeat is not a member of package
> org.apache.spark.network.shuffle.protocol.mesos
> [error] import
> org.apache.spark.network.shuffle.protocol.mesos.{RegisterDriver,
> ShuffleServiceHeartbeat}
> [error]^
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:87:
> not found: type ShuffleServiceHeartbeat
> [error] def unapply(h: ShuffleServiceHeartbeat): Option[String] =
> Some(h.getAppId)
> [error]^
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:83:
> value getHeartbeatTimeoutMs is not a member of
> org.apache.spark.network.shuffle.protocol.mesos.RegisterDriver
> [error]   Some((r.getAppId, new AppState(r.getHeartbeatTimeoutMs,
> System.nanoTime(
> [error]^
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:451:
> too many arguments for method registerDriverWithShuffleService: (x$1:
> String, x$2: Int)Unit
> [error]   .registerDriverWithShuffleService(
> [error]^
> [error] four errors found
> [error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s]
> --
> Best Regards
>
> Jeff Zhang
>


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Luciano Resende
If the intention is to actually decouple these connectors and give them a
life of their own, I would have expected that they would still be hosted as
separate git repositories inside Apache, even though users would not really
see much difference, as they would still be mirrored on GitHub. This also
makes it much easier on the legal departments of upstream consumers and
customers, because the code still follows the well-received and trusted
Apache Governance and Apache Release Policies. As for implementation
details, we could have multiple repositories if we see a lot of fragmented
releases, or a single "connectors" repository, which on our side would make
administration easier.

On Thu, Mar 17, 2016 at 2:33 PM, Marcelo Vanzin  wrote:

> Hi Reynold, thanks for the info.
>
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin  wrote:
> > If one really feels strongly that we should go through all the overhead
> to
> > setup an ASF subproject for these modules that won't work with the new
> > structured streaming, and want to spearhead to setup separate repos
> > (preferably one subproject per connector), CI, separate JIRA, governance,
> > READMEs, voting, we can discuss that. Until then, I'd keep the github
> option
> > open because IMHO it is what works the best for end users (including
> > discoverability, issue tracking, release publishing, ...).
>

Agree that there might be a little overhead, but there are ways to minimize
this, and I am sure there are volunteers willing to help in favor of having
a more unified project. Breaking things into multiple projects, and having
to manage the matrix of supported versions, would be a far worse overhead.


>
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
>
> From my naive point of view, when I say "sub project" I assume that
> it's as simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the project we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.
>
>
Subprojects, or even sending this back to the incubator as a "connectors"
project, would be better than a public GitHub repo per package, in my opinion.



Now, if this move is signaling to customers that the Streaming API as in 1.x
is going away in favor of the new structured streaming APIs, then I guess
this is a completely different discussion.


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/