Re: Is "mllib" no longer Experimental?

2015-10-14 Thread Patrick Wendell
I would tend to agree with this approach. We should audit all
@Experimental labels before the 1.6 release and clear them out when
appropriate.

- Patrick

On Wed, Oct 14, 2015 at 2:13 AM, Sean Owen  wrote:

> Someone asked, is "ML pipelines" stable? I said, no, most of the key
> classes are still marked @Experimental, which matches my impression that
> things may still be subject to change.
>
> But then, I see that MLlib classes, which are de facto not seeing much
> further work and no API changes, are also mostly marked @Experimental. If,
> generally, no more significant work is going into MLlib classes, is it time
> to remove most or all of those labels, to keep the label meaningful?
>
> Sean
>


Is "mllib" no longer Experimental?

2015-10-14 Thread Sean Owen
Someone asked, is "ML pipelines" stable? I said, no, most of the key
classes are still marked @Experimental, which matches my impression that
things may still be subject to change.

But then, I see that MLlib classes, which are de facto not seeing much
further work and no API changes, are also mostly marked @Experimental. If,
generally, no more significant work is going into MLlib classes, is it time
to remove most or all of those labels, to keep the label meaningful?

Sean
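
For reference, the label under discussion is Spark's
org.apache.spark.annotation.Experimental annotation. A minimal sketch of how
it is applied in the Spark source (the class below is illustrative only, not
an actual MLlib class):

import org.apache.spark.annotation.Experimental

/**
 * :: Experimental ::
 * A hypothetical estimator, shown only to illustrate the marker; APIs
 * annotated this way are documented as subject to change between releases.
 */
@Experimental
class IllustrativeEstimator {
  def fit(values: Seq[Double]): Double =
    if (values.isEmpty) 0.0 else values.sum / values.size
}

Removing the annotation (and the ":: Experimental ::" Scaladoc tag) from
classes like this is the "clearing out" discussed above.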


Re: Status of SBT Build

2015-10-14 Thread Patrick Wendell
Hi Jakob,

There is a temporary issue with the Scala 2.11 build in SBT. The problem is
that this wasn't previously covered by our automated tests, so it broke
without us knowing; this has been actively discussed on the dev list in the
last 24 hours. I am trying to get it working in our test harness today.

In terms of fixing the underlying issues, I am not sure whether there is a
JIRA for it yet, but we should make one if not. Does anyone know?

- Patrick

On Wed, Oct 14, 2015 at 12:13 PM, Jakob Odersky  wrote:

> Hi everyone,
>
> I've been having trouble building Spark with SBT recently. Scala 2.11
> doesn't work, and in all cases I get a large number of warnings and even
> errors in the tests.
>
> I was therefore wondering what the official status of the SBT build is.
> Is it very new and still buggy, or unmaintained and "falling to pieces"?
>
> In any case, I would be glad to help with any issues in setting up a clean
> and working build with SBT.
>
> thanks,
> --Jakob
>


Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-14 Thread Dibyendu Bhattacharya
Hi,

I have raised a JIRA (https://issues.apache.org/jira/browse/SPARK-11045) to
track the discussion, but I am also mailing the dev group for your opinion.
Some discussion has already happened in the JIRA, and I would love to hear
what others think. You can comment directly on the JIRA if you wish.

This Kafka consumer has been around for a while in spark-packages
(http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) and I see
that many people have started using it. I am now thinking of contributing it
back to the Apache Spark core project so that it can get better support,
visibility, and adoption.

A few points about this consumer:

*Why this is needed, and how I position this consumer:*

This consumer is NOT a replacement for the existing DirectStream API.
DirectStream solves the problems of "Exactly Once" semantics and "Global
Ordering" of messages, but it achieves this at a cost: the overhead of
maintaining the offsets externally, limited parallelism while processing the
RDD (as the RDD partitioning is the same as the Kafka partitioning), and
higher latency while processing the RDD (as messages are fetched only when
the RDD is processed). There are many users who do not need "Exactly Once"
semantics or "Global Ordering" of messages, or whose ordering is managed in
an external store (say HBase), and who want more parallelism and lower
latency in their streaming channel. At this point Spark does not have a good
fallback option in the form of a Receiver-based API: the present
Receiver-based API uses the Kafka high-level API, which performs poorly and
has serious issues. [For this reason Kafka is coming up with a new
high-level consumer API in 0.9.]
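
To make the trade-off concrete, here is a minimal sketch of the two modes
that already ship with Spark's Kafka integration (the ZooKeeper and broker
addresses, topic name, and batch interval are placeholders). The
receiver-based call below goes through the high-level-consumer path
criticized above; it is what the proposed consumer would be an alternative
to, not the proposed consumer itself:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectVsReceiverSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-sketch"), Seconds(5))

    // Receiver-based stream: uses the Kafka high-level consumer via ZooKeeper;
    // offsets are tracked by the consumer group, parallelism comes from receivers.
    val receiverStream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "my-group", Map("events" -> 2))

    // Direct stream: no receiver; one RDD partition per Kafka partition, with
    // offsets tracked by Spark itself (the overhead discussed above).
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))

    receiverStream.count().print()
    directStream.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}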

The consumer I implemented uses the Kafka low-level API, which gives better
performance. It has built-in fault-tolerance features for recovering from
all failure modes. It extends code from the Storm Kafka Spout, which has
been around for some time, has matured over the years, and has the Kafka
fault-tolerance capabilities built in. This same Kafka consumer for Spark is
presently running in various production scenarios and has already been
adopted by many in the Spark community.

*Why can't we fix the existing Receiver-based API in Spark?*

This is not possible unless we move to the Kafka low-level API, or wait for
Kafka 0.9, where the high-level consumer API is being rewritten, and then
build yet another consumer for Kafka 0.9 customers. That approach does not
seem good to me. The Kafka low-level API that I use in my consumer (and that
DirectStream also uses) is not going to be deprecated in the near future. So
if the Kafka consumer for Spark uses the low-level API for the
Receiver-based mode as well, all Kafka customers, whether on 0.8.x today or
moving to 0.9, benefit from the same API. That makes maintenance easier,
since there is a single API to manage for any Kafka version, and it ensures
that both the Direct Stream and the Receiver mode use the same Kafka API.

*Concerns around low-level API complexity*

Yes, implementing a reliable consumer using the Kafka low-level consumer API
is complex. But the same has been done for the Storm Kafka Spout, which has
been stable for quite some time. This consumer for Spark is battle-tested
under various production loads, gives much better performance than the
existing Kafka consumers for Spark, and takes a better fault-tolerance
approach than the existing Receiver-based mode. I do not think code
complexity should be a major reason to deny the community a stable,
high-performance consumer. I am happy for anyone interested to benchmark it
against the other Kafka consumers for Spark and to do various fault testing
to verify what I am saying.

*Why can't this consumer continue to live in spark-packages?*

It could. But from what I see, many customers who want to fall back to a
Receiver-based mode, because they do not need "Exactly Once" semantics or
"Global Ordering", seem a little tentative about using a spark-packages
library for their critical streaming pipeline, and so they are forced onto
the faulty and buggy mode based on the Kafka high-level API. Having this
consumer be part of the Spark project would give it much higher adoption and
support from the community.

*Some major features of this consumer:*

This consumer controls the rate limit by maintaining a constant block size
(in bytes), whereas the default rate limiting in other Spark consumers is
done by number of messages. That is a problem when Kafka carries messages of
different sizes: if rate control is done by number of messages, there is no
deterministic way to know the actual block sizes and memory utilization.
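
As a rough illustration of the idea (a simplified sketch, not the
spark-packages implementation), rate control by block size means a block is
closed once a byte budget is hit, regardless of how many messages that
takes:

import scala.collection.mutable.ArrayBuffer

// Simplified sketch: fill one block up to a byte budget rather than a
// message count. An oversized message still gets its own block so the
// consumer always makes progress.
def fillBlock(messages: BufferedIterator[Array[Byte]],
              maxBlockBytes: Long): Seq[Array[Byte]] = {
  val block = ArrayBuffer.empty[Array[Byte]]
  var bytes = 0L
  while (messages.hasNext &&
         (block.isEmpty || bytes + messages.head.length <= maxBlockBytes)) {
    val msg = messages.next()   // head was only peeked; consume it now
    block += msg
    bytes += msg.length
  }
  block
}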

This consumer has a built-in PID controller which regulates the rate of
consumption, again by adjusting the block size, so that only the needed
amount of data is consumed from Kafka. The default Spark consumer instead
fetches a chunk of messages and then applies throttling to control the rate,
which can lead to excess I/O while consuming from Kafka.
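
A minimal sketch of what such a controller can look like (hypothetical code,
not the package's actual implementation): it looks at the gap between the
target and observed processing rates and nudges the next block size up or
down, clamped to configured bounds.

// Hypothetical PID controller over block size; kp/ki/kd and the byte bounds
// are tuning knobs, not values taken from the real consumer.
class BlockSizePid(kp: Double, ki: Double, kd: Double,
                   minBytes: Long, maxBytes: Long) {
  private var integral  = 0.0
  private var lastError = 0.0

  /** Returns the block size (in bytes) to request for the next batch. */
  def nextBlockSize(currentBytes: Long,
                    targetRate: Double,    // bytes/sec the pipeline can absorb
                    observedRate: Double,  // bytes/sec actually processed
                    dtSeconds: Double): Long = {
    val error      = targetRate - observedRate
    integral      += error * dtSeconds
    val derivative = (error - lastError) / dtSeconds
    lastError      = error
    val adjustment = kp * error + ki * integral + kd * derivative
    math.min(maxBytes, math.max(minBytes, (currentBytes + adjustment).toLong))
  }
}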

There are other features in this Consumer which we can discuss at length
once we are 

Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-14 Thread Ted Yu
Some old bits:

http://stackoverflow.com/questions/28162991/cant-run-spark-1-2-in-standalone-mode-on-mac
http://stackoverflow.com/questions/29412157/passing-hostname-to-netty

FYI

On Wed, Oct 14, 2015 at 7:10 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I’m setting the Spark master address via the SPARK_MASTER_IP environment
> variable in spark-env.sh, like spark-ec2 does.
>
> The funny thing is that Spark seems to accept this only if the value of
> SPARK_MASTER_IP is a DNS name and not an IP address.
>
> When I provide an IP address, I get errors in the log when starting the
> master:
>
> 15/10/15 01:47:31 ERROR NettyTransport: failed to bind to /54.210.XX.XX:7077, 
> shutting down Netty transport
>
> (XX is my redaction of the full IP address.)
>
> Am I misunderstanding something about how to use this environment variable?
>
> The spark-env.sh template indicates that either an IP address or a
> hostname should work, but my testing shows that only hostnames work.
>
> Nick
>
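
One quick way to see the hostname-vs-IP distinction from the JVM's point of
view (a generic diagnostic sketch, not Spark code; the address below is a
documentation placeholder, not the redacted one above):

import java.net.InetAddress

object MasterAddressCheck {
  def main(args: Array[String]): Unit = {
    // Pass the value you intend to put in SPARK_MASTER_IP.
    val candidate = args.headOption.getOrElse("203.0.113.10")
    val addr = InetAddress.getByName(candidate)
    println(s"forward-resolves to IP: ${addr.getHostAddress}")
    println(s"reverse-resolves to:    ${addr.getCanonicalHostName}")
  }
}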