Re: Spark 1.6.1

2016-02-02 Thread Mingyu Kim
Cool, thanks!

Mingyu

From:  Michael Armbrust 
Date:  Tuesday, February 2, 2016 at 10:48 AM
To:  Mingyu Kim 
Cc:  Romi Kuntsman , Hamel Kothari
, Ted Yu ,
"dev@spark.apache.org" , Punya Biswal
, Robert Kruszewski 
Subject:  Re: Spark 1.6.1

I'm waiting for a few last fixes to be merged.  Hoping to cut an RC in the
next few days.

On Tue, Feb 2, 2016 at 10:43 AM, Mingyu Kim  wrote:
> Hi all,
> 
> Is there an estimated timeline for the 1.6.1 release? Just wanted to check how the
> release is coming along. Thanks!
> 
> Mingyu
> 
> From: Romi Kuntsman 
> Date: Tuesday, February 2, 2016 at 3:16 AM
> To: Michael Armbrust 
> Cc: Hamel Kothari , Ted Yu ,
> "dev@spark.apache.org" 
> Subject: Re: Spark 1.6.1
> 
> Hi Michael,
> What about the memory leak bug?
> https://issues.apache.org/jira/browse/SPARK-11293
> Even after the memory rewrite in 1.6.0, it still happens in some cases.
> Will it be fixed for 1.6.1?
> Thanks,
> 
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
> 
> On Mon, Feb 1, 2016 at 9:59 PM, Michael Armbrust 
> wrote:
>> We typically do not allow changes to the classpath in maintenance releases.
>> 
>> On Mon, Feb 1, 2016 at 8:16 AM, Hamel Kothari  wrote:
>>> I noticed that the Jackson dependency was bumped to 2.5 in master for
>>> something spark-streaming related. Is there any reason that this upgrade
>>> can't be included with 1.6.1?
>>> 
>>> According to later comments on this thread:
>>> https://issues.apache.org/jira/browse/SPARK-8332
>>> and my personal experience, using Spark with Jackson 2.5 hasn't caused
>>> any issues, but it does have some useful new features. It should be fully
>>> backwards compatible, according to the Jackson folks.
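
For anyone who wants to try Jackson 2.5 without waiting on a Spark
upgrade, one option is to pin it in the application build itself. A
minimal build.sbt sketch, assuming sbt 0.13 and the 2.5.x line (the
2.5.3 version number below is an illustrative choice, not Spark's pin):

    // Force Jackson 2.5.x onto the application classpath, overriding
    // whatever version Spark's dependencies would otherwise resolve to.
    dependencyOverrides ++= Set(
      "com.fasterxml.jackson.core"    % "jackson-core"         % "2.5.3",
      "com.fasterxml.jackson.core"    % "jackson-databind"     % "2.5.3",
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.5.3"
    )

Note this only affects the application's own classpath; the Jackson
version shipped with the Spark cluster itself is unchanged.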
>>> 
>>> On Mon, Feb 1, 2016 at 10:29 AM Ted Yu  wrote:
 SPARK-12624 has been resolved.
 According to Wenchen, SPARK-12783 is fixed in the 1.6.0 release.
 
 Are there other blockers for Spark 1.6.1 ?
 
 Thanks
 
 On Wed, Jan 13, 2016 at 5:39 PM, Michael Armbrust 
 wrote:
> Hey All, 
> 
> While I'm not aware of any critical issues with 1.6.0, there are several
> corner cases that users are hitting with the Dataset API that are fixed in
> branch-1.6.  As such I'm considering a 1.6.1 release.
> 
> At the moment there are only two critical issues targeted for 1.6.1:
>  - SPARK-12624 - When schema is specified, we should treat undeclared
> fields as null (in Python)
>  - SPARK-12783 - Dataset map serialization error
> 
> When these are resolved I'll likely begin the release process.  If there
> are any other issues that we should wait for please contact me.
> 
> Michael
 
>> 
> 







Re: Spark 1.6.1

2016-02-02 Thread Mingyu Kim
Hi all,

Is there an estimated timeline for the 1.6.1 release? Just wanted to check how
the release is coming along. Thanks!

Mingyu

From:  Romi Kuntsman 
Date:  Tuesday, February 2, 2016 at 3:16 AM
To:  Michael Armbrust 
Cc:  Hamel Kothari , Ted Yu ,
"dev@spark.apache.org" 
Subject:  Re: Spark 1.6.1

Hi Michael,
What about the memory leak bug?
https://issues.apache.org/jira/browse/SPARK-11293

Even after the memory rewrite in 1.6.0, it still happens in some cases.
Will it be fixed for 1.6.1?
Thanks,

Romi Kuntsman, Big Data Engineer
http://www.totango.com


On Mon, Feb 1, 2016 at 9:59 PM, Michael Armbrust 
wrote:
> We typically do not allow changes to the classpath in maintenance releases.
> 
> On Mon, Feb 1, 2016 at 8:16 AM, Hamel Kothari  wrote:
>> I noticed that the Jackson dependency was bumped to 2.5 in master for
>> something spark-streaming related. Is there any reason that this upgrade
>> can't be included with 1.6.1?
>> 
>> According to later comments on this thread:
>> https://issues.apache.org/jira/browse/SPARK-8332
>> and my personal experience, using Spark with Jackson 2.5 hasn't caused any
>> issues, but it does have some useful new features. It should be fully
>> backwards compatible, according to the Jackson folks.
>> 
>> On Mon, Feb 1, 2016 at 10:29 AM Ted Yu  wrote:
>>> SPARK-12624 has been resolved.
>>> According to Wenchen, SPARK-12783 is fixed in the 1.6.0 release.
>>> 
>>> Are there other blockers for Spark 1.6.1 ?
>>> 
>>> Thanks
>>> 
>>> On Wed, Jan 13, 2016 at 5:39 PM, Michael Armbrust 
>>> wrote:
 Hey All, 
 
 While I'm not aware of any critical issues with 1.6.0, there are several
 corner cases that users are hitting with the Dataset API that are fixed in
 branch-1.6.  As such I'm considering a 1.6.1 release.
 
 At the moment there are only two critical issues targeted for 1.6.1:
  - SPARK-12624 - When schema is specified, we should treat undeclared
 fields as null (in Python)
  - SPARK-12783 - Dataset map serialization error
 
 When these are resolved I'll likely begin the release process.  If there
 are any other issues that we should wait for please contact me.
 
 Michael
>>> 
> 







Re: Spark 1.6.1

2016-02-02 Thread Michael Armbrust
I'm waiting for a few last fixes to be merged.  Hoping to cut an RC in the
next few days.

On Tue, Feb 2, 2016 at 10:43 AM, Mingyu Kim  wrote:

> Hi all,
>
> Is there an estimated timeline for the 1.6.1 release? Just wanted to check how
> the release is coming along. Thanks!
>
> Mingyu
>
> From: Romi Kuntsman 
> Date: Tuesday, February 2, 2016 at 3:16 AM
> To: Michael Armbrust 
> Cc: Hamel Kothari , Ted Yu ,
> "dev@spark.apache.org" 
> Subject: Re: Spark 1.6.1
>
> Hi Michael,
> What about the memory leak bug?
> https://issues.apache.org/jira/browse/SPARK-11293
> 
> Even after the memory rewrite in 1.6.0, it still happens in some cases.
> Will it be fixed for 1.6.1?
> Thanks,
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
> 
>
> On Mon, Feb 1, 2016 at 9:59 PM, Michael Armbrust 
> wrote:
>
>> We typically do not allow changes to the classpath in maintenance
>> releases.
>>
>> On Mon, Feb 1, 2016 at 8:16 AM, Hamel Kothari 
>> wrote:
>>
>>> I noticed that the Jackson dependency was bumped to 2.5 in master for
>>> something spark-streaming related. Is there any reason that this upgrade
>>> can't be included with 1.6.1?
>>>
>>> According to later comments on this thread:
>>> https://issues.apache.org/jira/browse/SPARK-8332
>>> 
>>> and my personal experience, using Spark with Jackson 2.5 hasn't caused
>>> any issues, but it does have some useful new features. It should be fully
>>> backwards compatible, according to the Jackson folks.
>>>
>>> On Mon, Feb 1, 2016 at 10:29 AM Ted Yu  wrote:
>>>
 SPARK-12624 has been resolved.
 According to Wenchen, SPARK-12783 is fixed in the 1.6.0 release.

 Are there other blockers for Spark 1.6.1 ?

 Thanks

 On Wed, Jan 13, 2016 at 5:39 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> Hey All,
>
> While I'm not aware of any critical issues with 1.6.0, there are
> several corner cases that users are hitting with the Dataset API that are
> fixed in branch-1.6.  As such I'm considering a 1.6.1 release.
>
> At the moment there are only two critical issues targeted for 1.6.1:
>  - SPARK-12624 - When schema is specified, we should treat undeclared
> fields as null (in Python)
>  - SPARK-12783 - Dataset map serialization error
>
> When these are resolved I'll likely begin the release process.  If
> there are any other issues that we should wait for please contact me.
>
> Michael
>


>>
>


Re: Spark 1.6.1

2016-02-02 Thread Michael Armbrust
>
> What about the memory leak bug?
> https://issues.apache.org/jira/browse/SPARK-11293
> Even after the memory rewrite in 1.6.0, it still happens in some cases.
> Will it be fixed for 1.6.1?
>

I think we have enough issues queued up that I would not hold the release
for that, but if there is a patch we should try and review it.  We can
always do 1.6.2 when more issues have been resolved.  Is this an actual
issue that is affecting a production workload or are we concerned about an
edge case?


Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
Hi Ben,

> My company uses Lambda to do simple data moving and processing using Python
> scripts. I can see that using Spark instead for the data processing would make
> it into a real production-level platform.

That may be true. Spark has first-class support for Python, which
should make your life easier if you do go this route. Once you've
fleshed out your ideas, I'm sure folks on this mailing list can provide
helpful guidance based on their real-world experience with Spark.

> Does this pave the way into replacing
> the need of a pre-instantiated cluster in AWS or bought hardware in a
> datacenter?

In a word, no. SAMBA is designed to extend, not replace, the traditional
Spark computation and deployment model. At its most basic, the
traditional Spark computation model distributes data and computations
across worker nodes in the cluster.

SAMBA simply allows some of those computations to be performed by AWS
Lambda rather than locally on your worker nodes. There are, I believe, a
number of potential benefits to using SAMBA in some circumstances:

1. It can help reduce some of the workload on your Spark cluster by
moving that workload onto AWS Lambda, an on-demand compute service.

2. It allows Spark applications written in Java or Scala to make use
of libraries and features offered by Python and JavaScript (Node.js)
today and, potentially, by additional languages in the future as AWS
Lambda language support evolves.

3. It provides a simple, clean API for integration with REST APIs,
which may benefit Spark applications that form part of a broader
data pipeline or solution.

> If so, then this would be a great efficiency gain and an easier entry
> point for Spark usage. I hope the vision is to get rid of all cluster
> management when using Spark.

You might find one of the hosted Spark platforms that handle cluster
management for you, such as Databricks or Amazon EMR, a good place to
start. At least in my experience, they got me up and running without
difficulty.

David




Re: Encrypting jobs submitted by the client

2016-02-02 Thread Ted Yu
For #1, a brief search turned up the following:

core/src/main/scala/org/apache/spark/SparkConf.scala:
 DeprecatedConfig("spark.rpc", "2.0", "Not used any more.")
core/src/main/scala/org/apache/spark/SparkConf.scala:
 "spark.rpc.numRetries" -> Seq(
core/src/main/scala/org/apache/spark/SparkConf.scala:
 "spark.rpc.retry.wait" -> Seq(
core/src/main/scala/org/apache/spark/SparkConf.scala:
 "spark.rpc.askTimeout" -> Seq(
core/src/main/scala/org/apache/spark/SparkConf.scala:
 "spark.rpc.lookupTimeout" -> Seq(
core/src/main/scala/org/apache/spark/SparkConf.scala:
 "spark.rpc.message.maxSize" -> Seq(
core/src/main/scala/org/apache/spark/SparkConf.scala:
 name.startsWith("spark.rpc") ||

There doesn't seem to be RPC protection for standalone mode.
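
For what it's worth, standalone deployments can still protect the data
plane with a shared secret plus SASL, and the file server with SSL. A
minimal sketch using the Spark 1.6 settings (the secret and keystore
values below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: shared-secret authentication, SASL encryption of the
    // block transfer service, and SSL for the HTTP file server.
    // The secret and keystore values are placeholders.
    val conf = new SparkConf()
      .setAppName("encryption-sketch")
      .set("spark.authenticate", "true")
      .set("spark.authenticate.secret", "not-a-real-secret")
      .set("spark.authenticate.enableSaslEncryption", "true")
      .set("spark.ssl.enabled", "true")
      .set("spark.ssl.keyStore", "/path/to/keystore.jks")
      .set("spark.ssl.keyStorePassword", "changeit")
    val sc = new SparkContext(conf)

If memory serves, the spark.ssl.* keys can also be scoped per channel
(spark.ssl.akka.*, spark.ssl.fs.*) in 1.6.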

On Tue, Feb 2, 2016 at 12:36 PM, eugene miretsky 
wrote:

> Thanks Steve!
> 1. spark-submit submitting the YARN app for launch?  You get that if
> you turn Hadoop IPC encryption on, by setting
> hadoop.rpc.protection=privacy across the cluster.
> > That's what I meant: is there something similar for standalone or
> Mesos?
>
> 2. communications between the Spark driver and executors. That can use HTTPS.
> > My understanding is that you can use HTTPS for the jar server on
> the driver, and SASL for block transfer. Is there anything else I'm missing?
>
> Cheers,
> Eugene
>
>
> On Tue, Feb 2, 2016 at 7:46 AM, Steve Loughran 
> wrote:
>
>>
>> > On 1 Feb 2016, at 20:48, eugene miretsky 
>> wrote:
>> >
>> > Spark supports client authentication via shared secret or kerberos (on
>> YARN). However, the job itself is sent unencrypted over the network.  Is
>> there a way to encrypt the jobs the client submits to cluster?
>>
>>
>> define submission?
>
>
>> 1. spark-submit submitting the YARN app for launch?  You get that if
>> you turn Hadoop IPC encryption on, by setting
>> hadoop.rpc.protection=privacy across the cluster.
>
> 2. communications between the Spark driver and executors. That can use HTTPS.
>>
>> > The rationale for this is very similar to encrypting the HTTP file
>> server traffic - jars may have sensitive data.
>> >
>> > Cheers,
>> > Eugene
>>
>>
>>
>>
>


Re: Encrypting jobs submitted by the client

2016-02-02 Thread eugene miretsky
Thanks Steve!
1. spark-submit submitting the YARN app for launch?  You get that if you
turn Hadoop IPC encryption on, by setting hadoop.rpc.protection=privacy
across the cluster.
> That's what I meant: is there something similar for standalone or Mesos?

2. communications between the Spark driver and executors. That can use HTTPS.
> My understanding is that you can use HTTPS for the jar server on the
driver, and SASL for block transfer. Is there anything else I'm missing?

Cheers,
Eugene


On Tue, Feb 2, 2016 at 7:46 AM, Steve Loughran 
wrote:

>
> > On 1 Feb 2016, at 20:48, eugene miretsky 
> wrote:
> >
> > Spark supports client authentication via shared secret or Kerberos (on
> YARN). However, the job itself is sent unencrypted over the network.  Is
> there a way to encrypt the jobs the client submits to the cluster?
>
>
> define submission?


> 1. spark-submit submitting the YARN app for launch?  You get that if
> you turn Hadoop IPC encryption on, by setting
> hadoop.rpc.protection=privacy across the cluster.

2. communications between the Spark driver and executors. That can use HTTPS.
>
> > The rationale for this is very similar to encrypting the HTTP file
> server traffic - jars may have sensitive data.
> >
> > Cheers,
> > Eugene
>
>
>
>


Spark saveAsHadoopFile stage fails with ExecutorLostfailure

2016-02-02 Thread Prabhu Joseph
Hi All,

   A Spark job stage containing saveAsHadoopFile fails with
ExecutorLostFailure whenever the executors run with more cores. The stage
is not memory intensive, and each executor has 20GB of memory. For example:

With 6 executors each having 6 cores, ExecutorLostFailure happens.

With 10 executors each having 2 cores, saveAsHadoopFile runs fine.

What could be the reason for the ExecutorLostFailure when the number of
cores per executor is high?



Error: ExecutorLostFailure (executor 3 lost)

16/02/02 04:22:40 WARN TaskSetManager: Lost task 1.3 in stage 15.0 (TID
1318, hdnprd-c01-r01-14):



Thanks,
Prabhu Joseph
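
For concreteness, the two sizings above expressed as configuration - a
sketch only, assuming YARN (where spark.executor.instances sets the
executor count):

    import org.apache.spark.SparkConf

    // Sizing reported to fail with ExecutorLostFailure:
    val manyCoresConf = new SparkConf()
      .set("spark.executor.instances", "6")
      .set("spark.executor.cores", "6")
      .set("spark.executor.memory", "20g")

    // Sizing reported to run fine:
    val fewCoresConf = new SparkConf()
      .set("spark.executor.instances", "10")
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "20g")

With 6 cores per executor, up to 6 concurrent tasks share one 20GB heap
(roughly 3.3GB per task), versus roughly 10GB per task in the 2-core
case - one plausible reason the wider executors die.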


Re: Spark 1.6.0 Streaming + Persistence Bug?

2016-02-02 Thread mkhaitman
Actually, disregard! I forgot that
spark.dynamicAllocation.cachedExecutorIdleTimeout defaults to infinity,
so lowering it should solve the problem :)
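
A minimal sketch of that setting (the 120s value is only an illustrative
choice; dynamic allocation also needs the external shuffle service):

    import org.apache.spark.SparkConf

    // Sketch: let executors that only hold cached blocks be reclaimed
    // after two minutes, instead of being held forever (the 1.6 default).
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "120s")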

Mark.







Re: Spark 1.6.1

2016-02-02 Thread Romi Kuntsman
Hi Michael,
What about the memory leak bug?
https://issues.apache.org/jira/browse/SPARK-11293
Even after the memory rewrite in 1.6.0, it still happens in some cases.
Will it be fixed for 1.6.1?
Thanks,

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Feb 1, 2016 at 9:59 PM, Michael Armbrust 
wrote:

> We typically do not allow changes to the classpath in maintenance releases.
>
> On Mon, Feb 1, 2016 at 8:16 AM, Hamel Kothari 
> wrote:
>
>> I noticed that the Jackson dependency was bumped to 2.5 in master for
>> something spark-streaming related. Is there any reason that this upgrade
>> can't be included with 1.6.1?
>>
>> According to later comments on this thread:
>> https://issues.apache.org/jira/browse/SPARK-8332 and my personal
>> experience, using Spark with Jackson 2.5 hasn't caused any issues, but
>> it does have some useful new features. It should be fully backwards
>> compatible, according to the Jackson folks.
>>
>> On Mon, Feb 1, 2016 at 10:29 AM Ted Yu  wrote:
>>
>>> SPARK-12624 has been resolved.
>>> According to Wenchen, SPARK-12783 is fixed in the 1.6.0 release.
>>>
>>> Are there other blockers for Spark 1.6.1 ?
>>>
>>> Thanks
>>>
>>> On Wed, Jan 13, 2016 at 5:39 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 Hey All,

 While I'm not aware of any critical issues with 1.6.0, there are
 several corner cases that users are hitting with the Dataset API that are
 fixed in branch-1.6.  As such I'm considering a 1.6.1 release.

 At the moment there are only two critical issues targeted for 1.6.1:
  - SPARK-12624 - When schema is specified, we should treat undeclared
 fields as null (in Python)
  - SPARK-12783 - Dataset map serialization error

 When these are resolved I'll likely begin the release process.  If
 there are any other issues that we should wait for please contact me.

 Michael

>>>
>>>
>


Spark 1.6.0 Streaming + Persistence Bug?

2016-02-02 Thread mkhaitman
Calling unpersist on an RDD in a Spark Streaming application does not
actually unpersist the blocks from memory and/or disk. After the RDD has
been processed in a foreachRDD call, I attempt to unpersist the RDD since
it is no longer useful to keep in memory/disk. This mainly causes a problem
with dynamic allocation: after a batch of data has been processed, we want
the idle executors to be destroyed (giving their cores and memory back to
the cluster while waiting for the next batch processing attempt to
occur).
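
A minimal sketch of the pattern in question (stream and process are
hypothetical placeholders):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.dstream.DStream

    // Persist each batch RDD, process it, then try to release the blocks
    // before the next batch arrives.
    def handleBatches(stream: DStream[String])(process: RDD[String] => Unit): Unit = {
      stream.foreachRDD { rdd =>
        rdd.persist(StorageLevel.MEMORY_AND_DISK)
        process(rdd)
        rdd.unpersist(blocking = true) // reportedly leaves the blocks cached
      }
    }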

Is this a known issue? It's not major in that it doesn't break anything...
it just prevents dynamic allocation from working as well as it could when
combined with streaming.

Thanks,
Mark.







Launch dev/run-tests on Windows

2016-02-02 Thread Wen Pei Yu

Hi All

Has anyone tried launching dev/run-tests on Windows? I'm facing some issues:

1. The `which` function doesn't support checking for files without an
extension, e.g. "java" vs "java.exe", "R" vs "R.exe".

2. The `run_cmd` function raises the error below; the major issue is that
some script files fail to run on Windows:
WindowsError: [Error 193] %1 is not a valid Win32 application

Thanks
Wenpei.


Re: Encrypting jobs submitted by the client

2016-02-02 Thread Steve Loughran

> On 1 Feb 2016, at 20:48, eugene miretsky  wrote:
> 
> Spark supports client authentication via shared secret or Kerberos (on YARN).
> However, the job itself is sent unencrypted over the network.  Is there a way
> to encrypt the jobs the client submits to the cluster?


define submission? 

1. spark-submit submitting the YARN app for launch?  You get that if you
turn Hadoop IPC encryption on, by setting hadoop.rpc.protection=privacy across
the cluster.
2. communications between the Spark driver and executors. That can use HTTPS.

> The rationale for this is very similar to encrypting the HTTP file server
> traffic - jars may have sensitive data.
> 
> Cheers,
> Eugene

