Re: welcoming Burak and Holden as committers

2017-01-24 Thread Chester Chen
Congratulations to both.

Holden, we need to catch up.


Chester Chen
Senior Manager – Data Science & Engineering
3000 Clearview Way
San Mateo, CA 94402


From: Felix Cheung <felixcheun...@hotmail.com>
Date: Tuesday, January 24, 2017 at 1:20 PM
To: Reynold Xin <r...@databricks.com>, "dev@spark.apache.org" 
<dev@spark.apache.org>
Cc: Holden Karau <holden.ka...@gmail.com>, Burak Yavuz <bu...@databricks.com>
Subject: Re: welcoming Burak and Holden as committers

Congrats and welcome!!


From: Reynold Xin <r...@databricks.com>
Sent: Tuesday, January 24, 2017 10:13:16 AM
To: dev@spark.apache.org
Cc: Burak Yavuz; Holden Karau
Subject: welcoming Burak and Holden as committers

Hi all,

Burak and Holden have recently been elected as Apache Spark committers.

Burak has been very active in a large number of areas in Spark, including 
linear algebra, stats/maths functions in DataFrames, Python/R APIs for 
DataFrames, dstream, and most recently Structured Streaming.

Holden has been a long-time Spark contributor and evangelist. She has written a 
few books on Spark and made frequent contributions to the Python API to 
improve its usability and performance.

Please join me in welcoming the two!




Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Chester Chen
I vote for Option 1.
  1) Since 2.0 is a major release, we are expecting some API changes.
  2) It helps long-term code base maintenance, with short-term pain on the Java
side.
  3) Not quite sure how large the code base using the Java DataFrame APIs is.





On Thu, Feb 25, 2016 at 3:23 PM, Reynold Xin  wrote:

> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't because we didn't want to break the
> pre-existing DataFrame API (e.g. map function should return Dataset, rather
> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
> and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects DataFrame, but the user is passing in Dataset[Row])
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less clean, and can be confusing when users pass in a Dataset[Row]
> into a function that expects a DataFrame
>
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway because that's a more familiar interface for R users outside of Spark.
>
>
>


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Chester Chen
For #1-3, the answer is likely no.

  Recently we upgraded to Spark 1.5.1, with CDH 5.3, CDH 5.4, HDP 2.2, and
others.

  We were using the CDH 5.3 client to talk to CDH 5.4. We were doing this to see
if we could support many different hadoop cluster versions without changing the
build. This was OK for yarn-cluster Spark 1.3.1, but we could not get Spark
1.5.1 started. Once we upgraded the client to CDH 5.4, everything worked.

  There are API changes between Apache Hadoop 2.4 and 2.6; I am not sure you can
mix and match them.

Chester


On Fri, Nov 20, 2015 at 1:59 PM, Sandy Ryza  wrote:

> To answer your fourth question from Cloudera's perspective, we would never
> support a customer running Spark 2.0 on a Hadoop version < 2.6.
>
> -Sandy
>
> On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin  wrote:
>
>> OK I'm not exactly asking for a vote here :)
>>
>> I don't think we should look at it from only a maintenance point of view --
>> because in that case the answer is clearly supporting as few versions as
>> possible (or just rm -rf spark source code and call it a day). It is a
>> tradeoff between the number of users impacted and the maintenance burden.
>>
>> So a few questions for those more familiar with Hadoop:
>>
>> 1. Can Hadoop 2.6 client read Hadoop 2.4 / 2.3?
>>
>> 2. If the answer to 1 is yes, are there known, major issues with backward
>> compatibility?
>>
>> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
>>
>> 4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below
>> stop? To what extent do you care about running Spark on older Hadoop
>> clusters?
>>
>>
>>
>> On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran 
>> wrote:
>>
>>>
>>> On 20 Nov 2015, at 14:28, ches...@alpinenow.com wrote:
>>>
>>> Assuming we have 1.6 and 1.7 releases, then spark 2.0 is about 9 months
>>> away.
>>>
>>> Customers will need to upgrade their Hadoop clusters to Apache 2.6 or
>>> later to leverage the new Spark 2.0 within a year. I think this is possible,
>>> as the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0 already.
>>> Companies will have enough time to upgrade their clusters.
>>>
>>> +1 for me as well
>>>
>>> Chester
>>>
>>>
>>> Now, if you are looking that far ahead, the other big issue is "when to
>>> retire Java 7 support"?
>>>
>>> That's a tough decision for all projects. Hadoop 3.x will be Java 8
>>> only, but nobody has committed the patch to the trunk codebase to force a
>>> Java 8 build, and most of *today's* hadoop clusters are Java 7. But as you
>>> can't even download a Java 7 JDK for the desktop from Oracle any more
>>> today, 2016 is the time to look at the language support and decide what
>>> the baseline version should be.
>>>
>>> Commentary from Twitter here: as they point out, it's not just the
>>> server farm that matters, it's all the apps that talk to it.
>>>
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3ccab7mwte+kefcxsr6n46-ztcs19ed7cwc9vobtr1jqewdkye...@mail.gmail.com%3E
>>>
>>> -Steve
>>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-06 Thread Chester Chen
+1
Tested against CDH 5.4.2 with Hadoop 2.6.0 using yesterday's code,
built locally.

Regression tests were run in Yarn cluster mode against a few internal ML
algorithms (logistic regression, linear regression, random forest, and
statistics summary) as well as MLlib KMeans. All seem to work fine.

Chester


On Tue, Nov 3, 2015 at 3:22 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.2
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 59 known issues in Spark 1.5.1, listed here:
> http://s.apache.org/spark-1.5.2
>
> The tag to be voted on is v1.5.2-rc2:
> https://github.com/apache/spark/releases/tag/v1.5.2-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> - as version 1.5.2-rc2:
> https://repository.apache.org/content/repositories/orgapachespark-1153
> - as version 1.5.2:
> https://repository.apache.org/content/repositories/orgapachespark-1152
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> A -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present in 1.5.1 will not block this release.
>
>
>


Re: Possible bug on Spark Yarn Client (1.5.1) during kerberos mode ?

2015-10-22 Thread Chester Chen
Steven
  You summarized it mostly correctly, but there are a couple of points I want to
emphasize.

 Not every cluster has the Hive service enabled, so the Yarn Client
shouldn't try to get the Hive delegation token just because security mode
is enabled.

 The Yarn Client code can check whether the service is enabled or not
(possibly by checking whether the hive metastore URI or other hive-site.xml
elements are present). If the Hive service is not enabled, then we don't need
to get the Hive delegation token, and hence we don't get the exception.
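A minimal sketch of the kind of check I mean; the hive.metastore.uris key is the
standard hive-site.xml property, but the helper name and the placement comment
are illustrative only, not an actual patch:

    import org.apache.hadoop.conf.Configuration

    // Sketch only: treat Hive as "enabled" when a metastore URI is configured.
    def hiveMetastoreConfigured(conf: Configuration): Boolean =
      Option(conf.getTrimmed("hive.metastore.uris")).exists(_.nonEmpty)

    // Illustrative placement inside the yarn Client (commented out, since the
    // real call site and names may differ):
    // if (UserGroupInformation.isSecurityEnabled && hiveMetastoreConfigured(hadoopConf)) {
    //   obtainTokenForHiveMetastore(hadoopConf, credentials)
    // }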

 If we still try to get the Hive delegation token regardless of whether the Hive
service is enabled (like the current code does now), then the code should still
launch the yarn container and the Spark job, as the user could simply run
a job against HDFS without accessing Hive. Of course, accessing Hive would then fail.

 The third point is that I am not sure why org.spark-project.hive's hive-exec
and org.apache.hive's hive-exec behave differently for the same
method.

Chester









On Thu, Oct 22, 2015 at 10:18 AM, Charmee Patel <charm...@gmail.com> wrote:

> A similar issue occurs when interacting with Hive secured by Sentry.
> https://issues.apache.org/jira/browse/SPARK-9042
>
> By changing how Hive Context instance is created, this issue might also be
> resolved.
>
> On Thu, Oct 22, 2015 at 11:33 AM Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> On 22 Oct 2015, at 08:25, Chester Chen <ches...@alpinenow.com> wrote:
>>
>> Doug
>>
>>    We are not trying to compile against a different version of Hive. The
>> 1.2.1.spark hive-exec is specified in the Spark 1.5.2 POM file. We are moving
>> from Spark 1.3.1 to 1.5.1, and are simply trying to supply the needed
>> dependency. The rest of the application (besides Spark) simply uses Hive 0.13.1.
>>
>>    Yes, we are using the yarn Client directly; there are many functions we
>> need and have modified that are not provided by the yarn Client. The spark launcher
>> in its current form does not satisfy our requirements (at least the last time I
>> looked at it); there is a discussion thread from several months ago.
>>
>> From Spark 1.x to 1.3.1, we forked the yarn Client to achieve these
>> goals (yarn listener callbacks, killApplications, yarn capacity callbacks,
>> etc.). In the current integration for 1.5.1, to avoid forking Spark, we
>> simply subclass the yarn Client and override a few methods. But we lost the
>> resource capacity callback and estimation by doing this.
>>
>>    This is a bit off the original topic.
>>
>> I still think there is a bug related to the Spark yarn Client in the case
>> of Kerberos + the spark hive-exec dependency.
>>
>> Chester
>>
>>
>> I think I understand what's being implied here.
>>
>>
>>1. In a secure cluster, a spark app needs a hive delegation token  to
>>talk to hive
>>2. Spark yarn Client (org.apache.spark.deploy.yarn.Client) uses
>>reflection to get the delegation token
>>3. The reflection doesn't work, a CFNE exception is logged
>>4. The app should still launch, but it'll be without a hive token ,
>>so attempting to work with Hive will fail.
>>
>> I haven't seen this, because while I do test runs against a kerberos
>> cluster, I wasn't talking to hive from the deployed app.
>>
>>
>> It sounds like this workaround works because the hive RPC protocol is
>> compatible enough with 0.13 that a 0.13 client can ask hive for the token,
>> though then your remote CP is stuck on 0.13
>>
>> Looking at the hive class, the metastore has now made the hive
>> constructor private and gone to a factory method (public static Hive
>> get(HiveConf c) throws HiveException) to get an instance. The reflection
>> code would need to be updated.
>>
>> I'll file a bug with my name next to it
>>
>>
>>
>>
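For what it's worth, a rough sketch of what the updated reflection Steve describes
might look like, using the Hive.get(HiveConf) factory method; this is only an
illustration (class loading and error handling simplified), not an actual patch:

  // Sketch only: reflect against the Hive.get(HiveConf) factory method
  // instead of the old no-arg Hive.get().
  val loader        = Thread.currentThread().getContextClassLoader
  val hiveConfClass = loader.loadClass("org.apache.hadoop.hive.conf.HiveConf")
  val hiveClass     = loader.loadClass("org.apache.hadoop.hive.ql.metadata.Hive")
  val hiveConf      = hiveConfClass.getConstructor().newInstance()
  val hive = hiveClass.getMethod("get", hiveConfClass).invoke(null, hiveConf.asInstanceOf[Object])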


Re: Possible bug on Spark Yarn Client (1.5.1) during kerberos mode ?

2015-10-22 Thread Chester Chen
Thanks Steve,
   I liked the slides on Kerberos. I have enough scars from Kerberos from
trying to integrate it with Pig, MapRed, Hive JDBC, HCatalog, and Spark,
etc. I am still having trouble making impersonation work for HCatalog. I
might send you an offline email to ask for some pointers.

  Thanks for the ticket.

Chester





On Thu, Oct 22, 2015 at 1:15 PM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 22 Oct 2015, at 19:32, Chester Chen <ches...@alpinenow.com> wrote:
>
> Steven
>   You summarized it mostly correctly, but there are a couple of points I want
> to emphasize.
>
>  Not every cluster has the Hive service enabled, so the Yarn Client
> shouldn't try to get the Hive delegation token just because security mode
> is enabled.
>
>
> I agree, but it shouldn't be failing with a stack trace. Log, yes; fail,
> no.
>
>
>  The Yarn Client code can check whether the service is enabled or not
> (possibly by checking whether the hive metastore URI or other hive-site.xml
> elements are present). If the Hive service is not enabled, then we don't need
> to get the Hive delegation token, and hence we don't get the exception.
>
>  If we still try to get the Hive delegation token regardless of whether the
> Hive service is enabled (like the current code does now), then the code should
> still launch the yarn container and the Spark job, as the user could simply run
> a job against HDFS without accessing Hive. Of course, accessing Hive would then fail.
>
>
> That's exactly what should be happening: the token is only needed if the
> code tries to talk to hive. The problem is the YARN client doesn't know
> whether that's the case, so it tries every time. It shouldn't be failing
> though.
>
> Created an issue to cover this; I'll see what reflection it takes. I'll
> also pull the code out into a method that can be tested standalone: we
> shouldn't have to wait until a run on UGI.isSecure() mode.
>
> https://issues.apache.org/jira/browse/SPARK-11265
>
>
> Meanwhile, for the curious, these slides include an animation of what goes
> on when a YARN app is launched in a secure cluster, to help explain why
> things seem a bit complicated
>
> http://people.apache.org/~stevel/kerberos/2015-09-kerberos-the-madness.pptx
>
>  The third point is that I am not sure why org.spark-project.hive's hive-exec
> and org.apache.hive's hive-exec behave differently for the same
> method.
>
> Chester
>
>
>
>
>
>
>
>
>
> On Thu, Oct 22, 2015 at 10:18 AM, Charmee Patel <charm...@gmail.com>
> wrote:
>
>> A similar issue occurs when interacting with Hive secured by Sentry.
>> https://issues.apache.org/jira/browse/SPARK-9042
>>
>> By changing how Hive Context instance is created, this issue might also
>> be resolved.
>>
>> On Thu, Oct 22, 2015 at 11:33 AM Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> On 22 Oct 2015, at 08:25, Chester Chen <ches...@alpinenow.com> wrote:
>>>
>>> Doug
>>>
>>>We are not trying to compiling against different version of hive. The
>>> 1.2.1.spark hive-exec is specified on spark 1.5.2 Pom file. We are moving
>>> from spark 1.3.1 to 1.5.1. Simply trying to supply the needed
>>> dependency. The rest of application (besides spark) simply uses hive 0.13.1.
>>>
>>>Yes we are using yarn client directly, there are many functions we
>>> need and modified are not provided in yarn client. The spark launcher in
>>> the current form does not satisfy our requirements (at least last time I
>>> see it) there is a discussion thread about several month ago.
>>>
>>> From spark 1.x  to 1.3.1, we fork the yarn client to achieve these
>>> goals ( yarn listener call backs, killApplications, yarn capacities call
>>> back etc). In current integration for 1.5.1, to avoid forking the spark, we
>>> simply subclass the yarn client overwrites a few methods. But we lost
>>> resource capacity call back and estimation by doing this.
>>>
>>>This is bit off the original topic.
>>>
>>> I still think there is a bug related to the spark yarn client in
>>> case of Kerberos + spark hive-exec dependency.
>>>
>>> Chester
>>>
>>>
>>> I think I understand what's being implied here.
>>>
>>>
>>>1. In a secure cluster, a spark app needs a hive delegation token
>>> to talk to hive
>>>2. Spark yarn Client (org.apache.spark.deploy.yarn.Client) uses
>>>reflection to get the delegation token
>>>3. The reflection doesn't work, a CFNE exception is logged
>>>

Possible bug on Spark Yarn Client (1.5.1) during kerberos mode ?

2015-10-21 Thread Chester Chen
All,

just to see if this happens to others as well.

  This is tested against

   spark 1.5.1 (branch 1.5, labeled 1.5.2-SNAPSHOT, with a commit from Tue
Oct 6, 84f510c4fa06e43bd35e2dc8e1008d0590cbe266)

   Spark deployment mode: Spark-Cluster

   Notice that if we enable Kerberos mode, the spark yarn client fails with
the following:

Could not initialize class org.apache.hadoop.hive.ql.metadata.Hive
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hive.ql.metadata.Hive
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1252)
at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)


Diving into the Yarn Client.scala code and testing against different dependencies,
I noticed the following: if kerberos mode is enabled,
Client.obtainTokenForHiveMetastore()
will try to use Scala reflection to get Hive and HiveConf and to call methods
on them.


  val hiveClass =
    mirror.classLoader.loadClass("org.apache.hadoop.hive.ql.metadata.Hive")
  val hive = hiveClass.getMethod("get").invoke(null)

  val hiveConf = hiveClass.getMethod("getConf").invoke(hive)
  val hiveConfClass =
    mirror.classLoader.loadClass("org.apache.hadoop.hive.conf.HiveConf")

  val hiveConfGet = (param: String) => Option(hiveConfClass
    .getMethod("get", classOf[java.lang.String])
    .invoke(hiveConf, param))


   If the "org.spark-project.hive" % "hive-exec" % "1.2.1.spark" dependency is
used, then you will get the above exception. But if we use

   "org.apache.hive" % "hive-exec" % "0.13.1-cdh5.2.0"

 the above method will not throw an exception.


  Here are some questions and comments:

0) Is this a bug?

1) Why does the spark-project hive-exec behave differently? I understand
that it has fewer dependencies,

   but I would expect it to be functionally the same.

2) Where can I find the source code for the spark-project hive-exec?

3) Regarding the method obtainTokenForHiveMetastore():

   I would assume that the method would first check whether the
hive-metastore URI is present before

   trying to get the hive metastore tokens; it seems to invoke the
reflection regardless of whether the hive service in the cluster is enabled or
not.

4) I noticed that obtainTokenForHBase() in the same class (Client.scala) catches

   case e: java.lang.NoClassDefFoundError => logDebug("HBase Class not
found: " + e)

   and just ignores the exception (logs at debug level),

   but obtainTokenForHiveMetastore() does not catch the
NoClassDefFoundError exception; I guess this is the problem.

private def obtainTokenForHiveMetastore(conf: Configuration,
    credentials: Credentials) {

    // rest of code

  } catch {
    case e: java.lang.NoSuchMethodException => { logInfo("Hive Method not found " + e); return }
    case e: java.lang.ClassNotFoundException => { logInfo("Hive Class not found " + e); return }
    case e: Exception => { logError("Unexpected Exception " + e)
      throw new RuntimeException("Unexpected exception", e)
    }
  }
}
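A sketch of the kind of change point 4 suggests (not an actual patch): treat a
missing or un-initializable Hive class as non-fatal, the same way
obtainTokenForHBase does. The helper name and println logging below are
illustrative only:

  // Sketch only: wrap the token fetch so that missing Hive classes are logged, not fatal.
  def obtainTokenForHiveMetastoreSafely(fetchHiveToken: () => Unit): Unit =
    try {
      fetchHiveToken()
    } catch {
      // A missing or uninitializable Hive class just means "no Hive token",
      // not a failed application launch.
      case e: java.lang.NoClassDefFoundError   => println("Hive class not initialized: " + e)
      case e: java.lang.ClassNotFoundException => println("Hive class not found: " + e)
      case e: java.lang.NoSuchMethodException  => println("Hive method not found: " + e)
      // Anything else is still unexpected and should fail loudly.
      case e: Exception => throw new RuntimeException("Unexpected exception", e)
    }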


thanks


Chester


Re: Possible bug on Spark Yarn Client (1.5.1) during kerberos mode ?

2015-10-21 Thread Chester Chen
Doug
  Thanks for responding.
 >> I think Spark just needs to be compiled against 1.2.1

   Can you elaborate on this, or point me to the specific command you are
referring to?

   In our build.scala, I was including the following:

"org.spark-project.hive" % "hive-exec" % "1.2.1.spark" intransitive()

   I am not sure how the Spark compilation is directly related to this,
please explain.

   When we submit the spark job, we call the Spark Yarn Client.scala
directly (not using spark-submit).
   The client side does not depend on the spark-assembly jar (which is in the
hadoop cluster). The job submission actually fails on the client side.

   Currently we get around this by replacing the spark hive-exec with the
apache hive-exec.


Chester





On Wed, Oct 21, 2015 at 5:27 PM, Doug Balog <d...@balog.net> wrote:

> See comments below.
>
> > On Oct 21, 2015, at 5:33 PM, Chester Chen <ches...@alpinenow.com> wrote:
> >
> > All,
> >
> > just to see if this happens to other as well.
> >
> >   This is tested against the
> >
> >spark 1.5.1 ( branch 1.5  with label 1.5.2-SNAPSHOT with commit on
> Tue Oct 6, 84f510c4fa06e43bd35e2dc8e1008d0590cbe266)
> >
> >Spark deployment mode : Spark-Cluster
> >
> >Notice that if we enable Kerberos mode, the spark yarn client fails
> with the following:
> >
> > Could not initialize class org.apache.hadoop.hive.ql.metadata.Hive
> > java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.hadoop.hive.ql.metadata.Hive
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at
> org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1252)
> > at
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
> > at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
> > at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> > at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
> >
> >
> > Diving in Yarn Client.scala code and tested against different
> dependencies and notice the followings:  if  the kerberos mode is enabled,
> Client.obtainTokenForHiveMetastore() will try to use scala reflection to
> get Hive and HiveConf and method on these method.
> >
> >   val hiveClass =
> mirror.classLoader.loadClass("org.apache.hadoop.hive.ql.metadata.Hive")
> >   val hive = hiveClass.getMethod("get").invoke(null)
> >
> >   val hiveConf = hiveClass.getMethod("getConf").invoke(hive)
> >   val hiveConfClass =
> mirror.classLoader.loadClass("org.apache.hadoop.hive.conf.HiveConf")
> >
> >   val hiveConfGet = (param: String) => Option(hiveConfClass
> > .getMethod("get", classOf[java.lang.String])
> > .invoke(hiveConf, param))
> >
> >If the "org.spark-project.hive" % "hive-exec" % "1.2.1.spark" is
> used, then you will get above exception. But if we use the
> >"org.apache.hive" % "hive-exec" "0.13.1-cdh5.2.0"
> >  The above method will not throw exception.
> >
> >   Here some questions and comments
> > 0) is this a bug ?
>
> I’m not an expert on this, but I think this might not be a bug.
> The Hive integration was redone for 1.5.0, see
> https://issues.apache.org/jira/browse/SPARK-6906
> and I think Spark just needs to be compiled against 1.2.1
>
>
> >
> > 1) Why spark-hive hive-exec behave differently ? I understand spark-hive
> hive-exec has less dependencies
> >but I would expect it functionally the same
>
> I don’t know.
>
> > 2) Where I can find the source code for spark-hive hive-exec ?
>
> I don’t know.
>
> >
> > 3) regarding the method obtainTokenForHiveMetastore(),
> >I would assume that the method will first check if the hive-metastore
> uri is present before
> >trying to get the hive metastore tokens, it seems to invoke the
> reflection regardless the hive service in the cluster is enabled or not.
>
> Checking to see if the hive-metastore.uri is present before trying to get
> a delegation token would be an improvement.
> Also checking to see if we are running in cluster mode would b

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread Chester Chen
Thanks Sean, that makes it clear.

On Tue, Sep 1, 2015 at 7:17 AM, Sean Owen <so...@cloudera.com> wrote:

> Any 1.5 RC comes from the latest state of the 1.5 branch at some point
> in time. The next RC will be cut from whatever the latest commit is.
> You can see the tags in git for the specific commits for each RC.
> There's no such thing as "1.5.1 SNAPSHOT" commits, just commits to
> branch 1.5. I would ignore the "SNAPSHOT" version for your purpose.
>
> You can always build from the exact commit that an RC did by looking
> at tags. There is no 1.5.0 yet so you can't build that, but once it's
> released, you would be able to find its tag as well. You can always
> build the latest 1.5.x branch by building from HEAD of that branch.
>
> On Tue, Sep 1, 2015 at 3:13 PM,  <ches...@alpinenow.com> wrote:
> > Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I
> assume it would be cut from the 1.5 branch; doesn't that bring in the 1.5.1
> snapshot code?
> >
> > The reason I am asking these questions is that I would like to know, if I
> want to build 1.5.0 myself, which commit I should use.
> >
> > Sent from my iPad
> >
> >> On Sep 1, 2015, at 6:57 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
> >> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
> >> released. In practice I think it's simpler to follow the defaults of
> >> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
> >> any 1.5.0-rc is released. It doesn't affect later RCs. This has
> >> nothing to do with what commits go into 1.5.0; it's an ignorable
> >> detail of the version in POMs in the source tree, which don't mean
> >> much anyway as the source tree itself is not a released version.
> >>
> >>> On Tue, Sep 1, 2015 at 2:48 PM,  <ches...@alpinenow.com> wrote:
> >>> Sorry, I still don't follow. I assume the release would be built from
> 1.5.0 before moving to 1.5.1. Are you saying the 1.5.0 rc3 could be built from
> the 1.5.1 snapshot during the release? Or would 1.5.0 rc3 be built from the last
> commit of 1.5.0 (before the change to the 1.5.1 snapshot)?
> >>>
> >>>
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On Sep 1, 2015, at 1:52 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>>
> >>>> That's correct for the 1.5 branch, right? this doesn't mean that the
> >>>> next RC would have this value. You choose the release version during
> >>>> the release process.
> >>>>
> >>>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <ches...@alpinenow.com>
> wrote:
> >>>>> Seems that Github branch-1.5 already changing the version to
> 1.5.1-SNAPSHOT,
> >>>>>
> >>>>> I am a bit confused are we still on 1.5.0 RC3 or we are in 1.5.1 ?
> >>>>>
> >>>>> Chester
> >>>>>
> >>>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >>>>>>
> >>>>>> I'm going to -1 the release myself since the issue @yhuai
> identified is
> >>>>>> pretty serious. It basically OOMs the driver for reading any files
> with a
> >>>>>> large number of partitions. Looks like the patch for that has
> already been
> >>>>>> merged.
> >>>>>>
> >>>>>> I'm going to cut rc3 momentarily.
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <
> sandy.r...@cloudera.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> +1 (non-binding)
> >>>>>>> built from source and ran some jobs against YARN
> >>>>>>>
> >>>>>>> -Sandy
> >>>>>>>
> >>>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <
> vaquar.k...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> +1 (1.5.0 RC2)Compiled on Windows with YARN.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Vaquar khan
> >>>>>>>>
> >>>>>>>> +1 (non-binding, of course)
> >>>>>>>>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Chester Chen
It seems that the GitHub branch-1.5 has already changed the version to
1.5.1-SNAPSHOT.

I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?

Chester

On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin  wrote:

> I'm going to -1 the release myself since the issue @yhuai identified is
> pretty serious. It basically OOMs the driver for reading any files with a
> large number of partitions. Looks like the patch for that has already been
> merged.
>
> I'm going to cut rc3 momentarily.
>
>
> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza 
> wrote:
>
>> +1 (non-binding)
>> built from source and ran some jobs against YARN
>>
>> -Sandy
>>
>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan 
>> wrote:
>>
>>>
>>> +1 (1.5.0 RC2) Compiled on Windows with YARN.
>>>
>>> Regards,
>>> Vaquar khan
>>> +1 (non-binding, of course)
>>>
>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>> 2. Tested pyspark, mllib
>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>> 2.2. Linear/Ridge/Lasso Regression OK
>>> 2.3. Decision Tree, Naive Bayes OK
>>> 2.4. KMeans OK
>>>Center And Scale OK
>>> 2.5. RDD operations OK
>>>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>Model evaluation/optimization (rank, numIter, lambda) with
>>> itertools OK
>>> 3. Scala - MLlib
>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>> 3.2. LinearRegressionWithSGD OK
>>> 3.3. Decision Tree OK
>>> 3.4. KMeans OK
>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>> 3.6. saveAsParquetFile OK
>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>> registerTempTable, sql OK
>>> 3.8. result = sqlContext.sql("SELECT
>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>> 4.0. Spark SQL from Python OK
>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>> OK
>>> 5.0. Packages
>>> 5.1. com.databricks.spark.csv - read/write OK
>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>> 6.0. DataFrames
>>> 6.1. cast,dtypes OK
>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>> 6.3. joins,sql,set operations,udf OK
>>>
>>> Cheers
>>> 
>>>
>>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and
 passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc2:

 https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release (published as 1.5.0-rc2) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1141/

 The staging repository for this release (published as 1.5.0) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1140/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/


 ===
 How can I help test this release?
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.


 
 What justifies a -1 vote for this release?
 
 This vote is happening towards the end of the 1.5 QA period, so -1
 votes should only occur for significant regressions from 1.4. Bugs already
 present in 1.4, minor regressions, or bugs related to new features will not
 block this release.


 ===
 What should happen to JIRA tickets still targeting 1.5.0?
 ===
 1. It is OK for documentation patches to target 1.5.0 and still go into
 branch-1.5, since documentations will be packaged 

Re: High Availability of Spark Driver

2015-08-28 Thread Chester Chen
Ashish and Steve
 I am also working on long-running Yarn Spark jobs, and have just started to
focus on failure recovery. This discussion thread is really helpful.

Chester

On Fri, Aug 28, 2015 at 12:53 AM, Ashish Rawat ashish.ra...@guavus.com
wrote:

 Thanks Steve. I had not spent many brain cycles on analysing the Yarn
 pieces, your insights would be extremely useful.

 I was also considering Zookeeper and Yarn registry for persisting state
 and sharing information. But for a basic POC, I used the file system and
 was able to

1. Preserve Executors.
2. Reconnect Executors back to Driver by storing the Executor
endpoints info into a local file system. When driver restarts, use this
info to send update driver message to executor endpoints. Executors can
then update all of their Akka endpoints and reconnect.
3. Reregister Block Manager and report back blocks. This utilises most
of Spark’s existing code, I only had to update the BlockManagerMaster
endpoint in executors.

 Surprisingly, Spark components took the restart in a much better way than
 I had anticipated and readily accepted new work :-)

 I am still figuring out other complexities around preserving RDD lineage
 and computation. From my initial analysis, preserving the whole computation
 might be complex and may not be required. Perhaps, the lineage of only the
 cached RDDs can be preserved to recover any lost blocks.

 I am definitely not underestimating the effort, both within Spark and
 around interfacing with Yarn, but just trying to emphasise that a single
 node leading to full application restart, does not seem right for a long
 running service. Thoughts?

 Regards,
 Ashish

 From: Steve Loughran ste...@hortonworks.com
 Date: Thursday, 27 August 2015 4:19 pm
 To: Ashish Rawat ashish.ra...@guavus.com
 Cc: dev@spark.apache.org dev@spark.apache.org
 Subject: Re: High Availability of Spark Driver


 On 27 Aug 2015, at 08:42, Ashish Rawat ashish.ra...@guavus.com wrote:

 Hi Patrick,

 As discussed in another thread, we are looking for a solution to the
 problem of lost state on Spark Driver failure. Can you please share Spark’s
 long term strategy for resolving this problem.

 -- Original Mail Content Below --

 We have come across the problem of Spark Applications (on Yarn) requiring
 a restart in case of Spark Driver (or application master) going down. This
 is hugely inconvenient for long running applications which are maintaining
 a big state in memory. The repopulation of state in itself may require a
 downtime of many minutes, which is not acceptable for most live systems.

 As you may have noticed, the Yarn community has acknowledged long
 running services as an important class of use cases, and has thus identified
 and removed problems in working with long running services in Yarn.

 http://hortonworks.com/blog/support-long-running-services-hadoop-yarn-clusters/


 Yeah, I spent a lot of time on that, or at least using the features, in
 other work under YARN-896, summarised in
 http://www.slideshare.net/steve_l/yarn-services

 It would be great if Spark, which is the most important processing engine
 on Yarn,


 If you look at the CPU-hours going into the big hadoop clusters, it's
 actually MR work and things behind Hive. But these apps don't attempt HA.

 Why not? It requires whatever maintains the overall app status (spark: the
 driver) to persist that state in a way where it can be rebuilt. A restarted
 AM with the retain containers feature turned on gets nothing back from
 YARN except the list of previously allocated containers, and is left to sort
 itself out.

 also figures out issues in working with long-running Spark applications
 and publishes recommendations or makes framework changes for removing those.
 The need to keep the application running in case of Driver and Application
 Master failure, seems to be an important requirement from this perspective.
 The two most compelling use cases being:

1. Huge state of historical data in *Spark Streaming*, required for
stream processing
2. Very large cached tables in *Spark SQL* (very close to our use case
where we periodically cache RDDs and query using Spark SQL)



 Generally spark streaming is viewed as the big need here, but yes,
 long-lived cached data matters.

 Bear in mind that before Spark 1.5, you can't run any spark YARN app for
 longer than the expiry time of your delegation tokens, so in a secure
 cluster you have a limit of a couple of days anyway. Unless your cluster is
 particularly unreliable, AM failures are usually pretty unlikely in such a
 short timespan. Container failure is more likely as 1) you have more of
 them and 2) if you have pre-emption turned on in the scheduler or are
 pushing the work out to a label containing spot VMs, they will fail.

 In our analysis, for both of these use cases, a working HA solution can be
 built by

1. Preserving the state of executors (not killing them on driver
failures)

 

Re: Welcoming some new committers

2015-06-17 Thread Chester Chen
Congratulations to All.

DB and Sandy, great work!


On Wed, Jun 17, 2015 at 3:12 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey all,

 Over the past 1.5 months we added a number of new committers to the
 project, and I wanted to welcome them now that all of their respective
 forms, accounts, etc are in. Join me in welcoming the following new
 committers:

 - Davies Liu
 - DB Tsai
 - Kousuke Saruta
 - Sandy Ryza
 - Yin Huai

 Looking forward to more great contributions from all of these folks.

 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Change for submitting to yarn in 1.3.1

2015-05-25 Thread Chester Chen
I put the design requirements and description in the commit comment, so I
will close the PR. Please refer to the following commit:

https://github.com/AlpineNow/spark/commit/5b336bbfe92eabca7f4c20e5d49e51bb3721da4d



On Mon, May 25, 2015 at 3:21 PM, Chester Chen ches...@alpinenow.com wrote:

 All,
  I have created a PR just for the purpose of helping document the use
 case, requirements, and design. As it is unlikely to get merged in, it is
 only used to illustrate the problems we are trying to solve and the
 approaches we took.

https://github.com/apache/spark/pull/6398


 Hope this helps the discussion

 Chester






 On Fri, May 22, 2015 at 10:55 AM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  Thanks.  We'll look at it.
 I've sent another reply addressing some of your other comments.
 Kevin


 On 05/22/2015 10:27 AM, Marcelo Vanzin wrote:

  Hi Kevin,

  One thing that might help you in the meantime, while we work on a better
 interface for all this...

 On Thu, May 21, 2015 at 5:21 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

 Making *yarn.Client* private has prevented us from moving from Spark
 1.0.x to Spark 1.2 or 1.3 despite many alluring new features.


  Since you're not afraid to use private APIs, and to avoid using ugly
 reflection hacks, you could abuse the fact that private things in Scala are
 not really private most of the time. For example (trimmed to show just
 stuff that might be interesting to you):

 # javap -classpath
 /opt/cloudera/parcels/CDH/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
 org.apache.spark.deploy.yarn.Client
 Compiled from Client.scala
 public class org.apache.spark.deploy.yarn.Client implements
 org.apache.spark.Logging {
   ...
   public org.apache.hadoop.yarn.client.api.YarnClient
 org$apache$spark$deploy$yarn$Client$$yarnClient();
   public void run();
   public
 org.apache.spark.deploy.yarn.Client(org.apache.spark.deploy.yarn.ClientArguments,
 org.apache.hadoop.conf.Configuration, org.apache.spark.SparkConf);
   public
 org.apache.spark.deploy.yarn.Client(org.apache.spark.deploy.yarn.ClientArguments,
 org.apache.spark.SparkConf);
   public
 org.apache.spark.deploy.yarn.Client(org.apache.spark.deploy.yarn.ClientArguments);
 }

  So it should be easy to write a small Java wrapper around this. No less
 hacky than relying on the private-but-public code of before.

 --
 Marcelo






Re: Change for submitting to yarn in 1.3.1

2015-05-25 Thread Chester Chen
All,
 I have created a PR just for the purpose of helping document the use
case, requirements, and design. As it is unlikely to get merged in, it is
only used to illustrate the problems we are trying to solve and the
approaches we took.

   https://github.com/apache/spark/pull/6398


Hope this helps the discussion

Chester






On Fri, May 22, 2015 at 10:55 AM, Kevin Markey kevin.mar...@oracle.com
wrote:

  Thanks.  We'll look at it.
 I've sent another reply addressing some of your other comments.
 Kevin


 On 05/22/2015 10:27 AM, Marcelo Vanzin wrote:

  Hi Kevin,

  One thing that might help you in the meantime, while we work on a better
 interface for all this...

 On Thu, May 21, 2015 at 5:21 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

 Making *yarn.Client* private has prevented us from moving from Spark
 1.0.x to Spark 1.2 or 1.3 despite many alluring new features.


  Since you're not afraid to use private APIs, and to avoid using ugly
 reflection hacks, you could abuse the fact that private things in Scala are
 not really private most of the time. For example (trimmed to show just
 stuff that might be interesting to you):

 # javap -classpath
 /opt/cloudera/parcels/CDH/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
 org.apache.spark.deploy.yarn.Client
 Compiled from Client.scala
 public class org.apache.spark.deploy.yarn.Client implements
 org.apache.spark.Logging {
   ...
   public org.apache.hadoop.yarn.client.api.YarnClient
 org$apache$spark$deploy$yarn$Client$$yarnClient();
   public void run();
   public
 org.apache.spark.deploy.yarn.Client(org.apache.spark.deploy.yarn.ClientArguments,
 org.apache.hadoop.conf.Configuration, org.apache.spark.SparkConf);
   public
 org.apache.spark.deploy.yarn.Client(org.apache.spark.deploy.yarn.ClientArguments,
 org.apache.spark.SparkConf);
   public
 org.apache.spark.deploy.yarn.Client(org.apache.spark.deploy.yarn.ClientArguments);
 }

  So it should be easy to write a small Java wrapper around this. No less
 hacky than relying on the private-but-public code of before.

 --
 Marcelo





Re: Submit Kill Spark Application program programmatically from another application

2015-05-03 Thread Chester Chen
Sounds like you are in Yarn-Cluster mode.

I created a JIRA SPARK-3913
https://issues.apache.org/jira/browse/SPARK-3913 and PR
https://github.com/apache/spark/pull/2786

Is this what you are looking for?




Chester

On Sat, May 2, 2015 at 10:32 PM, Yijie Shen henry.yijies...@gmail.com
wrote:

 Hi,

 I’ve posted this problem in user@spark but find no reply, therefore moved
 to dev@spark, sorry for duplication.

 I am wondering if it is possible to submit, monitor, and kill spark
 applications from another service.

 I have written a service like this:

 parse user commands
 translate them into understandable arguments to an already prepared
 Spark-SQL application
 submit the application along with arguments to the Spark cluster
 using spark-submit from ProcessBuilder
 run the generated application's driver in cluster mode.
 The above 4 steps have been finished, but I have difficulties with these two:

 Query the application's status, for example, the percentage
 completion.
 Kill queries accordingly
 What I found in the spark standalone documentation suggests killing an
 application using:

 ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver
 ID>

 And I should find
 the driver ID through the standalone Master web UI at
 http://<master url>:8080.
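A rough sketch of what the kill step could look like from Scala (using
scala.sys.process rather than ProcessBuilder directly); sparkHome, masterUrl, and
driverId are placeholders, and this still leaves open how to obtain the driver ID
programmatically:

 import scala.sys.process._

 // Sketch only: shell out to the standalone kill command once a driver ID is known.
 def killDriver(sparkHome: String, masterUrl: String, driverId: String): Int =
   Seq(s"$sparkHome/bin/spark-class",
       "org.apache.spark.deploy.Client", "kill", masterUrl, driverId).!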

 Are there any programmatic methods by which I could get the driver ID submitted
 by my `ProcessBuilder` and query its status?

 Any Suggestions?

 —
 Best Regards!
 Yijie Shen


Question regarding some of the changes in [SPARK-3477]

2015-04-14 Thread Chester Chen
While working on upgrading to Spark 1.3.x, I noticed that the Client and
ClientArguments classes in the yarn module are now defined as private[spark]. I
know that this code is mostly used by the spark-submit code, but we call the Yarn
Client directly (without going through spark-submit) in our spark integration.
This change essentially forces us either 1) to fork the code, undo the private
prefix, and build the yarn component ourselves, or 2) to move our code into
org.apache.spark packages. Currently we are using approach #1.
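For illustration, a rough sketch of what approach #2 looks like; the package name
and the constructor signatures are assumptions based on the Spark 1.3 yarn Client,
not code we actually ship:

  // Sketch only: private[spark] is visible from any org.apache.spark.* package,
  // so a caller compiled into such a package can use yarn.Client directly.
  package org.apache.spark.deploy.yarn.bridge

  import org.apache.spark.SparkConf
  import org.apache.spark.deploy.yarn.{Client, ClientArguments}

  object DirectYarnLauncher {
    def submit(args: Array[String], conf: SparkConf): Unit = {
      val client = new Client(new ClientArguments(args, conf), conf) // signatures assumed
      client.run()
    }
  }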

So I am curious to know whether there is a compelling reason to make these Yarn
Client related classes private. Is there any possibility of making these Client
classes non-private?

thanks
Chester


Re: broadcast hang out

2015-03-15 Thread Chester Chen
Can you just replace Duration.Inf with a shorter duration? How about:

  import scala.concurrent.duration._
  val timeout = new Timeout(10 seconds)
  Await.result(result.future, timeout.duration)

  or

  val timeout = new FiniteDuration(10, TimeUnit.SECONDS)
  Await.result(result.future, timeout)

  or simply
  import scala.concurrent.duration._
  Await.result(result.future, 10 seconds)
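And if you want to turn the timeout into a more descriptive error rather than a
bare TimeoutException, a self-contained sketch (the helper name and the 30-second
default are illustrative only; only the standard library is assumed):

  import java.util.concurrent.TimeoutException
  import scala.concurrent.{Await, Promise}
  import scala.concurrent.duration._
  import scala.util.{Failure, Success, Try}

  // Sketch only: bound the wait on the fetch promise and turn a hang into an explicit error.
  def awaitBlock[T](result: Promise[T], limit: FiniteDuration = 30.seconds): T =
    Try(Await.result(result.future, limit)) match {
      case Success(value)               => value
      case Failure(e: TimeoutException) =>
        throw new RuntimeException(s"Block fetch timed out after $limit", e)
      case Failure(e)                   => throw e
    }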



On Sun, Mar 15, 2015 at 8:08 PM, lonely Feb lonely8...@gmail.com wrote:

 Hi all, I have run into a problem where torrent broadcast hangs in my
 spark cluster (1.2, standalone), which is particularly serious when the driver
 and executors are cross-region. When I read the broadcast code I found a sync
 block read here:
   def fetchBlockSync(host: String, port: Int, execId: String, blockId: String): ManagedBuffer = {
     // A monitor for the thread to wait on.
     val result = Promise[ManagedBuffer]()
     fetchBlocks(host, port, execId, Array(blockId),
       new BlockFetchingListener {
         override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit = {
           result.failure(exception)
         }
         override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
           val ret = ByteBuffer.allocate(data.size.toInt)
           ret.put(data.nioByteBuffer())
           ret.flip()
           result.success(new NioManagedBuffer(ret))
         }
       })

     Await.result(result.future, Duration.Inf)
   }

 It seems that the fetchBlockSync method does not have a timeout limit but
 waits forever. Can anybody show me how to control the timeout here?



FYI: Prof John Canny is giving a talk on Machine Learning at the limit in SF Big Analytics Meetup

2015-02-10 Thread Chester Chen
Just in case you are in San Francisco, we are having a meetup by Prof John
Canny

http://www.meetup.com/SF-Big-Analytics/events/220427049/


Chester


Re: Unit testing Master-Worker Message Passing

2014-10-15 Thread Chester Chen
You can call ActorSelection.resolveOne() to see if the actor is still there and
the path is correct. The method returns a future, and you can wait for it with a
timeout. This way, you know whether the actor is alive, already dead, or the
path is incorrect.

Another way is to send an Identify message to the ActorSelection; if it returns
the correct ActorIdentity message, then you can act on it; otherwise, ...
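A minimal sketch of the first approach, assuming Akka 2.3-style APIs (the helper
name is illustrative):

  import akka.actor.{ActorRef, ActorSelection}
  import akka.util.Timeout
  import scala.concurrent.Await

  // Sketch only: resolve the selection to a live ActorRef, or throw
  // (ActorNotFound / TimeoutException) if the actor is gone or the path is wrong.
  def resolveOrFail(selection: ActorSelection)(implicit timeout: Timeout): ActorRef =
    Await.result(selection.resolveOne(), timeout.duration)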

hope this helps

Chester

On Wed, Oct 15, 2014 at 1:38 PM, Matthew Cheah matthew.c.ch...@gmail.com
wrote:

 What's happening when I do this is that the Worker tries to get the Master
 actor by calling context.actorSelection(), and the RegisterWorker message
 gets sent to the dead letters mailbox instead of being picked up by
 expectMsg. I'm new to Akka and I've tried various ways of registering a
 mock master to no avail.

 I would think there would be at least some kind of test for master-worker
 message passing, no?

 On Wed, Oct 15, 2014 at 11:28 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

  I don’t think there are test cases for Worker itself
 
 
  You can
 
 
  val actorRef = TestActorRef[Master](Props(classOf[Master], ...))(actorSystem)
  actorRef.underlyingActor.receive(Heartbeat)
 
  and use expectMsg to test whether the Master replies with the correct message,
 assuming
  the Worker is absolutely correct
 
  Then in another test case to test if Worker can send register message to
  Master after receiving Master’s “re-register” instruction, (in this test
  case assuming that the Master is absolutely right)
 
  Best,
 
  --
  Nan Zhu
 
  On Wednesday, October 15, 2014 at 2:04 PM, Matthew Cheah wrote:
 
  Thanks, the example was helpful.
 
  However, testing the Worker itself is a lot more complicated than
  WorkerWatcher, since the Worker class is quite a bit more complex. Are
  there any tests that inspect the Worker itself?
 
  Thanks,
 
  -Matt Cheah
 
  On Tue, Oct 14, 2014 at 6:40 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 
  You can use akka testkit
 
  Example:
 
 
 
 https://github.com/apache/spark/blob/ef4ff00f87a4e8d38866f163f01741c2673e41da/core/src/test/scala/org/apache/spark/deploy/worker/WorkerWatcherSuite.scala
 
  --
  Nan Zhu
 
  On Tuesday, October 14, 2014 at 9:17 PM, Matthew Cheah wrote:
 
  Hi everyone,
 
  I’m adding some new message passing between the Master and Worker actors
 in
  order to address https://issues.apache.org/jira/browse/SPARK-3736 .
 
  I was wondering if these kinds of interactions are tested in the
 automated
  Jenkins test suite, and if so, where I could find some examples to help
 me
  do the same.
 
  Thanks!
 
  -Matt Cheah
 
 
 
 
 



Re: RFC: Deprecating YARN-alpha API's

2014-09-09 Thread Chester Chen
We were using it until recently; we are talking to our customers to see if
we can get off it.

Chester
Alpine Data Labs



On Tue, Sep 9, 2014 at 10:59 AM, Sean Owen so...@cloudera.com wrote:

 FWIW consensus from Cloudera folk seems to be that there's no need or
 demand on this end for YARN alpha. It wouldn't have an impact if it
 were removed sooner even.

 It will be a small positive to reduce complexity by removing this
 support, making it a little easier to develop for current YARN APIs.

 On Tue, Sep 9, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hi Everyone,
 
  This is a call to the community for comments on SPARK-3445 [1]. In a
  nutshell, we are trying to figure out timelines for deprecation of the
  YARN-alpha API's as Yahoo is now moving off of them. It's helpful for
  us to have a sense of whether anyone else uses these.
 
>   Please comment on the JIRA if you have feedback, thanks!
 
  [1] https://issues.apache.org/jira/browse/SPARK-3445
 
  - Patrick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




is Branch-1.1 SBT build broken for yarn-alpha ?

2014-08-20 Thread Chester Chen
I just updated to today's build and tried branch-1.1 for both yarn and
yarn-alpha.

For the yarn build, this command seems to work fine:

sbt/sbt -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 projects

For yarn-alpha,

sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha projects

I got the following. Any ideas?


Chester

᚛ |branch-1.1|$  sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha
projects

Using /Library/Java/JavaVirtualMachines/1.6.0_51-b11-457.jdk/Contents/Home
as default JAVA_HOME.

Note, this will be overridden by -java-home if it is set.

[info] Loading project definition from
/Users/chester/projects/spark/project/project

[info] Loading project definition from
/Users/chester/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/project

[warn] Multiple resolvers having different access mechanism configured with
same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).

[info] Loading project definition from /Users/chester/projects/spark/project

org.apache.maven.model.building.ModelBuildingException: 1 problem was
encountered while building the effective model for
org.apache.spark:spark-yarn-alpha_2.10:1.1.0

[FATAL] Non-resolvable parent POM: Could not find artifact
org.apache.spark:yarn-parent_2.10:pom:1.1.0 in central
(http://repo.maven.apache.org/maven2) and 'parent.relativePath' points at
wrong local POM @ line 20, column 11


 at
org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)

at
org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)

at
org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)

at
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)

at
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)

at
com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)

at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)

at
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)

at
com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

at
com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.AbstractTraversable.map(Traversable.scala:105)

at
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:129)

at
com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

at
com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.AbstractTraversable.map(Traversable.scala:105)

at
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:129)

at
com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)

at com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)

at SparkBuild$.projectDefinitions(SparkBuild.scala:165)

at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:458)

at sbt.Load$$anonfun$24.apply(Load.scala:415)

at sbt.Load$$anonfun$24.apply(Load.scala:415)

at scala.collection.immutable.Stream.flatMap(Stream.scala:442)

at sbt.Load$.loadUnit(Load.scala:415)

at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)

at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)

at
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:93)

at
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:92)

at sbt.BuildLoader.apply(BuildLoader.scala:143)

at sbt.Load$.loadAll(Load.scala:312)

at sbt.Load$.loadURI(Load.scala:264)

at sbt.Load$.load(Load.scala:260)

at sbt.Load$.load(Load.scala:251)

at sbt.Load$.apply(Load.scala:134)

at sbt.Load$.defaultLoad(Load.scala:37)

at sbt.BuiltinCommands$.doLoadProject(Main.scala:473)

at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:467)

at 

Re: is Branch-1.1 SBT build broken for yarn-alpha ?

2014-08-20 Thread Chester Chen
Just tried the master branch, and it works fine for yarn-alpha.


On Wed, Aug 20, 2014 at 4:39 PM, Chester Chen ches...@alpinenow.com wrote:

 I just updated today's build and tried branch-1.1 for both yarn and
 yarn-alpha.

 For yarn build, this command seem to work fine.

 sbt/sbt -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 projects

 for yarn-alpha

 sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha projects

 I got the following output.

 Any ideas?


 Chester

 ᚛ |branch-1.1|$  *sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha
 projects*

 Using /Library/Java/JavaVirtualMachines/1.6.0_51-b11-457.jdk/Contents/Home
 as default JAVA_HOME.

 Note, this will be overridden by -java-home if it is set.

 [info] Loading project definition from
 /Users/chester/projects/spark/project/project

 [info] Loading project definition from
 /Users/chester/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/project

 [warn] Multiple resolvers having different access mechanism configured
 with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
 project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).

 [info] Loading project definition from
 /Users/chester/projects/spark/project

 org.apache.maven.model.building.ModelBuildingException: 1 problem was
 encountered while building the effective model for
 org.apache.spark:spark-yarn-alpha_2.10:1.1.0

 *[FATAL] Non-resolvable parent POM: Could not find artifact
 org.apache.spark:yarn-parent_2.10:pom:1.1.0 in central (
 http://repo.maven.apache.org/maven2 http://repo.maven.apache.org/maven2)
 and 'parent.relativePath' points at wrong local POM @ line 20, column 11*


  at
 org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)

 at
 org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)

 at
 org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)

 at
 org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)

 at
 org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)

 at
 com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)

 at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.AbstractTraversable.map(Traversable.scala:105)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:129)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$$anonfun$12.apply(MavenProjectHelper.scala:129)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.AbstractTraversable.map(Traversable.scala:105)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:129)

 at
 com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)

 at
 com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)

 at SparkBuild$.projectDefinitions(SparkBuild.scala:165)

 at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:458)

 at sbt.Load$$anonfun$24.apply(Load.scala:415)

 at sbt.Load$$anonfun$24.apply(Load.scala:415)

 at scala.collection.immutable.Stream.flatMap(Stream.scala:442)

 at sbt.Load$.loadUnit(Load.scala:415)

 at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)

 at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)

 at
 sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:93)

 at
 sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:92)

 at sbt.BuildLoader.apply(BuildLoader.scala:143)

 at sbt.Load$.loadAll(Load.scala:312)

 at sbt.Load$.loadURI(Load.scala:264)

 at sbt.Load$.load(Load.scala:260

Re: Master compilation with sbt

2014-07-19 Thread Chester Chen
Works for me as well:


git branch

  branch-0.9

  branch-1.0

* master

Chesters-MacBook-Pro:spark chester$ git pull --rebase

remote: Counting objects: 578, done.

remote: Compressing objects: 100% (369/369), done.

remote: Total 578 (delta 122), reused 418 (delta 71)

Receiving objects: 100% (578/578), 432.42 KiB | 354.00 KiB/s, done.

Resolving deltas: 100% (122/122), done.

From https://github.com/apache/spark

   9c24974..2a73211  master -> origin/master

   8e5604b..c93f4a0  branch-0.9 -> origin/branch-0.9

   0b0b895..7611840  branch-1.0 -> origin/branch-1.0

From https://github.com/apache/spark

 * [new tag] v0.9.2-rc1 -> v0.9.2-rc1

First, rewinding head to replay your work on top of it...

Fast-forwarded master to 2a732110d46712c535b75dd4f5a73761b6463aa8.


Chesters-MacBook-Pro:spark chester$ sbt/sbt package



[info] Done packaging.

[success] Total time: 146 s, completed Jul 19, 2014 1:08:52 PM





On Sat, Jul 19, 2014 at 12:50 PM, Mark Hamstra m...@clearstorydata.com
wrote:

  project mllib
 .
 .
 .
  clean
 .
 .
 .
  compile
 .
 .
 .
 test

 ...all works fine for me @2a732110d46712c535b75dd4f5a73761b6463aa8


 On Sat, Jul 19, 2014 at 11:10 AM, Debasish Das debasish.da...@gmail.com
 wrote:

  I am at the reservoir sampling commit:
 
  commit 586e716e47305cd7c2c3ff35c0e828b63ef2f6a8
  Author: Reynold Xin r...@apache.org
  Date:   Fri Jul 18 12:41:50 2014 -0700
 
  sbt/sbt -Dhttp.nonProxyHosts=132.197.10.21
 
   project mllib
 
  [info] Set current project to spark-mllib (in build
  file:/Users/v606014/spark-master/)
 
   compile
 
  [trace] Stack trace suppressed: run last mllib/*:credentials for the full
  output.
 
  [trace] Stack trace suppressed: run last core/*:credentials for the full
  output.
 
  [error] (mllib/*:credentials) org.xml.sax.SAXParseException; lineNumber:
 4;
  columnNumber: 57; Element type "settings" must be followed by either
  attribute specifications, ">" or "/>".
 
  [error] (core/*:credentials) org.xml.sax.SAXParseException; lineNumber:
 4;
  columnNumber: 57; Element type "settings" must be followed by either
  attribute specifications, ">" or "/>".
 
  [error] Total time: 0 s, completed Jul 19, 2014 6:09:24 PM
  On Sat, Jul 19, 2014 at 11:02 AM, Debasish Das debasish.da...@gmail.com
 
  wrote:
 
   Hi,
  
   Is sbt still used for master compilation? I could compile for
   2.3.0-cdh5.0.2 using Maven, following the instructions from the website:
  
   http://spark.apache.org/docs/latest/building-with-maven.html
  
   But when I try to use sbt for local testing, I am getting
   some weird errors... Is sbt still used by developers? I am using
 JDK 7...
  
   org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 57; Element
   type "settings" must be followed by either attribute specifications, ">"
   or "/>".
  
   at
  
 
 com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
  
   at
  
 
 com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.seekCloseOfStartTag(XMLDocumentFragmentScannerImpl.java:1394)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1327)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$ContentDriver.scanRootElementHook(XMLDocumentScannerImpl.java:1292)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3122)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:880)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
  
   at
  
 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
  
   at
  
 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
  
   at
  
 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
  
   at
  
 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
  
   at
  
 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
  
   at
  
 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
  
   at
  
 
 

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
)

[info]   ...

[info] - resultant classpath for an application that defines a classpath
for MR *** FAILED ***

[info]   java.lang.ClassCastException: [Ljava.lang.String; cannot be cast
to java.lang.String

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$$anonfun$12.apply(ClientBaseSuite.scala:152)

[info]   at scala.Option.map(Option.scala:145)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.getFieldValue(ClientBaseSuite.scala:180)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$.init(ClientBaseSuite.scala:152)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures$lzycompute(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$4.apply$mcV$sp(ClientBaseSuite.scala:64)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$4.apply(ClientBaseSuite.scala:64)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$4.apply(ClientBaseSuite.scala:64)

[info]   at
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)

[info]   ...

[info] - resultant classpath for an application that defines both
classpaths, YARN and MR *** FAILED ***

[info]   java.lang.ClassCastException: [Ljava.lang.String; cannot be cast
to java.lang.String

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$$anonfun$12.apply(ClientBaseSuite.scala:152)

[info]   at scala.Option.map(Option.scala:145)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.getFieldValue(ClientBaseSuite.scala:180)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$.init(ClientBaseSuite.scala:152)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures$lzycompute(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$5.apply$mcV$sp(ClientBaseSuite.scala:73)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$5.apply(ClientBaseSuite.scala:73)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$5.apply(ClientBaseSuite.scala:73)

[info]   at
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)

[info]   ...

[info] - Local jar URIs















On Thu, Jul 17, 2014 at 12:44 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 To add, we've made some effort to get yarn-alpha to work with the 2.0.x line,
 but this was a time when YARN went through wild API changes.  The only line
 that the yarn-alpha profile is guaranteed to work against is the 0.23 line.


 On Thu, Jul 17, 2014 at 12:40 AM, Sean Owen so...@cloudera.com wrote:

  Are you setting -Pyarn-alpha? ./sbt/sbt -Pyarn-alpha, followed by
  projects, shows it as a module. You should only build yarn-stable
  *or* yarn-alpha at any given time.
 
  I don't remember the modules changing in a while. 'yarn-alpha' is for
  YARN before it stabilized, circa early Hadoop 2.0.x. 'yarn-stable' is
  for beta and stable YARN, circa late Hadoop 2.0.x and onwards. 'yarn'
  is code common to both, so should compile with yarn-alpha.
 
  What's the compile error, and are you setting yarn.version? The
  default is to use hadoop.version, but that defaults to 1.0.4 and there
  is no such YARN.
 
  Unless I missed it, I only see compile errors in yarn-stable, and you
  are trying to compile vs YARN alpha versions no?
 
  On Thu, Jul 17, 2014 at 5:39 AM, Chester Chen ches...@alpinenow.com
  wrote:
   Looking further, the yarn and yarn-stable modules are both for the stable
   version of YARN, which explains the compilation errors when using the
   2.0.5-alpha version of Hadoop.
  
   The module yarn-alpha (although it is still in SparkBuild.scala) is no
   longer there in the sbt console.
  
  
   projects
  
   [info] In file:/Users/chester/projects/spark/
  
   [info]assembly
  
   [info]bagel
  
   [info]catalyst
  
   [info]core
  
   [info]examples
  
   [info]graphx
  
   [info]hive
  
   [info]mllib
  
   [info]oldDeps
  
   [info]repl
  
   [info]spark
  
   [info]sql
  
   [info]streaming
  
   [info]streaming-flume
  
   [info]streaming-kafka
  
   [info]streaming-mqtt
  
   [info]streaming-twitter
  
   [info]streaming-zeromq
  
   [info]tools
  
   [info]yarn
  
   [info]  * yarn-stable
  
  
   On Wed, Jul 16, 2014 at 5:41 PM, Chester Chen ches...@alpinenow.com
  wrote:
  
   Hmm
   looks like a Build script issue:
  
   I run the command with :
  
   sbt/sbt clean *yarn/*test:compile
  
   but errors came from
  
   [error] 40 errors found
  
   [error] (*yarn-stable*/compile:compile) Compilation failed
  
  
   Chester
  
  
   On Wed, Jul 16, 2014 at 5:18 PM, Chester Chen ches...@alpinenow.com
   wrote:
  
   Hi, Sandy
  
   We do have some issue with this. The difference is in Yarn-Alpha
  and
   Yarn Stable

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
OK, I will create a PR.

thanks



On Thu, Jul 17, 2014 at 7:58 AM, Sean Owen so...@cloudera.com wrote:

 Looks like a real problem. I see it too. I think the same workaround
 found in ClientBase.scala needs to be used here. There, the fact that
 this field can be a String or String[] is handled explicitly. In fact
 I think you can just call to ClientBase for this? PR it, I say.

 On Thu, Jul 17, 2014 at 3:24 PM, Chester Chen ches...@alpinenow.com
 wrote:
  val knownDefMRAppCP: Seq[String] =
    getFieldValue[String, Seq[String]](classOf[MRJobConfig],
      "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
      Seq[String]())(a => a.split(","))
 
  will fail for yarn-alpha.
 
  sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha yarn-alpha/test
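
For illustration only (this is not the actual Spark patch), here is a minimal Scala
sketch of reading that field in a way that tolerates both signatures -- a plain String
on stable YARN versus a String[] on yarn-alpha -- assuming hadoop-mapreduce-client-core
is on the classpath; the object name is made up:

import org.apache.hadoop.mapreduce.MRJobConfig

// Hypothetical helper, not the real ClientBase code: read the default MR
// application classpath reflectively so the same code works whether the
// field is declared as String (stable YARN) or String[] (yarn-alpha).
object MRClasspathCompat {
  def defaultMRApplicationClasspath: Seq[String] =
    try {
      val field =
        classOf[MRJobConfig].getField("DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH")
      field.get(null) match {
        case s: String        => s.split(",").toSeq  // stable YARN: comma-separated String
        case a: Array[String] => a.toSeq             // yarn-alpha: String[]
        case _                => Seq.empty[String]
      }
    } catch {
      case _: NoSuchFieldException => Seq.empty[String] // field absent on some versions
    }
}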
 



Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:41:
object api is not a member of package org.apache.hadoop.yarn.client

[error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

[error]  ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:65:
not found: type AMRMClient

[error] val amClient: AMRMClient[ContainerRequest],

[error]   ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:389:
not found: type ContainerRequest

[error] ): ArrayBuffer[ContainerRequest] = {

[error]^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:388:
not found: type ContainerRequest

[error]   hostContainers: ArrayBuffer[ContainerRequest]

[error]   ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:405:
not found: type ContainerRequest

[error] val requestedContainers = new
ArrayBuffer[ContainerRequest](rackToCounts.size)

[error]   ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:434:
not found: type ContainerRequest

[error] val containerRequests: List[ContainerRequest] =

[error] ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:508:
not found: type ContainerRequest

[error] ): ArrayBuffer[ContainerRequest] = {

[error]^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:446:
not found: type ContainerRequest

[error] val hostContainerRequests = new
ArrayBuffer[ContainerRequest](preferredHostToCount.size)

[error] ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:458:
not found: type ContainerRequest

[error] val rackContainerRequests: List[ContainerRequest] =
createRackResourceRequests(

[error] ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:467:
not found: type ContainerRequest

[error] val containerRequestBuffer = new
ArrayBuffer[ContainerRequest](

[error]  ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:542:
not found: type ContainerRequest

[error] ): ArrayBuffer[ContainerRequest] = {

[error]^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:545:
value newInstance is not a member of object
org.apache.hadoop.yarn.api.records.Resource

[error] val resource = Resource.newInstance(memoryRequest,
executorCores)

[error] ^

[error]
/Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:550:
not found: type ContainerRequest

[error] val requests = new ArrayBuffer[ContainerRequest]()

[error]^

[error] 40 errors found

[error] (yarn-stable/compile:compile) Compilation failed

[error] Total time: 98 s, completed Jul 16, 2014 5:14:44 PM













On Wed, Jul 16, 2014 at 4:19 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Ron,

 I just checked and this bug is fixed in recent releases of Spark.

 -Sandy


 On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen ches...@alpinenow.com
 wrote:

 Ron,
 Which distribution and version of Hadoop are you using?

  I just looked at CDH5 (  hadoop-mapreduce-client-core-
 2.3.0-cdh5.0.0),

 MRJobConfig does have the field :

 java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

 Chester



 On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez zlgonza...@yahoo.com
 wrote:

 Hi,
   I was doing programmatic submission of Spark yarn jobs and I saw code
 in ClientBase.getDefaultYarnApplicationClasspath():

 val field =
 classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
 MRJobConfig doesn't have this field so the created launch env is
 incomplete. Workaround is to set yarn.application.classpath with the value
 from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.

 This results in having the spark job hang if the submission config is
 different from the default config. For example, if my resource manager port
 is 8050 instead of 8030, then the spark app

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Hmm
looks like a Build script issue:

I run the command with :

sbt/sbt clean *yarn/*test:compile

but errors came from

[error] 40 errors found

[error] (*yarn-stable*/compile:compile) Compilation failed


Chester


On Wed, Jul 16, 2014 at 5:18 PM, Chester Chen ches...@alpinenow.com wrote:

 Hi, Sandy

 We do have some issue with this. The difference is between Yarn-Alpha and
 Yarn-Stable. (I noticed that in the latest build, the module names have
 changed:
  yarn-alpha -> yarn
  yarn -> yarn-stable
 )

 For example:  MRJobConfig.class
 the field:
 DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH


 In Yarn-Alpha : the field returns   java.lang.String[]

   java.lang.String[] DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

 while in Yarn-Stable, it returns a String

   java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

 So in ClientBaseSuite.scala

 The following code:

 val knownDefMRAppCP: Seq[String] =
   getFieldValue[*String*, Seq[String]](classOf[MRJobConfig],
     "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
     Seq[String]())(a => *a.split(",")*)


 works for yarn-stable, but doesn't work for yarn-alpha.

 This is the only failure for the SNAPSHOT I downloaded 2 weeks ago.  I
 believe this can be refactored into the yarn-alpha module, with different
 tests according to the different API signatures.

  I just updated the master branch and the build doesn't even compile for the
 Yarn-Alpha (yarn) module. Yarn-Stable compiles with no errors and the tests passed.


 Does the Spark Jenkins job run against yarn-alpha ?





 Here is output from yarn-alpha compilation:

 I got the 40 compilation errors.

 sbt/sbt clean yarn/test:compile

 Using /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home as
 default JAVA_HOME.

 Note, this will be overridden by -java-home if it is set.

 [info] Loading project definition from
 /Users/chester/projects/spark/project/project

 [info] Loading project definition from
 /Users/chester/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/project

 [warn] Multiple resolvers having different access mechanism configured
 with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
 project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).

 [info] Loading project definition from
 /Users/chester/projects/spark/project

 NOTE: SPARK_HADOOP_VERSION is deprecated, please use
 -Dhadoop.version=2.0.5-alpha

 NOTE: SPARK_YARN is deprecated, please use -Pyarn flag.

 [info] Set current project to spark-parent (in build
 file:/Users/chester/projects/spark/)

 [success] Total time: 0 s, completed Jul 16, 2014 5:13:06 PM

 [info] Updating {file:/Users/chester/projects/spark/}core...

 [info] Resolving org.fusesource.jansi#jansi;1.4 ...

 [info] Done updating.

 [info] Updating {file:/Users/chester/projects/spark/}yarn...

 [info] Updating {file:/Users/chester/projects/spark/}yarn-stable...

 [info] Resolving org.fusesource.jansi#jansi;1.4 ...

 [info] Done updating.

 [info] Resolving commons-net#commons-net;3.1 ...

 [info] Compiling 358 Scala sources and 34 Java sources to
 /Users/chester/projects/spark/core/target/scala-2.10/classes...

 [info] Resolving org.fusesource.jansi#jansi;1.4 ...

 [info] Done updating.

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.scala:43:
 constructor TaskAttemptID in class TaskAttemptID is deprecated: see
 corresponding Javadoc for more information.

 [warn] new TaskAttemptID(jtIdentifier, jobId, isMap, taskId,
 attemptId)

 [warn] ^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:501:
 constructor Job in class Job is deprecated: see corresponding Javadoc for
 more information.

 [warn] val job = new NewHadoopJob(hadoopConfiguration)

 [warn]   ^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:634:
 constructor Job in class Job is deprecated: see corresponding Javadoc for
 more information.

 [warn] val job = new NewHadoopJob(conf)

 [warn]   ^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala:167:
 constructor TaskID in class TaskID is deprecated: see corresponding Javadoc
 for more information.

 [warn] new TaskAttemptID(new TaskID(jID.value, true, splitID),
 attemptID))

 [warn]   ^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala:188:
 method makeQualified in class Path is deprecated: see corresponding Javadoc
 for more information.

 [warn] outputPath.makeQualified(fs)

 [warn]^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:84:
 method isDir in class FileStatus is deprecated: see corresponding Javadoc
 for more information.

 [warn] if (!fs.getFileStatus(path).isDir

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Looking further, the yarn and yarn-stable modules are both for the stable version
of YARN, which explains the compilation errors when using the 2.0.5-alpha
version of Hadoop.

The module yarn-alpha (although it is still in SparkBuild.scala) is no
longer there in the sbt console.


 projects

[info] In file:/Users/chester/projects/spark/

[info]assembly

[info]bagel

[info]catalyst

[info]core

[info]examples

[info]graphx

[info]hive

[info]mllib

[info]oldDeps

[info]repl

[info]spark

[info]sql

[info]streaming

[info]streaming-flume

[info]streaming-kafka

[info]streaming-mqtt

[info]streaming-twitter

[info]streaming-zeromq

[info]tools

[info]yarn

[info]  * yarn-stable


On Wed, Jul 16, 2014 at 5:41 PM, Chester Chen ches...@alpinenow.com wrote:

 Hmm
 looks like a Build script issue:

 I run the command with :

 sbt/sbt clean *yarn/*test:compile

 but errors came from

 [error] 40 errors found

 [error] (*yarn-stable*/compile:compile) Compilation failed


 Chester


 On Wed, Jul 16, 2014 at 5:18 PM, Chester Chen ches...@alpinenow.com
 wrote:

 Hi, Sandy

 We do have some issue with this. The difference is between Yarn-Alpha and
 Yarn-Stable. (I noticed that in the latest build, the module names have
 changed:
  yarn-alpha -> yarn
  yarn -> yarn-stable
 )

 For example:  MRJobConfig.class
 the field:
 DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH


 In Yarn-Alpha : the field returns   java.lang.String[]

   java.lang.String[] DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

 while in Yarn-Stable, it returns a String

   java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

 So in ClientBaseSuite.scala

 The following code:

 val knownDefMRAppCP: Seq[String] =
   getFieldValue[*String*, Seq[String]](classOf[MRJobConfig],
     "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
     Seq[String]())(a => *a.split(",")*)


 works for yarn-stable, but doesn't work for yarn-alpha.

 This is the only failure for the SNAPSHOT I downloaded 2 weeks ago.  I
 believe this can be refactored into the yarn-alpha module, with different
 tests according to the different API signatures.

  I just updated the master branch and the build doesn't even compile for the
 Yarn-Alpha (yarn) module. Yarn-Stable compiles with no errors and the tests passed.


 Does the Spark Jenkins job run against yarn-alpha ?





 Here is output from yarn-alpha compilation:

 I got the 40 compilation errors.

 sbt/sbt clean yarn/test:compile

 Using /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home as
 default JAVA_HOME.

 Note, this will be overridden by -java-home if it is set.

 [info] Loading project definition from
 /Users/chester/projects/spark/project/project

 [info] Loading project definition from
 /Users/chester/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/project

 [warn] Multiple resolvers having different access mechanism configured
 with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
 project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).

 [info] Loading project definition from
 /Users/chester/projects/spark/project

 NOTE: SPARK_HADOOP_VERSION is deprecated, please use
 -Dhadoop.version=2.0.5-alpha

 NOTE: SPARK_YARN is deprecated, please use -Pyarn flag.

 [info] Set current project to spark-parent (in build
 file:/Users/chester/projects/spark/)

 [success] Total time: 0 s, completed Jul 16, 2014 5:13:06 PM

 [info] Updating {file:/Users/chester/projects/spark/}core...

 [info] Resolving org.fusesource.jansi#jansi;1.4 ...

 [info] Done updating.

 [info] Updating {file:/Users/chester/projects/spark/}yarn...

 [info] Updating {file:/Users/chester/projects/spark/}yarn-stable...

 [info] Resolving org.fusesource.jansi#jansi;1.4 ...

 [info] Done updating.

 [info] Resolving commons-net#commons-net;3.1 ...

 [info] Compiling 358 Scala sources and 34 Java sources to
 /Users/chester/projects/spark/core/target/scala-2.10/classes...

 [info] Resolving org.fusesource.jansi#jansi;1.4 ...

 [info] Done updating.

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.scala:43:
 constructor TaskAttemptID in class TaskAttemptID is deprecated: see
 corresponding Javadoc for more information.

 [warn] new TaskAttemptID(jtIdentifier, jobId, isMap, taskId,
 attemptId)

 [warn] ^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:501:
 constructor Job in class Job is deprecated: see corresponding Javadoc for
 more information.

 [warn] val job = new NewHadoopJob(hadoopConfiguration)

 [warn]   ^

 [warn]
 /Users/chester/projects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:634:
 constructor Job in class Job is deprecated: see corresponding Javadoc for
 more information.

 [warn] val job = new NewHadoopJob(conf)

 [warn]   ^

 [warn]
 /Users/chester/projects/spark

Re: Application level progress monitoring and communication

2014-06-30 Thread Chester Chen
Reynold,
thanks for the reply. It's true, this is more about YARN communication
than Spark itself. But it is a general enough problem for all YARN cluster-mode
applications, so I thought I would just reach out to the community.

  If we choose an Akka-based solution, then this is related to Spark, as
there is only one Akka actor system per JVM.

  Thanks for the suggestion regarding passing the client IP address. Initially I was
only thinking about how to find out the IP address of the Spark driver node.

  Reporting progress is just one of the use cases; we are also considering
stopping Spark jobs and interactive query jobs.

This gives me something to start with. I will try Akka first and will
let the community know once we get somewhere.

thanks
Chester


On Sun, Jun 29, 2014 at 11:07 PM, Reynold Xin r...@databricks.com wrote:

 This isn't exactly about Spark itself, more about how an application on
 YARN/Mesos can communicate with another one.

 How about your application launch program just takes in a parameter (or env
 variable or command line argument) for the IP address of your client
 application, and just sends updates to it? You basically just want to send
 messages to report progress. You can do it in a lot of different ways,
 such as Akka, a custom REST API, Thrift ... I think any of them will do.




 On Sun, Jun 29, 2014 at 7:57 PM, Chester Chen ches...@alpinenow.com
 wrote:

  Hi Spark dev community:
 
  I have several questions regarding Application and Spark communication
 
  1) Application Level Progress Monitoring
 
  Currently, our application runs Spark jobs in YARN cluster mode.
  This works well so far, but we would like to monitor application-level
  progress (not Spark system-level progress).
 
  For example,
  if we are doing machine learning training, I would like to send some
  messages back to our application with the current status of the training,
  the number of iterations, etc., via an API.
 
  We can't use YARN client mode for this purpose as we are running the Spark
  application in a servlet container (Tomcat/Jetty). If we run in yarn-client
  mode, we will be limited to one SparkContext per JVM.
 
  So we are considering leveraging Akka messaging, essentially creating
  another actor to send messages back to the client application.
  Notice that Spark already has an Akka ActorSystem defined for each
  executor. All we need is to find the actor address (host, port) of the
  Spark driver.
 
  The trouble is that the driver's host and port are not known until later,
  when the Resource Manager assigns the driver to a node. How do we
  communicate the host and port info back to the client application?
 
  Maybe there is a YARN API to obtain this information from the YARN client.
 
 
  2) Application and Spark job communication in YARN cluster mode.
 
  There are several use cases that we think may require communication
  between the client-side application and the running Spark job.
 
   One example:
  * Try to stop a running job -- while the job is running, abort the
  long-running job in YARN.
 
  Again, we are thinking of using an Akka actor to send a STOP-job message.
 
 
 
  So here are some questions:
 
  * Is there any work regarding this area in the community?
 
  * What do you think of the Akka approach? Alternatives?
 
  * Is there a way to get Spark's Akka host and port from the YARN Resource
  Manager to the YARN client?
 
  Any suggestions welcome
 
  Thanks
  Chester
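
For illustration only, a minimal sketch of the Akka-based progress reporting discussed
in this thread, assuming classic Akka actors with remoting enabled and that the client's
host and port are handed to the Spark job (for example as command-line arguments); all
names here (TrainingProgress, ProgressListener, and so on) are made up:

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message carrying application-level progress from the driver.
case class TrainingProgress(iteration: Int, message: String)

// Client side: an actor that receives progress updates from the Spark driver.
class ProgressListener extends Actor {
  def receive: Receive = {
    case TrainingProgress(iter, msg) => println(s"iteration $iter: $msg")
  }
}

object ClientSide {
  // The client starts its own actor system (with Akka remoting configured)
  // on a known host/port and passes that address to the Spark job.
  def start(): ActorSystem = {
    val system = ActorSystem("client")
    system.actorOf(Props[ProgressListener], "progress")
    system
  }
}

object DriverSide {
  // Inside the YARN-cluster driver: look up the client's listener by the
  // address it advertised and send progress messages as the job runs, e.g.
  //   reporter(system, host, port) ! TrainingProgress(10, "10 iterations done")
  def reporter(system: ActorSystem, clientHost: String, clientPort: Int) =
    system.actorSelection(s"akka.tcp://client@$clientHost:$clientPort/user/progress")
}

The same channel could also carry a STOP message for the job-abort use case mentioned
above.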
 



Re: spark config params conventions

2014-03-14 Thread Chester Chen
Based on the Typesafe Config maintainer's response, with the latest version of
Typesafe Config the double quotes are no longer needed for a key like
"spark.speculation", so you don't need code to strip the quotes.



Chester
Alpine data labs

Sent from my iPhone

On Mar 12, 2014, at 2:50 PM, Aaron Davidson ilike...@gmail.com wrote:

 One solution for typesafe config is to use
 "spark.speculation" = true
 
 Typesafe will recognize the key as a string rather than a path, so the name
 will actually be "spark.speculation", so you need to handle this
 contingency when passing the config options to spark (stripping the quotes
 from the key).
 
 Solving this in Spark itself is a little tricky because there are ~5 such 
 conflicts (spark.serializer, spark.speculation, spark.locality.wait, 
 spark.shuffle.spill, and spark.cleaner.ttl), some of which are used pretty 
 frequently. We could provide aliases for all of these in Spark, but actually 
 deprecating the old ones would affect many users, so we could only do that if 
 enough users would benefit from fully hierarchical config options.
 
 
 
 On Wed, Mar 12, 2014 at 9:24 AM, Mark Hamstra m...@clearstorydata.com wrote:
 That's the whole reason why some of the intended configuration changes were 
 backed out just before the 0.9.0 release.  It's a well-known issue, even if 
 a completely satisfactory solution isn't as well-known and is probably
 something we should do another iteration on.
 
 
 On Wed, Mar 12, 2014 at 9:10 AM, Koert Kuipers ko...@tresata.com wrote:
 I am reading the spark configuration params from another configuration
 object (typesafe config) before setting them as system properties.
 
 I noticed typesafe config has trouble with settings like:
 spark.speculation=true
 spark.speculation.interval=0.5
 
 The issue seems to be that if spark.speculation is a container that has
 more values inside, then it cannot also be a value itself, I think. So this
 would work fine:
 spark.speculation.enabled=true
 spark.speculation.interval=0.5
 
 Just a heads up. I would probably suggest we avoid this situation.
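
For reference, a minimal sketch (not from this thread) of the bridging code described
above: reading spark.* settings from a Typesafe Config and copying them into system
properties, stripping the surrounding quotes that older versions of the library
required around keys such as "spark.speculation". The object name is made up:

import com.typesafe.config.{Config, ConfigFactory}
import scala.collection.JavaConverters._

// Hypothetical bridge: copy spark.* entries from a Typesafe Config into
// system properties so Spark can pick them up. A quoted key such as
// "spark.speculation" comes back from entrySet() with literal double
// quotes around it, so strip them before setting the property.
object SparkConfigBridge {
  def applySparkSettings(config: Config): Unit =
    config.entrySet().asScala.foreach { entry =>
      val key = entry.getKey.stripPrefix("\"").stripSuffix("\"")
      if (key.startsWith("spark.")) {
        System.setProperty(key, entry.getValue.unwrapped().toString)
      }
    }
}

// Usage sketch, assuming an application.conf containing entries like
//   "spark.speculation" = true
//   spark.speculation.interval = 0.5
// SparkConfigBridge.applySparkSettings(ConfigFactory.load())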