Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Nan Zhu
Just curious, what happened to Google’s Spark operator?

On Thu, Nov 9, 2023 at 19:12 Ilan Filonenko  wrote:

> +1
>
> On Thu, Nov 9, 2023 at 7:43 PM Ryan Blue  wrote:
>
>> +1
>>
>> On Thu, Nov 9, 2023 at 4:23 PM Hussein Awala  wrote:
>>
>>> +1 for creating an official Kubernetes operator for Apache Spark
>>>
>>> On Fri, Nov 10, 2023 at 12:38 AM huaxin gao 
>>> wrote:
>>>
 +1

>>>
 On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:

> +1
>
> To be completely transparent, I am employed in the same department as
> Zhou at Apple.
>
> I support this proposal, given the community adoption we have witnessed
> following the release of the Flink Kubernetes operator, which streamlined
> Flink deployment on Kubernetes.
>
> A well-maintained official Spark Kubernetes operator is essential for
> our Spark community as well.
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Nov 9, 2023, at 12:05 PM, Zhou Jiang 
> wrote:
>
> Hi Spark community,
> I'm reaching out to initiate a conversation about the possibility of
> developing a Java-based Kubernetes operator for Apache Spark. Following 
> the
> operator pattern (
> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
> ),
> Spark users may manage applications and related components seamlessly 
> using
> native tools like kubectl. The primary goal is to simplify the Spark user
> experience on Kubernetes, minimizing the learning curve and operational
> complexity and thereby enabling users to focus on Spark application
> development.
> Although there are several open-source Spark on Kubernetes operators
> available, none of them are officially integrated into the Apache Spark
> project. As a result, these operators may lack active support and
> development for new features. Within this proposal, our aim is to 
> introduce
> a Java-based Spark operator as an integral component of the Apache Spark
> project. This solution has been employed internally at Apple for multiple
> years, operating millions of executors in real production environments. 
> The
> use of Java in this solution is intended to accommodate a wider user and
> contributor audience, especially those who are not familiar with Scala.
> Ideally, this operator should have its dedicated repository, similar
> to Spark Connect Golang or Spark Docker, allowing it to maintain a loose
> connection with the Spark release cycle. This model is also followed by 
> the
> Apache Flink Kubernetes operator.
> We believe that this project holds the potential to evolve into a
> thriving community project over the long run. A comparison can be drawn
> with the Flink Kubernetes Operator: Apple has open-sourced its internal Flink
> Kubernetes operator, making it a part of the Apache Flink project (
> https://github.com/apache/flink-kubernetes-operator
> ).
> This move has gained wide industry adoption and contributions from the
> community. In a mere year, the Flink operator has garnered more than 600
> stars and has attracted contributions from over 80 contributors. This
> showcases the level of community interest and collaborative momentum that
> can be achieved in similar scenarios.
> More details can be found in the SPIP doc: Spark Kubernetes Operator
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> 

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Nan Zhu
 For EMR, I think they show 3.1.2-amazon in the Spark UI, no?


On Wed, Jun 7, 2023 at 11:30 Grisha Weintraub 
wrote:

> Hi,
>
> I am not taking sides here, but just for fairness, I think it should be
> noted that AWS EMR does exactly the same thing.
> We choose the EMR version (e.g., 6.4.0) and it has an associated Spark
> version (e.g., 3.1.2).
> The Spark version here is not the original Apache version but AWS Spark
> distribution.
>
> On Wed, Jun 7, 2023 at 8:24 PM Dongjoon Hyun 
> wrote:
>
>> I disagree with you in several ways.
>>
>> The following is not a *minor* change like the given examples
>> (alterations to the start-up and shutdown scripts, configuration files,
>> file layout etc.).
>>
>> > The change you cite meets the 4th point, minor change, made for
>> integration reasons.
>>
>> The following is also wrong. There was no such state of Apache Spark 3.4.0
>> after the 3.4.0 tag was created. The Apache Spark community didn't allow the
>> Scala-reverting patches in either the `master` branch or `branch-3.4`.
>>
>> > There is no known technical objection; this was after all at one point
>> the state of Apache Spark.
>>
>> Is the following your main point? So, you are selling a box "including
>> Harry Potter by J. K. Rowling, whose main character is Barry instead of
>> Harry", but it's okay because you didn't sell the book itself? And, as a
>> cloud vendor, you lent out the box instead of selling it, like private
>> libraries do?
>>
>> > There is no standalone distribution of Apache Spark anywhere here.
>>
>> We are not asking for a big thing. Why are you so reluctant to say you are
>> not "Apache Spark 3.4.0" by simply saying "Apache Spark 3.4.0-databricks"?
>> What is the marketing reason here?
>>
>> Dongjoon.
>>
>>
>> On Wed, Jun 7, 2023 at 9:27 AM Sean Owen  wrote:
>>
>>> Hi Dongjoon, I think this conversation is not advancing anymore. I
>>> personally consider the matter closed unless you can find other support or
>>> respond with more specifics. While this perhaps should be on private@,
>>> I think it's not wrong as an instructive discussion on dev@.
>>>
>>> I don't believe you've made a clear argument about the problem, or how
>>> it relates specifically to policy. Nevertheless I will show you my logic.
>>>
>>> You are asserting that a vendor cannot call a product Apache Spark 3.4.0
>>> if it omits a patch updating a Scala maintenance version. This difference
>>> has no known impact on usage, as far as I can tell.
>>>
>>> Let's see what policy requires:
>>>
>>> 1/ All source code changes must meet at least one of the acceptable
>>> changes criteria set out below:
>>> - The change has been accepted by the relevant Apache project community for
>>> inclusion in a future release. Note that the process used to accept changes
>>> and how that acceptance is documented varies between projects.
>>> - A change is a fix for an undisclosed security issue; and the fix is
>>> not publicly disclosed as a security fix; and the Apache project has been
>>> notified of both the issue and the proposed fix; and the PMC has rejected
>>> neither the vulnerability report nor the proposed fix.
>>> - A change is a fix for a bug; and the Apache project has been notified
>>> of both the bug and the proposed fix; and the PMC has rejected neither the
>>> bug report nor the proposed fix.
>>> - Minor changes (e.g. alterations to the start-up and shutdown scripts,
>>> configuration files, file layout etc.) to integrate with the target
>>> platform providing the Apache project has not objected to those changes.
>>>
>>> The change you cite meets the 4th point, minor change, made for
>>> integration reasons. There is no known technical objection; this was after
>>> all at one point the state of Apache Spark.
>>>
>>>
>>> 2/ A version number must be used that both clearly differentiates it
>>> from an Apache Software Foundation release and clearly identifies the
>>> Apache Software Foundation version on which the software is based.
>>>
>>> Keep in mind the product here is not "Apache Spark", but the "Databricks
>>> Runtime 13.1 (including Apache Spark 3.4.0)". That is, there is far more
>>> than a version number differentiating this product from Apache Spark. There
>>> is no standalone distribution of Apache Spark anywhere here. I believe that
>>> easily matches the intent.
>>>
>>>
>>> 3/ The documentation must clearly identify the Apache Software
>>> Foundation version on which the software is based.
>>>
>>> Clearly, yes.
>>>
>>>
>>> 4/ The end user expects that the distribution channel will back-port
>>> fixes. It is not necessary to back-port all fixes. Selection of fixes to
>>> back-port must be consistent with the update policy of that distribution
>>> channel.
>>>
>>> I think this is safe to say too. Indeed this explicitly contemplates not
>>> back-porting a change.
>>>
>>>
>>> Backing up, you can see from this document that the spirit of it is:
>>> don't include changes in your own Apache Foo x.y that aren't wanted by the
>>> project, and still cal

Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Nan Zhu
I am not sure if it is a good practice to have breaking changes in
dependencies for maintenance releases

On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:

> Hello,
>
> Avro 1.8.2 to 1.9.1 is a binary incompatible update, and it appears that
> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see thread on
> dev@parquet
> 
> ).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
> Thank you in advance,
>
>michael
>


Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread Nan Zhu
Just filed a JIRA at https://issues.apache.org/jira/browse/SPARK-26862

This issue only happens in 2.4.0, not in 2.3.2.

Would anyone help to look into that?



On Tue, Feb 12, 2019 at 10:41 AM DB Tsai  wrote:

> Great. I'll prepare the release for voting. Thanks!
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
>
> > On Feb 12, 2019, at 4:11 AM, Wenchen Fan  wrote:
> >
> > +1 for 2.4.1
> >
> > On Tue, Feb 12, 2019 at 7:55 PM Hyukjin Kwon 
> wrote:
> > +1 for 2.4.1
> >
> > On Tue, Feb 12, 2019 at 4:56 PM, Dongjin Lee wrote:
> > > SPARK-23539 is a non-trivial improvement, so probably would not be
> back-ported to 2.4.x.
> >
> > Got it. It seems reasonable.
> >
> > Committers:
> >
> > Please don't omit SPARK-23539 from 2.5.0. The Kafka community needs this
> feature.
> >
> > Thanks,
> > Dongjin
> >
> > On Tue, Feb 12, 2019 at 1:50 PM Takeshi Yamamuro 
> wrote:
> > +1, too.
> > branch-2.4 has accumulated too many commits:
> >
> https://github.com/apache/spark/compare/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d...af3c7111efd22907976fc8bbd7810fe3cfd92092
> >
> > On Tue, Feb 12, 2019 at 12:36 PM Dongjoon Hyun 
> wrote:
> > Thank you, DB.
> >
> > +1, Yes. It's time for preparing 2.4.1 release.
> >
> > Bests,
> > Dongjoon.
> >
> > On 2019/02/12 03:16:05, Sean Owen  wrote:
> > > I support a 2.4.1 release now, yes.
> > >
> > > SPARK-23539 is a non-trivial improvement, so probably would not be
> > > back-ported to 2.4.x. SPARK-26154 does look like a bug whose fix could
> > > be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for
> > > it, but it could go in if otherwise ready.
> > >
> > >
> > > On Mon, Feb 11, 2019 at 5:20 PM Dongjin Lee 
> wrote:
> > > >
> > > > Hi DB,
> > > >
> > > > Could you add SPARK-23539[^1] into 2.4.1? I opened the PR[^2] a
> little while ago, but it was not included in 2.3.0 nor did it get enough review.
> > > >
> > > > Thanks,
> > > > Dongjin
> > > >
> > > > [^1]: https://issues.apache.org/jira/browse/SPARK-23539
> > > > [^2]: https://github.com/apache/spark/pull/22282
> > > >
> > > > On Tue, Feb 12, 2019 at 6:28 AM Jungtaek Lim 
> wrote:
> > > >>
> > > >> Given SPARK-26154 [1] is a correctness issue and PR [2] is
> submitted, I hope it can be reviewed and included within Spark 2.4.1 -
> otherwise it will be a long-lived correctness issue.
> > > >>
> > > >> Thanks,
> > > >> Jungtaek Lim (HeartSaVioR)
> > > >>
> > > >> 1. https://issues.apache.org/jira/browse/SPARK-26154
> > > >> 2. https://github.com/apache/spark/pull/23634
> > > >>
> > > >>
> > > >>> On Tue, Feb 12, 2019 at 6:17 AM, DB Tsai wrote:
> > > >>>
> > > >>> Hello all,
> > > >>>
> > > >>> I am preparing to cut a new Apache 2.4.1 release as there are many
> bugs and correctness issues fixed in branch-2.4.
> > > >>>
> > > >>> The list of addressed issues is at
> https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC
> > > >>>
> > > >>> Let me know if you have any concerns or any PR you would like to
> get in.
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >>>
> -
> > > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >>>
> > > >
> > > >
> > > > --
> > > > Dongjin Lee
> > > >
> > > > A hitchhiker in the mathematical world.
> > > >
> > > > github: github.com/dongjinleekr
> > > > linkedin: kr.linkedin.com/in/dongjinleekr
> > > > speakerdeck: speakerdeck.com/dongjin
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> >
> > --
> > ---
> > Takeshi Yamamuro
> >
> >
> > --
> > Dongjin Lee
> >
> > A hitchhiker in the mathematical world.
> >
> > github: github.com/dongjinleekr
> > linkedin: kr.linkedin.com/in/dongjinleekr
> > speakerdeck: speakerdeck.com/dongjin
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Nan Zhu
…how I skipped the last part

On Tue, May 8, 2018 at 11:16 AM, Reynold Xin  wrote:

> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> wrote, wasn't it?
>
> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
>
>> besides that, one of the things that is needed by multiple frameworks is
>> to schedule tasks in a single wave
>>
>> i.e.
>>
>> if a framework like xgboost/mxnet requires 50 parallel workers, Spark
>> should provide the capability to ensure that either all 50 tasks run at
>> once, or the complete application/job is quit after some timeout
>> period
>>
>> Best,
>>
>> Nan
>>
>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin  wrote:
>>
>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>> single task, retry the entire stage, and the entire stage of tasks need to
>>> be scheduled all at once.
>>>
>>>
>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
>>>>
>>>>>
>>>>>>- Fault tolerance and execution model: Spark assumes fine-grained
>>>>>>task recovery, i.e. if something fails, only that task is rerun. This
>>>>>>doesn’t match the execution model of distributed ML/DL frameworks 
>>>>>> that are
>>>>>>typically MPI-based, and rerunning a single task would lead to the 
>>>>>> entire
>>>>>>system hanging. A whole stage needs to be re-run.
>>>>>>
>>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>> but also useful for scaling MLlib algorithms. One of my earliest attempts
>>>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>>>> with some compromised solutions. With the new execution model, we can set
>>>>> up a hybrid cluster and do all-reduce properly.
>>>>>
>>>>>
>>>> Is there a particular new execution model you are referring to, or do we
>>>> plan to investigate a new execution model? For the MPI-like model, we
>>>> also need gang scheduling (i.e. schedule all tasks at once or none of them),
>>>> and I don't think we have support for that in the scheduler right now.
>>>>
>>>>>
>>>>>> --
>>>>>
>>>>> Xiangrui Meng
>>>>>
>>>>> Software Engineer
>>>>>
>>>>> Databricks Inc. <http://databricks.com/>
>>>>>
>>>>
>>>>
>>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Nan Zhu
besides that, one of the things that is needed by multiple frameworks is
to schedule tasks in a single wave

i.e.

if a framework like xgboost/mxnet requires 50 parallel workers, Spark
should provide the capability to ensure that either all 50 tasks run at
once, or the complete application/job is quit after some timeout
period
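
A minimal sketch of the desired all-or-nothing behaviour, using only the public
SparkContext API (illustrative only; for simplicity it assumes one task slot per
executor, and the helper name is made up):

import org.apache.spark.SparkContext

object GangScheduling {
  // Wait until at least `numWorkers` executors have registered, or give up
  // after `timeoutMs` so the job fails fast instead of hanging forever.
  def awaitWorkers(sc: SparkContext, numWorkers: Int, timeoutMs: Long): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      // getExecutorMemoryStatus also lists the driver, hence the "- 1".
      val registeredExecutors = sc.getExecutorMemoryStatus.size - 1
      if (registeredExecutors >= numWorkers) return true
      Thread.sleep(1000)
    }
    false
  }
}

// Usage: quit the whole application instead of hanging.
// if (!GangScheduling.awaitWorkers(sc, numWorkers = 50, timeoutMs = 60000L)) sc.stop()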

Best,

Nan

On Tue, May 8, 2018 at 11:10 AM, Reynold Xin  wrote:

> I think that's what Xiangrui was referring to. Instead of retrying a
> single task, retry the entire stage, and the entire stage of tasks need to
> be scheduled all at once.
>
>
> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>>
>>>
 - Fault tolerance and execution model: Spark assumes fine-grained
task recovery, i.e. if something fails, only that task is rerun. This
doesn’t match the execution model of distributed ML/DL frameworks that are
typically MPI-based, and rerunning a single task would lead to the entire
system hanging. A whole stage needs to be re-run.

 This is not only useful for integrating with 3rd-party frameworks, but
>>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>> ). But we ended up
>>> with some compromised solutions. With the new execution model, we can set
>>> up a hybrid cluster and do all-reduce properly.
>>>
>>>
>> Is there a particular new execution model you are referring to, or do we
>> plan to investigate a new execution model? For the MPI-like model, we
>> also need gang scheduling (i.e. schedule all tasks at once or none of them),
>> and I don't think we have support for that in the scheduler right now.
>>
>>>
 --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. <http://databricks.com/>
>>>
>>
>>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Nan Zhu
+1  (non-binding), tested with internal workloads and benchmarks

On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust 
wrote:

> +1 all our pipelines have been running the RC for several days now.
>
> On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
> wrote:
>
>> +1 (non-binding).
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:
>>>
 +1 (binding) in Spark SQL, Core and PySpark.

 Xiao

 2018-02-24 14:49 GMT-08:00 Ricardo Almeida <
 ricardo.alme...@actnowib.com>:

> +1 (non-binding)
>
> same as previous RC
>
> On 24 February 2018 at 11:10, Hyukjin Kwon 
> wrote:
>
>> +1
>>
>> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>>
>>> +1
>>> Tests passed and additionally ran Arrow related tests and did some
>>> perf checks with python 2.7.14
>>>
>>> On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau 
>>> wrote:
>>>
 Note: given the state of Jenkins I'd love to see Bryan Cutler or
 someone with Arrow experience sign off on this release.

 On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian 
 wrote:

> +1 (binding)
>
> Passed all the tests, looks good.
>
> Cheng
>
> On 2/23/18 15:00, Holden Karau wrote:
>
> +1 (binding)
> PySpark artifacts install in a fresh Py3 virtual env
>
> On Feb 23, 2018 7:55 AM, "Denny Lee" 
> wrote:
>
>> +1 (non-binding)
>>
>> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
>> joshgoldsboroughs...@gmail.com> wrote:
>>
>>> New to testing out Spark RCs for the community but I was able to
>>> run some of the basic unit tests without error so for what it's 
>>> worth, I'm
>>> a +1.
>>>
>>> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
>>> samee...@apache.org> wrote:
>>>
 Please vote on releasing the following candidate as Apache
 Spark version 2.3.0. The vote is open until Tuesday February 27, 
 2018 at
 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes 
 are cast.


 [ ] +1 Release this package as Apache Spark 2.3.0

 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see
 https://spark.apache.org/

 The tag to be voted on is v2.3.0-rc5:
 https://github.com/apache/spark/tree/v2.3.0-rc5
 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)

 List of JIRA tickets resolved in this release can be found
 here: https://issues.apache.org/jira/projects/SPARK/versions/12339551

 The release files, including signatures, digests, etc. can be
 found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/

 Release artifacts are signed with the following key:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1266/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html


 FAQ

 ===
 What are the unresolved issues targeted for 2.3.0?
 ===

 Please see https://s.apache.org/oXKi. At the time of writing,
 there are currently no known release blockers.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by
 taking an existing Spark workload and running on this release 
 candidate,
 then reporting any regressions.

 If you're working in PySpark you can set up a virtual env and
 install the current RC and see if anything important breaks, in the
 Java/Scala you can add the staging repository to your projects 
 resolvers
 and test with the RC (make sure to clean up the artifact cache 
 before/after
 so you don't end up building with an out-of-date RC going forward).

 ===
 What s

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
nvm

On Tue, Jan 9, 2018 at 9:42 AM, Nan Zhu  wrote:

> Hi, all
>
> Out of curiosity, I just found a bunch of Palantir releases under
> org.apache.spark in Maven Central (https://mvnrepository.com/
> artifact/org.apache.spark/spark-core_2.11)?
>
> Is it on purpose?
>
> Best,
>
> Nan
>
>
>


Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
Hi, all

Out of curiosity, I just found a bunch of Palantir releases under
org.apache.spark in Maven Central (
https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11)?

Is it on purpose?

Best,

Nan


Request for review of SPARK-22599

2017-11-29 Thread Nan Zhu
Hi, all

When we ran performance tests for Spark, we found that enabling the table cache does not
bring the expected speedup compared to cloud storage + Parquet in many
scenarios. We identified that the performance cost comes from the fact
that the current InMemoryRelation/InMemoryTableScanExec traverses the
complete cached table even for highly selective queries. Compared to
Parquet, which uses the file footer to skip unnecessary parts of the
file, execution with the cached table is slower.
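
To make the idea concrete, here is a minimal sketch of the batch-skipping
approach (illustrative only, not the code in the PR): keep per-batch min/max
statistics when caching, and skip whole cached batches whose value range cannot
match a selective predicate, much like Parquet skips row groups via footer stats.

case class BatchStats(min: Long, max: Long)
case class CachedBatch[T](stats: BatchStats, rows: Seq[T])

// Only scan batches whose [min, max] range overlaps the predicate range.
def selectiveScan[T](batches: Seq[CachedBatch[T]], lower: Long, upper: Long): Seq[T] =
  batches
    .filter(b => b.stats.max >= lower && b.stats.min <= upper)
    .flatMap(_.rows)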

We have filed JIRA in https://issues.apache.org/jira/browse/SPARK-22599 and
have the corresponding PR in https://github.com/apache/spark/pull/19810
(design doc:
https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing,
which is also linked in JIRA/PR)

Our performance evaluation suggests that we gain up to 41% speedup
compared to the current implementation (
https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing
)

Please share your thoughts to help us improve the optimization of
in-memory table scanning in Spark.

Best,

Nan


Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Nan Zhu
I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
blocker

Best,

Nan

On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
wrote:

> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If you're working on something for Spark 2.1.1 and it doesn't show up in
> this list, please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working on in there which should be
> targeted for 2.1.1, please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690  - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035  - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522 
>  - --executor-memory flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025  - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955  - Update
> run-tests to support conda [ Part of Dropping 2.6 support -- which we
> shouldn't do in a minor release -- but also fixes pip installability tests
> to run in Jenkins ]-  PR failing Jenkins (I need to poke this some more,
> but seems like 2.7 support works but some other issues. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612  - Tests
> failing with timeout - No PR per se, but it seems unrelated to the 2.1.1
> release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570  - Allow
> to disable hive in pyspark shell - https://github.com/apache/spark/pull/16906
> PR exists but it's difficult to add automated tests for
> this (although if SPARK-19955
>  gets in would make
> testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613  - Flaky
> test: StateStoreRDDSuite.versioning and immutability - It's not targeted
> for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
> this for 2.2?
>  ML:
>   SPARK-19759 
>  - ALSModel.predict on Dataframes : potential optimization by not using
> blas - No PR consider re-targeting unless someone has a PR waiting in the
> wings?
>
> Explicitly targeted issues are marked with a *, the remaining issues are
> listed as impacting 2.1.1 and don't have a specific target version set.
>
> Since 2.1.1 continues the 2.1.0 branch, looking at 2.1.0 shows 1 open
> blocker in SQL( SPARK-19983
>  ),
>
> Query string is:
>
> affectedVersion = 2.1.0 AND cf[12310320] is EMPTY AND project = spark AND
> resolution = Unresolved AND priority = targetPriority
>
> Continui

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Nan Zhu
Congratulations!

On Tue, Jan 24, 2017 at 4:50 PM, Hyukjin Kwon  wrote:

> Congratulations!!
>
> 2017-01-25 9:22 GMT+09:00 Takeshi Yamamuro :
>
>> Congrats!
>>
>> // maropu
>>
>> On Wed, Jan 25, 2017 at 9:20 AM, Kousuke Saruta <
>> saru...@oss.nttdata.co.jp> wrote:
>>
>>> Congrats, Burak and Holden!
>>>
>>> - Kousuke
>>>
>>> On 2017/01/25 6:36, Herman van Hövell tot Westerflier wrote:
>>>
>>> Congrats!
>>>
>>> On Tue, Jan 24, 2017 at 10:20 PM, Felix Cheung <
>>> felixcheun...@hotmail.com> wrote:
>>>
 Congrats and welcome!!


 --
 *From:* Reynold Xin 
 *Sent:* Tuesday, January 24, 2017 10:13:16 AM
 *To:* dev@spark.apache.org
 *Cc:* Burak Yavuz; Holden Karau
 *Subject:* welcoming Burak and Holden as committers

 Hi all,

 Burak and Holden have recently been elected as Apache Spark committers.

 Burak has been very active in a large number of areas in Spark,
 including linear algebra, stats/maths functions in DataFrames, Python/R
 APIs for DataFrames, dstream, and most recently Structured Streaming.

 Holden has been a long-time Spark contributor and evangelist. She has
 written a few books on Spark and has made frequent contributions to the
 Python API to improve its usability and performance.

 Please join me in welcoming the two!



>>>
>>>
>>> --
>>>
>>>
>>>
>>> Herman van Hövell
>>>
>>> Software Engineer
>>>
>>> Databricks Inc.
>>>
>>> hvanhov...@databricks.com
>>>
>>> +31 6 420 590 27
>>>
>>> databricks.com
>>>
>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Re: Welcoming Yanbo Liang as a committer

2016-06-03 Thread Nan Zhu
Congratulations !

-- 
Nan Zhu
On June 3, 2016 at 10:50:33 PM, Ted Yu (yuzhih...@gmail.com) wrote:

Congratulations, Yanbo.

On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia  wrote:
Hi all,

The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a 
super active contributor in many areas of MLlib. Please join me in welcoming 
Yanbo!

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org




Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, 

We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy 
to announce the release of XGBoost4J 
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
 a Portable Distributed XGBoost in Spark, Flink and Dataflow 

XGBoost is an optimized distributed gradient boosting library designed to be 
highly efficient, flexible and portable. XGBoost provides parallel tree 
boosting (also known as GBDT, GBM) that solves many data science problems in a 
fast and accurate way. It has been the winning solution for many machine 
learning scenarios, ranging from Machine Learning Challenges 
(https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)
 to Industrial User Cases 
(https://github.com/dmlc/xgboost/tree/master/demo#usecases) 

XGBoost4J is a new package in XGBoost aiming to provide clean Scala/Java 
APIs and seamless integration with mainstream data processing platforms 
like Apache Spark. With XGBoost4J, users can run XGBoost as a stage of a Spark 
job and build a unified pipeline from ETL to model training to data product 
serving within Spark, instead of jumping across two different systems, i.e. 
XGBoost and Spark. (Example: 
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/DistTrainWithSpark.scala)

Today, we release the first version of XGBoost4J to bring more choices to 
Spark users who are seeking solutions for building a highly efficient data 
analytics platform, and to enrich the Spark ecosystem. We will keep moving forward 
to integrate with more features of Spark. Of course, you are more than welcome 
to join us and contribute to the project!

For more details of distributed XGBoost, you can refer to the recently 
published paper: http://arxiv.org/abs/1603.02754

Best, 

-- 
Nan Zhu
http://codingcat.me



tests blocked at "don't call ssc.stop in listener"

2015-11-26 Thread Nan Zhu
Hi, all

Has anyone noticed that some of the tests just block at the test case “don't call 
ssc.stop in listener” in StreamingListenerSuite?

Examples:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46766/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46776/console


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46774/console


I originally found it in my own PR, and I thought it was a bug introduced by 
me… but later I found that the tests for PRs on different things also 
blocked at the same point…

Just filed a JIRA https://issues.apache.org/jira/browse/SPARK-12021


Best,  

--  
Nan Zhu
http://codingcat.me



Re: A proposal for Spark 2.0

2015-11-12 Thread Nan Zhu
Being specific to Parameter Server, I think the current agreement is that PS 
shall exist as a third-party library instead of a component of the core code 
base, isn’t it?

Best,  

--  
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Does anyone have ideas about machine learning? Spark is missing some features for machine 
> learning, for example the parameter server.
>  
>  
> > On Nov 12, 2015, at 05:32, Matei Zaharia  > (mailto:matei.zaha...@gmail.com)> wrote:
> >  
> > I like the idea of popping out Tachyon to an optional component too to 
> > reduce the number of dependencies. In the future, it might even be useful 
> > to do this for Hadoop, but it requires too many API changes to be worth 
> > doing now.
> >  
> > Regarding Scala 2.12, we should definitely support it eventually, but I 
> > don't think we need to block 2.0 on that because it can be added later too. 
> > Has anyone investigated what it would take to run on there? I imagine we 
> > don't need many code changes, just maybe some REPL stuff.
> >  
> > Needless to say, but I'm all for the idea of making "major" releases as 
> > undisruptive as possible in the model Reynold proposed. Keeping everyone 
> > working with the same set of releases is super important.
> >  
> > Matei
> >  
> > > On Nov 11, 2015, at 4:58 AM, Sean Owen  > > (mailto:so...@cloudera.com)> wrote:
> > >  
> > > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin  > > (mailto:r...@databricks.com)> wrote:
> > > > to the Spark community. A major release should not be very different 
> > > > from a
> > > > minor release and should not be gated based on new features. The main
> > > > purpose of a major release is an opportunity to fix things that are 
> > > > broken
> > > > in the current API and remove certain deprecated APIs (examples follow).
> > > >  
> > >  
> > >  
> > > Agree with this stance. Generally, a major release might also be a
> > > time to replace some big old API or implementation with a new one, but
> > > I don't see obvious candidates.
> > >  
> > > I wouldn't mind turning attention to 2.x sooner than later, unless
> > > there's a fairly good reason to continue adding features in 1.x to a
> > > 1.7 release. The scope as of 1.6 is already pretty darned big.
> > >  
> > >  
> > > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, 
> > > > but
> > > > it has been end-of-life.
> > > >  
> > >  
> > >  
> > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > > dropping 2.10. Otherwise it's supported for 2 more years.
> > >  
> > >  
> > > > 2. Remove Hadoop 1 support.
> > >  
> > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > > sort of 'alpha' and 'beta' releases) and even <2.6.
> > >  
> > > I'm sure we'll think of a number of other small things -- shading a
> > > bunch of stuff? reviewing and updating dependencies in light of
> > > simpler, more recent dependencies to support from Hadoop etc?
> > >  
> > > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > > Pop out any Docker stuff to another repo?
> > > Continue that same effort for EC2?
> > > Farming out some of the "external" integrations to another repo (?
> > > controversial)
> > >  
> > > See also anything marked version "2+" in JIRA.
> > >  
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > > (mailto:dev-unsubscr...@spark.apache.org)
> > > For additional commands, e-mail: dev-h...@spark.apache.org 
> > > (mailto:dev-h...@spark.apache.org)
> > >  
> >  
> >  
> >  
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> >  
>  
>  
>  
>  
>  
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> (mailto:dev-unsubscr...@spark.apache.org)
> For additional commands, e-mail: dev-h...@spark.apache.org 
> (mailto:dev-h...@spark.apache.org)
>  
>  




Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Thank you, Jie! Very nice work!

--  
Nan Zhu
http://codingcat.me


On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote:

> Correct. Your calculation is right!  
>   
> We have also been aware of that k-means performance drop. According to our 
> observation, it is caused by some unbalanced execution among different 
> tasks, even though we used the same test data between the different versions (i.e., it is not 
> caused by data skew).
>   
> And the corresponding run time information has been shared with Xiangrui. Now 
> he is also helping to identify the root cause.
>   
> Thank you && Best Regards,
> Grace (Huang Jie)
>   
> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]  
> Sent: Friday, June 26, 2015 7:59 PM
> To: Huang, Jie
> Cc: u...@spark.apache.org (mailto:u...@spark.apache.org); 
> dev@spark.apache.org (mailto:dev@spark.apache.org)
> Subject: Re: [SparkScore]Performance portal for Apache Spark - WW26  
>   
> Hi, Jie,  
>  
>   
>  
> Thank you very much for this work! Very helpful!
>  
>   
>  
> I just would like to confirm that I understand the numbers correctly: if we 
> take the running time of 1.2 release as 100s
>  
>   
>  
> 9.1% - means the running time is 109.1 s?
>  
>   
>  
> -4% - means it comes 96s?
>  
>   
>  
> If that’s the true meaning of the numbers, what happened to k-means in 
> HiBench?
>  
>   
>  
> Best,
>  
>   
>  
> --  
>  
> Nan Zhu
>  
> http://codingcat.me
>  
>   
>  
>  
> On Friday, June 26, 2015 at 7:24 AM, Huang, Jie wrote:
> > Intel® Xeon® CPU E5-2697  
> >  
>  
>   
>  
>  
>  
>  




Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Hi, Jie,  

Thank you very much for this work! Very helpful!

I just would like to confirm that I understand the numbers correctly: if we 
take the running time of 1.2 release as 100s

9.1% - means the running time is 109.1 s?

-4% - means it comes 96s?

If that’s the true meaning of the numbers, what happened to k-means in HiBench?

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, June 26, 2015 at 7:24 AM, Huang, Jie wrote:

> Intel® Xeon® CPU E5-2697  




Re: Welcoming three new committers

2015-02-03 Thread Nan Zhu
Congratulations!

--  
Nan Zhu
http://codingcat.me


On Tuesday, February 3, 2015 at 8:08 PM, Xuefeng Wu wrote:

> Congratulations! Well done.  
>  
> Yours respectfully, Xuefeng Wu (吴雪峰)
>  
> > On Feb 4, 2015, at 6:34 AM, Matei Zaharia  > (mailto:matei.zaha...@gmail.com)> wrote:
> >  
> > Hi all,
> >  
> > The PMC recently voted to add three new committers: Cheng Lian, Joseph 
> > Bradley and Sean Owen. All three have been major contributors to Spark in 
> > the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many 
> > pieces throughout Spark Core. Join me in welcoming them as committers!
> >  
> > Matei
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> >  
>  
>  
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> (mailto:dev-unsubscr...@spark.apache.org)
> For additional commands, e-mail: dev-h...@spark.apache.org 
> (mailto:dev-h...@spark.apache.org)
>  
>  




Re: missing document of several messages in actor-based receiver?

2015-01-09 Thread Nan Zhu
Hi,  

I have created the PR for these two issues

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, January 9, 2015 at 7:38 AM, Nan Zhu wrote:

> Thanks, TD,  
>  
> I just created 2 JIRAs to track these,  
>  
> https://issues.apache.org/jira/browse/SPARK-5174
>  
> https://issues.apache.org/jira/browse/SPARK-5175
>  
> Can you help assign these two JIRAs to me? I’d like to submit the 
> PRs.
>  
> Best,  
>  
> --  
> Nan Zhu
> http://codingcat.me
>  
>  
> On Friday, January 9, 2015 at 4:25 AM, Tathagata Das wrote:
>  
> > It was not really meant to be hidden. So it's essentially a case of the 
> > documentation being insufficient. This code has not gotten much attention 
> > for a while, so it could have bugs. If you find any and submit a fix for 
> > them, I am happy to take a look!
> >  
> > TD
> >  
> > On Thu, Jan 8, 2015 at 6:33 PM, Nan Zhu  > (mailto:zhunanmcg...@gmail.com)> wrote:
> > > Hi, TD and other streaming developers,
> > >  
> > > When I look at the implementation of actor-based receiver 
> > > (ActorReceiver.scala), I found that there are several messages which are 
> > > not mentioned in the document  
> > >  
> > > case props: Props =>
> > > val worker = context.actorOf(props)
> > > logInfo("Started receiver worker at:" + worker.path)
> > > sender ! worker
> > >  
> > > case (props: Props, name: String) =>
> > > val worker = context.actorOf(props, name)
> > > logInfo("Started receiver worker at:" + worker.path)
> > > sender ! worker
> > >  
> > > case _: PossiblyHarmful => hiccups.incrementAndGet()
> > >  
> > > case _: Statistics =>
> > > val workers = context.children
> > > sender ! Statistics(n.get, workers.size, hiccups.get, 
> > > workers.mkString("\n"))
> > >  
> > > Is it hidden intentionally, is the document just incomplete, or did I miss something?
> > > And are the handlers of these messages “buggy"? e.g. when we start a new 
> > > worker, we didn’t increase n (the counter of children), and n and hiccups are 
> > > unnecessarily set to AtomicInteger?
> > >  
> > > Best,
> > >  
> > > --  
> > > Nan Zhu
> > > http://codingcat.me
> > >  
> > >  
> >  
> >  
> >  
>  



Re: missing document of several messages in actor-based receiver?

2015-01-09 Thread Nan Zhu
Thanks, TD,  

I just created 2 JIRAs to track these,  

https://issues.apache.org/jira/browse/SPARK-5174

https://issues.apache.org/jira/browse/SPARK-5175

Can you help assign these two JIRAs to me? I’d like to submit the PRs.

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, January 9, 2015 at 4:25 AM, Tathagata Das wrote:

> It was not really meant to be hidden. So it's essentially a case of the 
> documentation being insufficient. This code has not gotten much attention for 
> a while, so it could have bugs. If you find any and submit a fix for them, 
> I am happy to take a look!
>  
> TD
>  
> On Thu, Jan 8, 2015 at 6:33 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > Hi, TD and other streaming developers,
> >  
> > When I look at the implementation of actor-based receiver 
> > (ActorReceiver.scala), I found that there are several messages which are 
> > not mentioned in the document  
> >  
> > case props: Props =>
> > val worker = context.actorOf(props)
> > logInfo("Started receiver worker at:" + worker.path)
> > sender ! worker
> >  
> > case (props: Props, name: String) =>
> > val worker = context.actorOf(props, name)
> > logInfo("Started receiver worker at:" + worker.path)
> > sender ! worker
> >  
> > case _: PossiblyHarmful => hiccups.incrementAndGet()
> >  
> > case _: Statistics =>
> > val workers = context.children
> > sender ! Statistics(n.get, workers.size, hiccups.get, 
> > workers.mkString("\n"))
> >  
> > Is it hidden intentionally, is the document just incomplete, or did I miss something?
> > And are the handlers of these messages “buggy"? e.g. when we start a new 
> > worker, we didn’t increase n (the counter of children), and n and hiccups are 
> > unnecessarily set to AtomicInteger?
> >  
> > Best,
> >  
> > --  
> > Nan Zhu
> > http://codingcat.me
> >  
> >  
>  
>  
>  



missing document of several messages in actor-based receiver?

2015-01-08 Thread Nan Zhu
Hi, TD and other streaming developers,

When I looked at the implementation of the actor-based receiver 
(ActorReceiver.scala), I found that there are several messages which are not 
mentioned in the documentation:

case props: Props =>
  val worker = context.actorOf(props)
  logInfo("Started receiver worker at:" + worker.path)
  sender ! worker

case (props: Props, name: String) =>
  val worker = context.actorOf(props, name)
  logInfo("Started receiver worker at:" + worker.path)
  sender ! worker

case _: PossiblyHarmful => hiccups.incrementAndGet()

case _: Statistics =>
  val workers = context.children
  sender ! Statistics(n.get, workers.size, hiccups.get, workers.mkString("\n"))

Is it hidden intentionally, is the document just incomplete, or did I miss something?
And are the handlers of these messages “buggy"? e.g. when we start a new worker, 
we didn’t increase n (the counter of children), and n and hiccups are unnecessarily 
set to AtomicInteger?
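
For illustration, keeping the child counter in step when a worker is spawned
might look like the following (a sketch only, not the actual patch):

case props: Props =>
  val worker = context.actorOf(props)
  n.incrementAndGet()  // count the newly started child worker
  logInfo("Started receiver worker at:" + worker.path)
  sender ! worker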

Best,

--  
Nan Zhu
http://codingcat.me



Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Nan Zhu
BTW, this PR https://github.com/apache/spark/pull/2524 is related to a 
blocker-level bug, 

and it is actually close to being merged (it has been reviewed for several rounds).

I would appreciate it if anyone could continue the process, 

@mateiz 

-- 
Nan Zhu
http://codingcat.me


On Thursday, November 20, 2014 at 10:17 AM, Corey Nolet wrote:

> I was actually about to post this myself - I have a complex join that could
> benefit from something like a GroupComparator vs having to do multiple
> groupBy operations. This is probably the wrong thread for a full discussion
> on this, but I didn't see a JIRA ticket for this or anything similar - any
> reasons why this would not make sense given Spark's design?
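
A minimal sketch of MR-style secondary sort along these lines, using
repartitionAndSortWithinPartitions (available since 1.2); the partitioner and
data shapes below are illustrative, not code from this thread:

import org.apache.spark.{Partitioner, SparkContext}

// Partition only on the grouping key so that all values for a key land in the
// same partition, while the shuffle sorts on the full (key, value) composite.
class GroupingPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (group: String, _) => ((group.hashCode % partitions) + partitions) % partitions
  }
}

def secondarySort(sc: SparkContext, data: Seq[(String, Int)]) =
  sc.parallelize(data)
    .map { case (k, v) => ((k, v), ()) }                          // composite key
    .repartitionAndSortWithinPartitions(new GroupingPartitioner(4))
    .map { case ((k, v), _) => (k, v) }                           // values sorted within each key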
> 
> On Thu, Nov 20, 2014 at 9:39 AM, Madhu  (mailto:ma...@madhu.com)> wrote:
> 
> > Thanks Patrick.
> > 
> > I've been testing some 1.2 features, looks good so far.
> > I have some example code that I think will be helpful for certain MR-style
> > use cases (secondary sort).
> > Can I still add that to the 1.2 documentation, or is that frozen at this
> > point?
> > 
> > 
> > 
> > -
> > --
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> > http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-2-0-Release-Preview-Posted-tp9400p9449.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com (http://Nabble.com).
> > 
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> > 
> 
> 
> 
> 




Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Nan Zhu
+1, with a question

Will these maintainers do a cleanup of the pending PRs once we start to 
apply this model? There are some patches that have been there forever but haven’t been 
merged; some of them are periodically maintained (rebase, ping, etc.), while the 
others are just phased out.

Best,  

--  
Nan Zhu


On Wednesday, November 5, 2014 at 8:33 PM, Matei Zaharia wrote:

> BTW, my own vote is obviously +1 (binding).
>  
> Matei
>  
> > On Nov 5, 2014, at 5:31 PM, Matei Zaharia  > (mailto:matei.zaha...@gmail.com)> wrote:
> >  
> > Hi all,
> >  
> > I wanted to share a discussion we've been having on the PMC list, as well 
> > as call for an official vote on it on a public list. Basically, as the 
> > Spark project scales up, we need to define a model to make sure there is 
> > still great oversight of key components (in particular internal 
> > architecture and public APIs), and to this end I've proposed implementing a 
> > maintainer model for some of these components, similar to other large 
> > projects.
> >  
> > As background on this, Spark has grown a lot since joining Apache. We've 
> > had over 80 contributors/month for the past 3 months, which I believe makes 
> > us the most active project in contributors/month at Apache, as well as over 
> > 500 patches/month. The codebase has also grown significantly, with new 
> > libraries for SQL, ML, graphs and more.
> >  
> > In this kind of large project, one common way to scale development is to 
> > assign "maintainers" to oversee key components, where each patch to that 
> > component needs to get sign-off from at least one of its maintainers. Most 
> > existing large projects do this -- at Apache, some large ones with this 
> > model are CloudStack (the second-most active project overall), Subversion, 
> > and Kafka, and other examples include Linux and Python. This is also 
> > by-and-large how Spark operates today -- most components have a de-facto 
> > maintainer.
> >  
> > IMO, adopting this model would have two benefits:
> >  
> > 1) Consistent oversight of design for that component, especially regarding 
> > architecture and API. This process would ensure that the component's 
> > maintainers see all proposed changes and consider them to fit together in a 
> > good way.
> >  
> > 2) More structure for new contributors and committers -- in particular, it 
> > would be easy to look up who’s responsible for each module and ask them for 
> > reviews, etc, rather than having patches slip between the cracks.
> >  
> > We'd like to start in a light-weight manner, where the model only 
> > applies to certain key components (e.g. scheduler, shuffle) and user-facing 
> > APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand 
> > it if we deem it useful. The specific mechanics would be as follows:
> >  
> > - Some components in Spark will have maintainers assigned to them, where 
> > one of the maintainers needs to sign off on each patch to the component.
> > - Each component with maintainers will have at least 2 maintainers.
> > - Maintainers will be assigned from the most active and knowledgeable 
> > committers on that component by the PMC. The PMC can vote to add / remove 
> > maintainers, and maintained components, through consensus.
> > - Maintainers are expected to be active in responding to patches for their 
> > components, though they do not need to be the main reviewers for them (e.g. 
> > they might just sign off on architecture / API). To prevent inactive 
> > maintainers from blocking the project, if a maintainer isn't responding in 
> > a reasonable time period (say 2 weeks), other committers can merge the 
> > patch, and the PMC will want to discuss adding another maintainer.
> >  
> > If you'd like to see examples for this model, check out the following 
> > projects:
> > - CloudStack: 
> > https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
> >  
> > <https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide>
> >   
> > - Subversion: https://subversion.apache.org/docs/community-guide/roles.html 
> > <https://subversion.apache.org/docs/community-guide/roles.html>
> >  
> > Finally, I wanted to list our current proposal for initial components and 
> > maintainers. It would be good to get feedback on other components we might 
> > add, but please note that personnel discussions (e.g. "I don't think Matei 
> > should maintain *that* component) should only happen o

Re: serialVersionUID incompatible error in class BlockManagerId

2014-10-24 Thread Nan Zhu
In my experience, there are more issues than just BlockManager when 
you try to run a Spark application whose build version is different from your 
cluster’s….  

I once tried to make a JDBC server built from branch-jdbc-1.0 run with a 
branch-1.0 cluster… no workaround exists… I just had to replace the cluster jar with 
the branch-jdbc-1.0 jar file…..
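
That said, for reference, pinning a constant serial version UID in Scala looks
like the sketch below (the class here is hypothetical, not the actual
BlockManagerId source); without an explicit value, the JVM derives one from the
class shape, so two different builds can easily disagree:

@SerialVersionUID(1L)
class WorkerId(val host: String, val port: Int) extends Serializable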

Best,  

--  
Nan Zhu


On Friday, October 24, 2014 at 9:23 PM, Josh Rosen wrote:

> Are all processes (Master, Worker, Executors, Driver) running the same Spark 
> build?  This error implies that you’re seeing protocol / binary 
> incompatibilities between your Spark driver and cluster.
>  
> Spark is API-compatibile across the 1.x series, but we don’t make binary 
> link-level compatibility guarantees: 
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy.  
> This means that your Spark driver’s runtime classpath should use the same 
> version of Spark that’s installed on your cluster.  You can compile against a 
> different API-compatible version of Spark, but the runtime versions must 
> match across all components.
>  
> To fix this issue, I’d check that you’ve run the “package” and “assembly” 
> phases and that your Spark cluster is using this updated version.
>  
> - Josh
>  
> On October 24, 2014 at 6:17:26 PM, Qiuzhuang Lian (qiuzhuang.l...@gmail.com 
> (mailto:qiuzhuang.l...@gmail.com)) wrote:
>  
> Hi,  
>  
> I updated git today and, when connecting to the Spark cluster, I got  
> the serialVersionUID incompatible error in class BlockManagerId.  
>  
> Here is the log,  
>  
> Shouldn't we give BlockManagerId a constant serialVersionUID to avoid  
> this?  
>  
> Thanks,  
> Qiuzhuang  
>  
> scala> val rdd = sc.parparallelize(1 to 100014/10/25 09:10:48 ERROR  
> Remoting: org.apache.spark.storage.BlockManagerId; local class  
> incompatible: stream classdesc serialVersionUID = 2439208141545036836,  
> local class serialVersionUID = 4657685702603429489  
> java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId;  
> local class incompatible: stream classdesc serialVersionUID =  
> 2439208141545036836, local class serialVersionUID = 4657685702603429489  
> at  
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)  
> at  
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)  
> at  
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)  
> at  
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)  
> at  
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)  
> at  
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)  
> at  
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)  
> at  
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)  
> at  
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)  
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)  
> at  
> akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)  
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)  
> at  
> akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)  
> at  
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
>   
> at scala.util.Try$.apply(Try.scala:161)  
> at  
> akka.serialization.Serialization.deserialize(Serialization.scala:98)  
> at  
> akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)  
> at  
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)  
> at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)  
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)  
> at  
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937) 
>  
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)  
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)  
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)  
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)  
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)  
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)  
> at  
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   
> at  
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)  
> at  
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   
> at  
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)  
> at  
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>   
&

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
I agree with Sean

I just compiled Spark core successfully with 7u71 on Mac OS X.

On Tue, Oct 21, 2014 at 1:11 PM, Josh Rosen  wrote:

> Ah, that makes sense.  I had forgotten that there was a JIRA for this:
>
> https://issues.apache.org/jira/browse/SPARK-4021
>
> On October 21, 2014 at 10:08:58 AM, Patrick Wendell (pwend...@gmail.com)
> wrote:
>
> Josh - the errors that broke our build indicated that JDK5 was being
> used. Somehow the upgrade caused our build to use a much older Java
> version. See the JIRA for more details.
>
> On Tue, Oct 21, 2014 at 10:05 AM, Josh Rosen 
> wrote:
> > I find it concerning that there's a JDK version that breaks out build,
> since
> > we're supposed to support Java 7. Is 7u71 an upgrade or downgrade from
> the
> > JDK that we used before? Is there an easy way to fix our build so that
> it
> > compiles with 7u71's stricter settings?
> >
> > I'm not sure why the "New" PRB is failing here. It was originally
> created
> > as a clone of the main pull request builder job. I checked the
> configuration
> > history and confirmed that there aren't any settings that we've
> forgotten to
> > copy over (e.g. their configurations haven't diverged), so I'm not sure
> > what's causing this.
> >
> > - Josh
> >
> > On October 21, 2014 at 6:35:39 AM, Nan Zhu (zhunanmcg...@gmail.com)
> wrote:
> >
> > weird.two buildings (one triggered by New, one triggered by Old)
> were
> > executed in the same node, amp-jenkins-slave-01, one compiles, one
> not...
> >
> > Best,
> >
> > --
> > Nan Zhu
> >
> >
> > On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:
> >
> >> seems that all PRs built by NewSparkPRBuilder suffers from 7u71, while
> >> SparkPRBuilder is working fine
> >>
> >> Best,
> >>
> >> --
> >> Nan Zhu
> >>
> >>
> >> On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
> >>
> >> > It's a new pull request builder written by Josh, integrated into our
> >> > state-of-the-art PR dashboard :)
> >> >
> >> > On 10/21/14 9:33 PM, Nan Zhu wrote:
> >> > > just curious...what is this "NewSparkPullRequestBuilder"?
> >> > >
> >> > > Best,
> >> > >
> >> > > --
> >> > > Nan Zhu
> >> > >
> >> > >
> >> > > On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
> >> > >
> >> > > >
> >> > > > Hm, seems that 7u71 comes back again. Observed similar Kinesis
> >> > > > compilation error just now:
> >> > > >
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
> >> > > >
> >> > > >
> >> > > > Checked Jenkins slave nodes, saw /usr/java/latest points to
> >> > > > jdk1.7.0_71. However, /usr/bin/javac -version says:
> >> > > >
> >> > > > >
> >> > > > > Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM
> >> > > > > Corp 2000, 2008. All rights reserved.
> >> > > > >
> >> > > >
> >> > > >
> >> > > > Which JDK is actually used by Jenkins?
> >> > > >
> >> > > >
> >> > > > Cheng
> >> > > >
> >> > > >
> >> > > > On 10/21/14 8:28 AM, shane knapp wrote:
> >> > > >
> >> > > > > ok, so earlier today i installed a 2nd JDK within jenkins
> (7u71),
> >> > > > > which fixed the SparkR build but apparently made Spark itself
> quite unhappy.
> >> > > > > i removed that JDK, triggered a build (
> >> > > > >
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
>
> >> > > > > and it compiled kinesis w/o dying a fiery death. apparently
> 7u71 is stricter
> >> > > > > when compiling. sad times. sorry about that! shane On Mon, Oct
> 20, 2014 at
> >> > > > > 5:16 PM, Patrick Wendell  (mailto:
> pwend...@gmail.com)
> >> > > > > wrote:
> >> > > > > > The failure is in the Kinesis compoent, can you reproduce
> this
> >> > > > > > if you build with -Pkinesis-asl? - Patrick On Mon, Oct 20,
> 2014 at 5:08 PM,

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
weird… two builds (one triggered by New, one triggered by Old) were executed on 
the same node, amp-jenkins-slave-01; one compiles, one does not…

Best,  

--  
Nan Zhu


On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:

> seems that all PRs built by NewSparkPRBuilder suffers from 7u71, while 
> SparkPRBuilder is working fine
>  
> Best,  
>  
> --  
> Nan Zhu
>  
>  
> On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
>  
> > It's a new pull request builder written by Josh, integrated into our 
> > state-of-the-art PR dashboard :)
> >  
> > On 10/21/14 9:33 PM, Nan Zhu wrote:
> > > just curious…what is this “NewSparkPullRequestBuilder”?  
> > >  
> > > Best,  
> > >  
> > > --   
> > > Nan Zhu
> > >  
> > >  
> > > On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
> > >  
> > > >  
> > > > Hm, seems that 7u71 comes back again. Observed similar Kinesis 
> > > > compilation error just now: 
> > > > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
> > > >  
> > > >  
> > > > Checked Jenkins slave nodes, saw /usr/java/latest points to 
> > > > jdk1.7.0_71. However, /usr/bin/javac -version says:
> > > >  
> > > > >  
> > > > > Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp 
> > > > > 2000, 2008. All rights reserved.
> > > > >  
> > > >  
> > > >  
> > > > Which JDK is actually used by Jenkins?
> > > >  
> > > >  
> > > > Cheng
> > > >  
> > > >  
> > > > On 10/21/14 8:28 AM, shane knapp wrote:
> > > >  
> > > > > ok, so earlier today i installed a 2nd JDK within jenkins (7u71), 
> > > > > which fixed the SparkR build but apparently made Spark itself quite 
> > > > > unhappy. i removed that JDK, triggered a build ( 
> > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
> > > > >  and it compiled kinesis w/o dying a fiery death. apparently 7u71 is 
> > > > > stricter when compiling. sad times. sorry about that! shane On Mon, 
> > > > > Oct 20, 2014 at 5:16 PM, Patrick Wendell  
> > > > > (mailto:pwend...@gmail.com) wrote:  
> > > > > > The failure is in the Kinesis compoent, can you reproduce this if 
> > > > > > you build with -Pkinesis-asl? - Patrick On Mon, Oct 20, 2014 at 
> > > > > > 5:08 PM, shane knapp  
> > > > > > (mailto:skn...@berkeley.edu) wrote:  
> > > > > > > hmm, strange. i'll take a look. On Mon, Oct 20, 2014 at 5:11 PM, 
> > > > > > > Nan Zhu  (mailto:zhunanmcg...@gmail.com) 
> > > > > > > wrote:  
> > > > > > > > yes, I can compile locally, too but it seems that Jenkins is 
> > > > > > > > not happy now... 
> > > > > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
> > > > > > > >  All failed to compile Best, -- Nan Zhu On Monday, October 20, 
> > > > > > > > 2014 at 7:56 PM, Ted Yu wrote:  
> > > > > > > > > I performed build on latest master branch but didn't get 
> > > > > > > > > compilation  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > > error.  
> > > > > > > > > FYI On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu 
> > > > > > > > > mailto:zhunanmcg...@gmail.com)  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > > (mailto:zhunanmcg...@gmail.com)> wrote:  
> > > > > > > > > > Hi, I just submitted a patch  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > > https://github.com/apache/spark/pull/2864/files  
> > > > > > > > > > with one line change but the Jenkins told me it's failed to 
> > > > > > > > > > compile on the unrelated  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > >  
> > > > > > files?  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > >  
> > > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
> > > > > >   
> > > > > > > > > > Best, Nan  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > >  
> > > > > >  
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> > > > ​
> > > >  
> > > >  
> > > >  
> > > >  
> > >  
> > >  
> >  
>  



Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
seems that all PRs built by NewSparkPRBuilder suffers from 7u71, while 
SparkPRBuilder is working fine

Best,  

--  
Nan Zhu


On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:

> It's a new pull request builder written by Josh, integrated into our 
> state-of-the-art PR dashboard :)
>  
> On 10/21/14 9:33 PM, Nan Zhu wrote:
> > just curious…what is this “NewSparkPullRequestBuilder”?  
> >  
> > Best,  
> >  
> > --   
> > Nan Zhu
> >  
> >  
> > On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
> >  
> > >  
> > > Hm, seems that 7u71 comes back again. Observed similar Kinesis 
> > > compilation error just now: 
> > > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
> > >  
> > >  
> > > Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. 
> > > However, /usr/bin/javac -version says:
> > >  
> > > >  
> > > > Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp 
> > > > 2000, 2008. All rights reserved.
> > > >  
> > >  
> > >  
> > > Which JDK is actually used by Jenkins?
> > >  
> > >  
> > > Cheng
> > >  
> > >  
> > > On 10/21/14 8:28 AM, shane knapp wrote:
> > >  
> > > > ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which 
> > > > fixed the SparkR build but apparently made Spark itself quite unhappy. 
> > > > i removed that JDK, triggered a build ( 
> > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
> > > >  and it compiled kinesis w/o dying a fiery death. apparently 7u71 is 
> > > > stricter when compiling. sad times. sorry about that! shane On Mon, Oct 
> > > > 20, 2014 at 5:16 PM, Patrick Wendell  
> > > > (mailto:pwend...@gmail.com) wrote:  
> > > > > The failure is in the Kinesis compoent, can you reproduce this if you 
> > > > > build with -Pkinesis-asl? - Patrick On Mon, Oct 20, 2014 at 5:08 PM, 
> > > > > shane knapp  (mailto:skn...@berkeley.edu) wrote: 
> > > > >  
> > > > > > hmm, strange. i'll take a look. On Mon, Oct 20, 2014 at 5:11 PM, 
> > > > > > Nan Zhu  (mailto:zhunanmcg...@gmail.com) 
> > > > > > wrote:  
> > > > > > > yes, I can compile locally, too but it seems that Jenkins is not 
> > > > > > > happy now... 
> > > > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
> > > > > > >  All failed to compile Best, -- Nan Zhu On Monday, October 20, 
> > > > > > > 2014 at 7:56 PM, Ted Yu wrote:  
> > > > > > > > I performed build on latest master branch but didn't get 
> > > > > > > > compilation  
> > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > > error.  
> > > > > > > > FYI On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu 
> > > > > > > > mailto:zhunanmcg...@gmail.com)  
> > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > > (mailto:zhunanmcg...@gmail.com)> wrote:  
> > > > > > > > > Hi, I just submitted a patch  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > > https://github.com/apache/spark/pull/2864/files  
> > > > > > > > > with one line change but the Jenkins told me it's failed to 
> > > > > > > > > compile on the unrelated  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > >  
> > > > > >  
> > > > > >  
> > > > >  
> > > > > files?  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > >  
> > > > >  
> > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
> > > > >   
> > > > > > > > > Best, Nan  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> > > >  
> > >  
> > >  
> > > ​
> > >  
> > >  
> > >  
> > >  
> >  
> >  
>  



Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
just curious…what is this “NewSparkPullRequestBuilder”?  

Best,  

--  
Nan Zhu


On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:

>  
> Hm, seems that 7u71 comes back again. Observed similar Kinesis compilation 
> error just now: 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
>  
>  
> Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. 
> However, /usr/bin/javac -version says:
>  
> >  
> > Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp 2000, 
> > 2008. All rights reserved.
> >  
>  
>  
> Which JDK is actually used by Jenkins?
>  
>  
> Cheng
>  
>  
> On 10/21/14 8:28 AM, shane knapp wrote:
>  
>  
>  
>  
>  
> > ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which 
> > fixed the SparkR build but apparently made Spark itself quite unhappy. i 
> > removed that JDK, triggered a build ( 
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
> >  and it compiled kinesis w/o dying a fiery death. apparently 7u71 is 
> > stricter when compiling. sad times. sorry about that! shane On Mon, Oct 20, 
> > 2014 at 5:16 PM, Patrick Wendell  
> > (mailto:pwend...@gmail.com) wrote:  
> > > The failure is in the Kinesis compoent, can you reproduce this if you 
> > > build with -Pkinesis-asl? - Patrick On Mon, Oct 20, 2014 at 5:08 PM, 
> > > shane knapp  (mailto:skn...@berkeley.edu) wrote:  
> > > > hmm, strange. i'll take a look. On Mon, Oct 20, 2014 at 5:11 PM, Nan 
> > > > Zhu  (mailto:zhunanmcg...@gmail.com) wrote:  
> > > > > yes, I can compile locally, too but it seems that Jenkins is not 
> > > > > happy now... 
> > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ 
> > > > > All failed to compile Best, -- Nan Zhu On Monday, October 20, 2014 at 
> > > > > 7:56 PM, Ted Yu wrote:  
> > > > > > I performed build on latest master branch but didn't get 
> > > > > > compilation  
> > > > > >  
> > > > >  
> > > > > error.  
> > > > > > FYI On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu 
> > > > > > mailto:zhunanmcg...@gmail.com)  
> > > > > >  
> > > > >  
> > > > > (mailto:zhunanmcg...@gmail.com)> wrote:  
> > > > > > > Hi, I just submitted a patch  
> > > > > > >  
> > > > > >  
> > > > >  
> > > > > https://github.com/apache/spark/pull/2864/files  
> > > > > > > with one line change but the Jenkins told me it's failed to 
> > > > > > > compile on the unrelated  
> > > > > > >  
> > > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> > >  
> > > files?  
> > > > > > >  
> > > > > > >  
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > >  
> > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
> > >   
> > > > > > > Best, Nan  
> > > > > > >  
> > > > > >  
> > > > > >  
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > >  
> > >  
> > >  
> >  
> >  
> >  
>  
>  
>  
>  
>  
>  
> ​
>  
>  
>  




Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread Nan Zhu
yes, I can compile locally, too 

but it seems that Jenkins is not happy now…
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ 

All failed to compile

Best, 

-- 
Nan Zhu


On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:

> I performed build on latest master branch but didn't get compilation error.
> 
> FYI
> 
> On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > Hi,
> > 
> > I just submitted a patch https://github.com/apache/spark/pull/2864/files
> > with one line change
> > 
> > but the Jenkins told me it's failed to compile on the unrelated files?
> > 
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
> > 
> > 
> > Best,
> > 
> > Nan
> 



something wrong with Jenkins or something untested merged?

2014-10-20 Thread Nan Zhu
Hi,

I just submitted a patch https://github.com/apache/spark/pull/2864/files
with a one-line change,

but Jenkins tells me it failed to compile on unrelated files?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console


Best,

Nan


Re: Unit testing Master-Worker Message Passing

2014-10-15 Thread Nan Zhu
I don’t think there are test cases for the Worker itself.


You can:


val actorRef = TestActorRef[Master](Props(classOf[Master], ...))(actorSystem) 
actorRef.underlyingActor.receive(Heartbeat)

and use expectMsg to check that the Master replies with the correct message, 
assuming the Worker side is correct.

Then, in another test case, check that the Worker sends a register message back to 
the Master after receiving the Master’s “re-register” instruction (in that test 
case, assume the Master side is correct).
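
For illustration, a minimal self-contained version of that pattern with akka-testkit. 
ToyMaster, Heartbeat and HeartbeatAck below are made-up placeholders standing in for 
the real Master actor and its messages, not Spark's actual classes; it only sketches 
the "assume one side is correct, assert on the other" approach:

import akka.actor.{Actor, ActorSystem, Props}
import akka.testkit.{ImplicitSender, TestActorRef, TestKit}
import org.scalatest.{BeforeAndAfterAll, FunSuiteLike}

// Placeholder messages and actor, not the real Spark Master/Worker protocol.
case class Heartbeat(workerId: String)
case class HeartbeatAck(workerId: String)

class ToyMaster extends Actor {
  def receive = {
    case Heartbeat(id) => sender() ! HeartbeatAck(id) // the reply path under test
  }
}

class ToyMasterSuite extends TestKit(ActorSystem("test"))
  with ImplicitSender with FunSuiteLike with BeforeAndAfterAll {

  override def afterAll(): Unit = system.shutdown() // Akka 2.2/2.3 style shutdown

  test("master acks a heartbeat, assuming the worker side is correct") {
    // TestActorRef runs the actor synchronously, so underlyingActor is inspectable
    val master = TestActorRef[ToyMaster](Props[ToyMaster])
    master ! Heartbeat("worker-1")
    expectMsg(HeartbeatAck("worker-1"))
  }
}

The second case is the same shape in reverse: drive a toy Worker with the 
“re-register” instruction and use expectMsg to assert on the register message it 
should send back.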

Best,

--  
Nan Zhu


On Wednesday, October 15, 2014 at 2:04 PM, Matthew Cheah wrote:

> Thanks, the example was helpful.
>  
> However, testing the Worker itself is a lot more complicated than 
> WorkerWatcher, since the Worker class is quite a bit more complex. Are there 
> any tests that inspect the Worker itself?
>  
> Thanks,
>  
> -Matt Cheah
>  
> On Tue, Oct 14, 2014 at 6:40 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > You can use akka testkit
> >  
> > Example:
> >  
> > https://github.com/apache/spark/blob/ef4ff00f87a4e8d38866f163f01741c2673e41da/core/src/test/scala/org/apache/spark/deploy/worker/WorkerWatcherSuite.scala
> >   
> >  
> > --  
> > Nan Zhu
> >  
> >  
> > On Tuesday, October 14, 2014 at 9:17 PM, Matthew Cheah wrote:
> >  
> > > Hi everyone,
> > >  
> > > I’m adding some new message passing between the Master and Worker actors 
> > > in
> > > order to address https://issues.apache.org/jira/browse/SPARK-3736 .
> > >  
> > > I was wondering if these kinds of interactions are tested in the automated
> > > Jenkins test suite, and if so, where I could find some examples to help me
> > > do the same.
> > >  
> > > Thanks!
> > >  
> > > -Matt Cheah  
> >  
>  



Re: Unit testing Master-Worker Message Passing

2014-10-14 Thread Nan Zhu
You can use akka testkit

Example:

https://github.com/apache/spark/blob/ef4ff00f87a4e8d38866f163f01741c2673e41da/core/src/test/scala/org/apache/spark/deploy/worker/WorkerWatcherSuite.scala
  

--  
Nan Zhu


On Tuesday, October 14, 2014 at 9:17 PM, Matthew Cheah wrote:

> Hi everyone,
>  
> I’m adding some new message passing between the Master and Worker actors in
> order to address https://issues.apache.org/jira/browse/SPARK-3736 .
>  
> I was wondering if these kinds of interactions are tested in the automated
> Jenkins test suite, and if so, where I could find some examples to help me
> do the same.
>  
> Thanks!
>  
> -Matt Cheah  



Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Nan Zhu
Great! Congratulations! 

-- 
Nan Zhu


On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:

> Brilliant stuff ! Congrats all :-)
> This is indeed really heartening news !
> 
> Regards,
> Mridul
> 
> 
> On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia  (mailto:matei.zaha...@gmail.com)> wrote:
> > Hi folks,
> > 
> > I interrupt your regularly scheduled user / dev list to bring you some 
> > pretty cool news for the project, which is that we've been able to use 
> > Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
> > faster on 10x fewer nodes. There's a detailed writeup at 
> > http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
> >  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
> > sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
> > 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
> > 
> > I want to thank Reynold Xin for leading this effort over the past few 
> > weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
> > Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
> > providing the machines to make this possible. Finally, this result would of 
> > course not be possible without the many many other contributions, testing 
> > and feature requests from throughout the community.
> > 
> > For an engine to scale from these multi-hour petabyte batch jobs down to 
> > 100-millisecond streaming and interactive queries is quite uncommon, and 
> > it's thanks to all of you folks that we are able to make this happen.
> > 
> > Matei
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> > 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread Nan Zhu
Just noticed these lines in the jenkins log 

= 
Running Apache RAT checks 
= 
Attempting to fetch rat
Launching rat from /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
Error: Invalid or corrupt jarfile /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
RAT checks passed.

Something wrong?

Best, 

-- 
Nan Zhu


On Monday, September 29, 2014 at 4:43 PM, shane knapp wrote:

> happy monday, everyone!
> 
> remember a few weeks back when i upgraded jenkins, and unwittingly began
> DOSing our system due to massive log spam?
> 
> well, that bug has been fixed w/the current release and i'd like to get our
> logging levels back to something more verbose that we have now.
> 
> downtime will be from 730am-1000am PDT (i do expect this to be done well
> before 1000am)
> 
> the update will be from 1.578 -> 1.582
> 
> changelog here: http://jenkins-ci.org/changelog
> 
> please let me know if there are any questions or concerns. thanks!
> 
> shane, your friendly devops engineer 



Re: executorAdded event to DAGScheduler

2014-09-26 Thread Nan Zhu
Just a quick reply: we cannot start two executors on the same host for a single 
application in the standard deployment (one worker per machine).

I’m not sure whether it will create an issue when you have multiple workers on the 
same host, as submitWaitingStages is called everywhere and I have never tried such a 
deployment mode.
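
For what it's worth, a tiny stand-alone sketch of the behaviour described in the 
quoted mail below; registerExecutor and notifyDagScheduler are made-up names for 
illustration only, this is not the actual TaskSchedulerImpl code:

import scala.collection.mutable

val executorsByHost = mutable.HashMap.empty[String, mutable.HashSet[String]]

// Stand-in for the DAGScheduler callback; in the real code this is executorAdded().
def notifyDagScheduler(execId: String, host: String): Unit =
  println(s"executorAdded($execId, $host)")

def registerExecutor(execId: String, host: String): Unit = {
  if (!executorsByHost.contains(host)) {
    executorsByHost(host) = mutable.HashSet.empty[String]
    notifyDagScheduler(execId, host) // fires only the first time a host is seen
  }
  executorsByHost(host) += execId
}

registerExecutor("exec-1", "host-a") // prints executorAdded(exec-1, host-a)
registerExecutor("exec-2", "host-a") // same host: no second notification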

Best,  

--  
Nan Zhu


On Friday, September 26, 2014 at 8:02 AM, praveen seluka wrote:

> Can someone explain the motivation behind passing executorAdded event to 
> DAGScheduler ? DAGScheduler does submitWaitingStages when executorAdded 
> method is called by TaskSchedulerImpl. I see some issue in the below code,
>  
> TaskSchedulerImpl.scala code
> if (!executorsByHost.contains(o.host)) {
> executorsByHost(o.host) = new HashSet[String]()
> executorAdded(o.executorId, o.host)
> newExecAvail = true
>   }
>  
>  
> Note that executorAdded is called only when there is a new host and not for 
> every new executor. For instance, there can be two executors in the same host 
> and in this case. (But DAGScheduler executorAdded is notified only for new 
> host - so only once in this case). If this is indeed an issue, I would like 
> to submit a patch for this quickly. [cc Andrew Or]
>  
> - Praveen
>  
>  



Re: do MIMA checking before all test cases start?

2014-09-24 Thread Nan Zhu
Yeah, I tried that, but there is always an issue when I run dev/mima locally:

it always gives me some binary compatibility errors in the Java API part…

so I have to wait for Jenkins’ result when fixing MIMA issues.

--  
Nan Zhu


On Thursday, September 25, 2014 at 12:04 AM, Patrick Wendell wrote:

> Have you considered running the mima checks locally? We prefer people
> not use Jenkins for very frequent checks since it takes resources away
> from other people trying to run tests.
>  
> On Wed, Sep 24, 2014 at 6:44 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > Hi, all
> >  
> > It seems that, currently, Jenkins makes MIMA checking after all test cases 
> > have finished, IIRC, during the first months we introduced MIMA, we do the 
> > MIMA checking before running test cases
> >  
> > What's the motivation to adjust this behaviour?
> >  
> > In my opinion, if you have some binary compatibility issues, you just need 
> > to do some minor changes, but in the current environment, you can only get 
> > if your change works after all test cases finished (1 hour later...)
> >  
> > Best,
> >  
> > --
> > Nan Zhu
> >  
>  
>  
>  




do MIMA checking before all test cases start?

2014-09-24 Thread Nan Zhu
Hi, all  

It seems that, currently, Jenkins runs the MIMA check after all test cases have 
finished; IIRC, during the first months after we introduced MIMA, we did the MIMA 
check before running the test cases.

What’s the motivation for changing this behaviour?

In my opinion, if you have some binary compatibility issues you usually only need 
some minor changes, but in the current setup you can only find out whether your 
change works after all test cases have finished (an hour later…)

Best,  

--  
Nan Zhu



Re: A couple questions about shared variables

2014-09-24 Thread Nan Zhu
I proposed a fix: https://github.com/apache/spark/pull/2524

Glad to receive feedback.

--  
Nan Zhu


On Tuesday, September 23, 2014 at 9:06 PM, Sandy Ryza wrote:

> Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these 
> nuances.
>  
> -Sandy
>  
> On Mon, Sep 22, 2014 at 10:36 AM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > I see, thanks for pointing this out  
> >  
> >  
> > --  
> > Nan Zhu
> >  
> >  
> > On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:
> >  
> > > MapReduce counters do not count duplications.  In MapReduce, if a task 
> > > needs to be re-run, the value of the counter from the second task 
> > > overwrites the value from the first task.
> > >  
> > > -Sandy
> > >  
> > > On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu  > > (mailto:zhunanmcg...@gmail.com)> wrote:
> > > > If you think it as necessary to fix, I would like to resubmit that PR 
> > > > (seems to have some conflicts with the current DAGScheduler)  
> > > >  
> > > > My suggestion is to make it as an option in accumulator, e.g. some 
> > > > algorithms utilizing accumulator for result calculation, it needs a 
> > > > deterministic accumulator, while others implementing something like 
> > > > Hadoop counters may need the current implementation (count everything 
> > > > happened, including the duplications)
> > > >  
> > > > Your thoughts?  
> > > >  
> > > > --  
> > > > Nan Zhu
> > > >  
> > > >  
> > > > On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
> > > >  
> > > > > Hmm, good point, this seems to have been broken by refactorings of 
> > > > > the scheduler, but it worked in the past. Basically the solution is 
> > > > > simple -- in a result stage, we should not apply the update for each 
> > > > > task ID more than once -- the same way we don't call 
> > > > > job.listener.taskSucceeded more than once. Your PR also tried to 
> > > > > avoid this for resubmitted shuffle stages, but I don't think we need 
> > > > > to do that necessarily (though we could).
> > > > >  
> > > > > Matei  
> > > > >  
> > > > > On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
> > > > > (mailto:zhunanmcg...@gmail.com)) wrote:
> > > > >  
> > > > > > Hi, Matei,  
> > > > > >  
> > > > > > Can you give some hint on how the current implementation guarantee 
> > > > > > the accumulator is only applied for once?  
> > > > > >  
> > > > > > There is a pending PR trying to achieving this 
> > > > > > (https://github.com/apache/spark/pull/228/files), but from the 
> > > > > > current implementation, I didn’t see this has been done? (maybe I 
> > > > > > missed something)  
> > > > > >  
> > > > > > Best,  
> > > > > >  
> > > > > > --   
> > > > > > Nan Zhu
> > > > > >  
> > > > > >  
> > > > > > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
> > > > > >  
> > > > > > > Hey Sandy,
> > > > > > >  
> > > > > > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza 
> > > > > > > (sandy.r...@cloudera.com (mailto:sandy.r...@cloudera.com)) wrote: 
> > > > > > >  
> > > > > > >  
> > > > > > > Hey All,   
> > > > > > >  
> > > > > > > A couple questions came up about shared variables recently, and I 
> > > > > > > wanted to   
> > > > > > > confirm my understanding and update the doc to be a little more 
> > > > > > > clear.  
> > > > > > >  
> > > > > > > *Broadcast variables*   
> > > > > > > Now that tasks data is automatically broadcast, the only 
> > > > > > > occasions where it  
> > > > > > > makes sense to explicitly broadcast are:  
> > > > > > > * You want to use a variable from tasks in multiple stages.  
> > > > > > > * You want to have the variable stored on the executors in 

Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
I see, thanks for pointing this out  


--  
Nan Zhu


On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:

> MapReduce counters do not count duplications.  In MapReduce, if a task needs 
> to be re-run, the value of the counter from the second task overwrites the 
> value from the first task.
>  
> -Sandy
>  
> On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > If you think it as necessary to fix, I would like to resubmit that PR 
> > (seems to have some conflicts with the current DAGScheduler)  
> >  
> > My suggestion is to make it as an option in accumulator, e.g. some 
> > algorithms utilizing accumulator for result calculation, it needs a 
> > deterministic accumulator, while others implementing something like Hadoop 
> > counters may need the current implementation (count everything happened, 
> > including the duplications)
> >  
> > Your thoughts?  
> >  
> > --  
> > Nan Zhu
> >  
> >  
> > On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
> >  
> > > Hmm, good point, this seems to have been broken by refactorings of the 
> > > scheduler, but it worked in the past. Basically the solution is simple -- 
> > > in a result stage, we should not apply the update for each task ID more 
> > > than once -- the same way we don't call job.listener.taskSucceeded more 
> > > than once. Your PR also tried to avoid this for resubmitted shuffle 
> > > stages, but I don't think we need to do that necessarily (though we 
> > > could).
> > >  
> > > Matei  
> > >  
> > > On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
> > > (mailto:zhunanmcg...@gmail.com)) wrote:
> > >  
> > > > Hi, Matei,  
> > > >  
> > > > Can you give some hint on how the current implementation guarantee the 
> > > > accumulator is only applied for once?  
> > > >  
> > > > There is a pending PR trying to achieving this 
> > > > (https://github.com/apache/spark/pull/228/files), but from the current 
> > > > implementation, I didn’t see this has been done? (maybe I missed 
> > > > something)  
> > > >  
> > > > Best,  
> > > >  
> > > > --   
> > > > Nan Zhu
> > > >  
> > > >  
> > > > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
> > > >  
> > > > > Hey Sandy,
> > > > >  
> > > > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza 
> > > > > (sandy.r...@cloudera.com (mailto:sandy.r...@cloudera.com)) wrote:  
> > > > >  
> > > > > Hey All,   
> > > > >  
> > > > > A couple questions came up about shared variables recently, and I 
> > > > > wanted to   
> > > > > confirm my understanding and update the doc to be a little more 
> > > > > clear.  
> > > > >  
> > > > > *Broadcast variables*   
> > > > > Now that tasks data is automatically broadcast, the only occasions 
> > > > > where it  
> > > > > makes sense to explicitly broadcast are:  
> > > > > * You want to use a variable from tasks in multiple stages.  
> > > > > * You want to have the variable stored on the executors in 
> > > > > deserialized  
> > > > > form.  
> > > > > * You want tasks to be able to modify the variable and have those  
> > > > > modifications take effect for other tasks running on the same 
> > > > > executor  
> > > > > (usually a very bad idea).  
> > > > >  
> > > > > Is that right?   
> > > > > Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> > > > > matters. (We might later factor tasks in a different way to avoid 2, 
> > > > > but it's hard due to things like Hadoop JobConf objects in the tasks).
> > > > >  
> > > > >  
> > > > > *Accumulators*   
> > > > > Values are only counted for successful tasks. Is that right? KMeans 
> > > > > seems  
> > > > > to use it in this way. What happens if a node goes away and 
> > > > > successful  
> > > > > tasks need to be resubmitted? Or the stage runs again because a 
> > > > > different  
> > > > > job needed it.  
> > > > > Accumulators are guaranteed to give a deterministic result if you 
> > > > > only increment them in actions. For each result stage, the 
> > > > > accumulator's update from each task is only applied once, even if 
> > > > > that task runs multiple times. If you use accumulators in 
> > > > > transformations (i.e. in a stage that may be part of multiple jobs), 
> > > > > then you may see multiple updates, from each run. This is kind of 
> > > > > confusing but it was useful for people who wanted to use these for 
> > > > > debugging.
> > > > >  
> > > > > Matei  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > thanks,   
> > > > > Sandy  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> >  
>  



Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
If you think it is necessary to fix, I would like to resubmit that PR (it seems to 
have some conflicts with the current DAGScheduler).

My suggestion is to make it an option on the accumulator: some algorithms use an 
accumulator for result calculation and need a deterministic accumulator, while 
others implementing something like Hadoop counters may need the current behaviour 
(count everything that happened, including the duplications).

Your thoughts?  
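
For context, a small sketch of the two usage patterns being contrasted here, 
assuming a running SparkContext named sc (e.g. in spark-shell); the numbers are 
arbitrary and the re-run behaviour is exactly what Matei describes in the quoted 
mail below:

// Pattern 1: increment inside an action. Per the quoted explanation, each
// result-stage task's update should be applied exactly once, even on retries.
val actionAcc = sc.accumulator(0)
sc.parallelize(1 to 1000, 10).foreach(_ => actionAcc += 1)

// Pattern 2: increment inside a transformation, Hadoop-counter style. If the
// stage is recomputed (e.g. the RDD is not cached and is used by two jobs),
// the updates are applied again on each recomputation.
val transformAcc = sc.accumulator(0)
val mapped = sc.parallelize(1 to 1000, 10).map { x => transformAcc += 1; x * 2 }
mapped.count()
mapped.count() // recomputes the map, so transformAcc keeps growing

println(s"action acc = ${actionAcc.value}, transform acc = ${transformAcc.value}")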

--  
Nan Zhu


On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:

> Hmm, good point, this seems to have been broken by refactorings of the 
> scheduler, but it worked in the past. Basically the solution is simple -- in 
> a result stage, we should not apply the update for each task ID more than 
> once -- the same way we don't call job.listener.taskSucceeded more than once. 
> Your PR also tried to avoid this for resubmitted shuffle stages, but I don't 
> think we need to do that necessarily (though we could).
>  
> Matei  
>  
> On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
> (mailto:zhunanmcg...@gmail.com)) wrote:
>  
> > Hi, Matei,  
> >  
> > Can you give some hint on how the current implementation guarantee the 
> > accumulator is only applied for once?  
> >  
> > There is a pending PR trying to achieving this 
> > (https://github.com/apache/spark/pull/228/files), but from the current 
> > implementation, I didn’t see this has been done? (maybe I missed something) 
> >  
> >  
> > Best,  
> >  
> > --   
> > Nan Zhu
> >  
> >  
> > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
> >  
> > > Hey Sandy,
> > >  
> > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com 
> > > (mailto:sandy.r...@cloudera.com)) wrote:  
> > >  
> > > Hey All,   
> > >  
> > > A couple questions came up about shared variables recently, and I wanted 
> > > to   
> > > confirm my understanding and update the doc to be a little more clear.  
> > >  
> > > *Broadcast variables*   
> > > Now that tasks data is automatically broadcast, the only occasions where 
> > > it  
> > > makes sense to explicitly broadcast are:  
> > > * You want to use a variable from tasks in multiple stages.  
> > > * You want to have the variable stored on the executors in deserialized  
> > > form.  
> > > * You want tasks to be able to modify the variable and have those  
> > > modifications take effect for other tasks running on the same executor  
> > > (usually a very bad idea).  
> > >  
> > > Is that right?   
> > > Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> > > matters. (We might later factor tasks in a different way to avoid 2, but 
> > > it's hard due to things like Hadoop JobConf objects in the tasks).
> > >  
> > >  
> > > *Accumulators*   
> > > Values are only counted for successful tasks. Is that right? KMeans seems 
> > >  
> > > to use it in this way. What happens if a node goes away and successful  
> > > tasks need to be resubmitted? Or the stage runs again because a different 
> > >  
> > > job needed it.  
> > > Accumulators are guaranteed to give a deterministic result if you only 
> > > increment them in actions. For each result stage, the accumulator's 
> > > update from each task is only applied once, even if that task runs 
> > > multiple times. If you use accumulators in transformations (i.e. in a 
> > > stage that may be part of multiple jobs), then you may see multiple 
> > > updates, from each run. This is kind of confusing but it was useful for 
> > > people who wanted to use these for debugging.
> > >  
> > > Matei  
> > >  
> > >  
> > >  
> > >  
> > >  
> > > thanks,   
> > > Sandy  
> > >  
> > >  
> > >  
> >  
> >  



Re: A couple questions about shared variables

2014-09-21 Thread Nan Zhu
Hi, Matei,   

Can you give some hint on how the current implementation guarantees that an 
accumulator update is only applied once?

There is a pending PR trying to achieve this 
(https://github.com/apache/spark/pull/228/files), but from the current 
implementation I don’t see that this has been done (maybe I missed something).

Best,  

--  
Nan Zhu


On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:

> Hey Sandy,
>  
> On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com 
> (mailto:sandy.r...@cloudera.com)) wrote:
>  
> Hey All,  
>  
> A couple questions came up about shared variables recently, and I wanted to  
> confirm my understanding and update the doc to be a little more clear.  
>  
> *Broadcast variables*  
> Now that tasks data is automatically broadcast, the only occasions where it  
> makes sense to explicitly broadcast are:  
> * You want to use a variable from tasks in multiple stages.  
> * You want to have the variable stored on the executors in deserialized  
> form.  
> * You want tasks to be able to modify the variable and have those  
> modifications take effect for other tasks running on the same executor  
> (usually a very bad idea).  
>  
> Is that right?  
> Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> matters. (We might later factor tasks in a different way to avoid 2, but it's 
> hard due to things like Hadoop JobConf objects in the tasks).
>  
>  
> *Accumulators*  
> Values are only counted for successful tasks. Is that right? KMeans seems  
> to use it in this way. What happens if a node goes away and successful  
> tasks need to be resubmitted? Or the stage runs again because a different  
> job needed it.  
> Accumulators are guaranteed to give a deterministic result if you only 
> increment them in actions. For each result stage, the accumulator's update 
> from each task is only applied once, even if that task runs multiple times. 
> If you use accumulators in transformations (i.e. in a stage that may be part 
> of multiple jobs), then you may see multiple updates, from each run. This is 
> kind of confusing but it was useful for people who wanted to use these for 
> debugging.
>  
> Matei
>  
>  
>  
>  
>  
> thanks,  
> Sandy  
>  
>  




Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
 
sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
scala.collection.immutable.$colon$colon.readObject(List.scala:362) at 
sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) 
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:744)



--  
Nan Zhu


On Thursday, September 11, 2014 at 10:42 AM, Nan Zhu wrote:

> Hi,   
>  
> Can you attach more logs to see if there is some entry from ContextCleaner?
>  
> I met very similar issue before…but haven’t get resolved  
>  
> Best,  
>  
> --  
> Nan Zhu
>  
>  
> On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:
>  
> > Dear All,  
> >  
> > Not sure if this is a false alarm. But wanted to raise to this to 
> > understand what is happening.  
> >  
> > I am testing the Kafka Receiver which I have written 
> > (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low 
> > level Kafka Consumer implemented custom Receivers for every Kafka topic 
> > partitions and pulling data in parallel. Individual streams from all topic 
> > partitions are then merged to create Union stream which used for further 
> > processing.
> >  
> > The custom Receiver working fine in normal load with no issues. But when I 
> > tested this with huge amount of backlog messages from Kafka ( 50 million + 
> > messages), I see couple of major issue in Spark Streaming. Wanted to get 
> > some opinion on this
> >  
> > I am using latest Spark 1.1 taken from the source and built it. Running in 
> > Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode.
> >  
> > Below are two main question I have..
> >  
> > 1. What I am seeing when I run the Spark Streaming with my Kafka Consumer 
> > with a huge backlog in Kafka ( around 50 Million), Spark is completely busy 
> > performing the Receiving task and hardly schedule any processing task. Can 
> > you let me if this is expected ? If there is large 

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
Hi,   

Can you attach more logs to see if there is some entry from ContextCleaner?

I ran into a very similar issue before… but it hasn’t been resolved.

Best,  

--  
Nan Zhu


On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:

> Dear All,  
>  
> Not sure if this is a false alarm. But wanted to raise to this to understand 
> what is happening.  
>  
> I am testing the Kafka Receiver which I have written 
> (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low 
> level Kafka Consumer implemented custom Receivers for every Kafka topic 
> partitions and pulling data in parallel. Individual streams from all topic 
> partitions are then merged to create Union stream which used for further 
> processing.
>  
> The custom Receiver working fine in normal load with no issues. But when I 
> tested this with huge amount of backlog messages from Kafka ( 50 million + 
> messages), I see couple of major issue in Spark Streaming. Wanted to get some 
> opinion on this
>  
> I am using latest Spark 1.1 taken from the source and built it. Running in 
> Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode.
>  
> Below are two main question I have..
>  
> 1. What I am seeing when I run the Spark Streaming with my Kafka Consumer 
> with a huge backlog in Kafka ( around 50 Million), Spark is completely busy 
> performing the Receiving task and hardly schedule any processing task. Can 
> you let me if this is expected ? If there is large backlog, Spark will take 
> long time pulling them . Why Spark not doing any processing ? Is it because 
> of resource limitation ( say all cores are busy puling ) or it is by design ? 
> I am setting the executor-memory to 10G and driver-memory to 4G .
>  
> 2. This issue seems to be more serious. I have attached the Driver trace with 
> this email. What I can see very frequently Block are selected to be 
> Removed...This kind of entries are all over the place. But when a Block is 
> removed , below problem happen May be this issue cause the issue 1 that 
> no Jobs are getting processed ..
>  
>  
> INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping
> INFO : org.apache.spark.storage.BlockManager - Dropping block 
> input-0-1410443074600 from memory
> INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of 
> size 12651900 dropped from memory (free 21220667)
> INFO : org.apache.spark.storage.BlockManagerInfo - Removed 
> input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 
> (http://ip-10-252-5-113.asskickery.us:53752) in memory (size: 12.1 MB, free: 
> 100.6 MB)
>  
> ...
>  
> INFO : org.apache.spark.storage.BlockManagerInfo - Removed 
> input-0-1410443074600 on ip-10-252-5-62.asskickery.us:37033 
> (http://ip-10-252-5-62.asskickery.us:37033) in memory (size: 12.1 MB, free: 
> 154.6 MB)
> ..
>  
>  
> WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 7.0 
> (TID 118, ip-10-252-5-62.asskickery.us 
> (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not 
> compute split, block input-0-1410443074600 not found
>  
> ...
>  
> INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 0.1 in stage 7.0 
> (TID 126) on executor ip-10-252-5-62.asskickery.us 
> (http://ip-10-252-5-62.asskickery.us): java.lang.Exception (Could not compute 
> split, block input-0-1410443074600 not found) [duplicate 1]
>  
>  
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 
> (TID 139, ip-10-252-5-62.asskickery.us 
> (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not 
> compute split, block input-0-1410443074600 not found
> org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
>  
> 

Re: jenkins failed all tests?

2014-09-07 Thread Nan Zhu
It seems that I’m not the only one   

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

Best,  

--  
Nan Zhu


On Sunday, September 7, 2014 at 4:52 PM, Nan Zhu wrote:

> Hi, Sean,  
>  
> Thanks for the reply
>  
> Here are the updated files:
>  
> https://github.com/apache/spark/pull/2312/files  
>  
> just two md files...
>  
> Best,  
>  
> --  
> Nan Zhu
>  
>  
> On Sunday, September 7, 2014 at 4:30 PM, Sean Owen wrote:
>  
> > It would help to point to your change. Are you sure it was only docs
> > and are you sure you're rebased, submitting against the right branch?
> > Jenkins is saying you are changing public APIs; it's not reporting
> > test failures. But it could well be a test/Jenkins problem.
> >  
> > On Sun, Sep 7, 2014 at 8:39 PM, Nan Zhu  > (mailto:zhunanmcg...@gmail.com)> wrote:
> > > Hi, all
> > >  
> > > I just modified some document,
> > >  
> > > but still failed to pass tests?
> > >  
> > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull
> > >  
> > > Anyone can look at the problem?
> > >  
> > > Best,
> > >  
> > > --
> > > Nan Zhu
> > >  
> >  
> >  
> >  
> >  
>  
>  



Re: jenkins failed all tests?

2014-09-07 Thread Nan Zhu
Hi, Sean, 

Thanks for the reply

Here are the updated files:

https://github.com/apache/spark/pull/2312/files 

just two md files...

Best, 

-- 
Nan Zhu


On Sunday, September 7, 2014 at 4:30 PM, Sean Owen wrote:

> It would help to point to your change. Are you sure it was only docs
> and are you sure you're rebased, submitting against the right branch?
> Jenkins is saying you are changing public APIs; it's not reporting
> test failures. But it could well be a test/Jenkins problem.
> 
> On Sun, Sep 7, 2014 at 8:39 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > Hi, all
> > 
> > I just modified some document,
> > 
> > but still failed to pass tests?
> > 
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull
> > 
> > Anyone can look at the problem?
> > 
> > Best,
> > 
> > --
> > Nan Zhu
> > 
> 
> 
> 




jenkins failed all tests?

2014-09-07 Thread Nan Zhu
Hi, all 

I just modified some documentation,

but it still failed to pass the tests?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull

Anyone can look at the problem?

Best, 

-- 
Nan Zhu



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Nan Zhu
+1. Tested the Thrift server with our in-house application; everything works fine.

-- 
Nan Zhu


On Wednesday, September 3, 2014 at 4:43 PM, Matei Zaharia wrote:

> +1
> 
> Matei
> 
> On September 3, 2014 at 12:24:32 PM, Cheng Lian (lian.cs@gmail.com 
> (mailto:lian.cs@gmail.com)) wrote:
> 
> +1. 
> 
> Tested locally on OSX 10.9, built with Hadoop 2.4.1 
> 
> - Checked Datanucleus jar files 
> - Tested Spark SQL Thrift server and CLI under local mode and standalone 
> cluster against MySQL backed metastore 
> 
> 
> 
> On Wed, Sep 3, 2014 at 11:25 AM, Josh Rosen  (mailto:rosenvi...@gmail.com)> wrote: 
> 
> > +1. Tested on Windows and EC2. Confirmed that the EC2 pvm->hvm switch 
> > fixed the SPARK-3358 regression. 
> > 
> > 
> > On September 3, 2014 at 10:33:45 AM, Marcelo Vanzin (van...@cloudera.com 
> > (mailto:van...@cloudera.com)) 
> > wrote: 
> > 
> > +1 (non-binding) 
> > 
> > - checked checksums of a few packages 
> > - ran few jobs against yarn client/cluster using hadoop2.3 package 
> > - played with spark-shell in yarn-client mode 
> > 
> > On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell  > (mailto:pwend...@gmail.com)> 
> > wrote: 
> > > Please vote on releasing the following candidate as Apache Spark version 
> > 
> > 1.1.0! 
> > > 
> > > The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd): 
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
> >  
> > > 
> > > The release files, including signatures, digests, etc. can be found at: 
> > > http://people.apache.org/~pwendell/spark-1.1.0-rc4/ 
> > > 
> > > Release artifacts are signed with the following key: 
> > > https://people.apache.org/keys/committer/pwendell.asc 
> > > 
> > > The staging repository for this release can be found at: 
> > > https://repository.apache.org/content/repositories/orgapachespark-1031/ 
> > > 
> > > The documentation corresponding to this release can be found at: 
> > > http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/ 
> > > 
> > > Please vote on releasing this package as Apache Spark 1.1.0! 
> > > 
> > > The vote is open until Saturday, September 06, at 08:30 UTC and passes if 
> > > a majority of at least 3 +1 PMC votes are cast. 
> > > 
> > > [ ] +1 Release this package as Apache Spark 1.1.0 
> > > [ ] -1 Do not release this package because ... 
> > > 
> > > To learn more about Apache Spark, please see 
> > > http://spark.apache.org/ 
> > > 
> > > == Regressions fixed since RC3 == 
> > > SPARK-3332 - Issue with tagging in EC2 scripts 
> > > SPARK-3358 - Issue with regression for m3.XX instances 
> > > 
> > > == What justifies a -1 vote for this release? == 
> > > This vote is happening very late into the QA period compared with 
> > > previous votes, so -1 votes should only occur for significant 
> > > regressions from 1.0.2. Bugs already present in 1.0.X will not block 
> > > this release. 
> > > 
> > > == What default changes should I be aware of? == 
> > > 1. The default value of "spark.io.compression.codec" is now "snappy" 
> > > --> Old behavior can be restored by switching to "lzf" 
> > > 
> > > 2. PySpark now performs external spilling during aggregations. 
> > > --> Old behavior can be restored by setting "spark.shuffle.spill" to 
> > > 
> > 
> > "false". 
> > > 
> > > 3. PySpark uses a new heuristic for determining the parallelism of 
> > > shuffle operations. 
> > > --> Old behavior can be restored by setting 
> > > "spark.default.parallelism" to the number of cores in the cluster. 
> > > 
> > > - 
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > > (mailto:dev-unsubscr...@spark.apache.org) 
> > > For additional commands, e-mail: dev-h...@spark.apache.org 
> > > (mailto:dev-h...@spark.apache.org) 
> > > 
> > 
> > 
> > 
> > 
> > -- 
> > Marcelo 
> > 
> > - 
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org) 
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org) 
> > 
> 
> 
> 




Re: new JDBC server test cases seems failed ?

2014-07-27 Thread Nan Zhu
It was 20 minutes ago:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17259/consoleFull
  

--  
Nan Zhu


On Sunday, July 27, 2014 at 8:53 PM, Michael Armbrust wrote:

> How recent is this? We've already reverted this patch once due to failing
> tests. It would be helpful to include a link to the failed build. If its
> failing again we'll have to revert again.
>  
>  
> On Sun, Jul 27, 2014 at 5:26 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
>  
> > Hi, all
> >  
> > It seems that the JDBC test cases are failed unexpectedly in Jenkins?
> >  
> >  
> > [info] - test query execution against a Hive Thrift server *** FAILED ***
> > [info] java.sql.SQLException: Could not open connection to
> > jdbc:hive2://localhost:45518/: java.net.ConnectException: Connection
> > refused [info] at
> > org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146)
> > [info] at
> > org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:123) [info]
> > at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at
> > java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at
> > java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at
> > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
> > [info] at
> > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
> > [info] at
> > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:110)
> > [info] at org.apache.spark.sql.hive.thri
> > ftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
> > [info] at
> > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
> > [info] ... [info] Cause: org.apache.thrift.transport.TTransportException:
> > java.net.ConnectException: Connection refused [info] at
> > org.apache.thrift.transport.TSocket.open(TSocket.java:185) [info] at
> > org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
> > [info] at
> > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> > [info] at
> > org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
> > [info] at
> > org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:123) [info]
> > at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at
> > java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at
> > java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at
> > org.apache.spark.sql.hive.thriftserver.H
> > iveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
> > [info] at
> > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
> > [info] ... [info] Cause: java.net.ConnectException: Connection refused
> > [info] at java.net.PlainSocketImpl.socketConnect(Native Method) [info] at
> > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> > [info] at
> > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> > [info] at
> > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> > [info] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [info]
> > at java.net.Socket.connect(Socket.java:579) [info] at
> > org.apache.thrift.transport.TSocket.open(TSocket.java:180) [info] at
> > org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
> > [info] at
> > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> > [info] at org.apache.hive.jdbc.HiveConn
> > ection.openTransport(HiveConnection.java:144) [info] ... [info] CliSuite:
> > Executing: create table hive_test1(key int, val string);, expecting output:
> > OK [warn] four warnings found [warn] Note:
> > /home/jenkins/workspace/SparkPullRequestBuilder@4/core/src/test/java/org/apache/spark/JavaAPISuite.java
> > uses or overrides a deprecated API. [warn] Note: Recompile with
> > -Xlint:deprecation for details. [info] - simple commands *** FAILED ***
> > [info] java.lang.AssertionError: assertion failed: Didn't find "OK" in the
> > output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at
> > org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70)
> > [info] at
> > org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery

new JDBC server test cases seems failed ?

2014-07-27 Thread Nan Zhu
 
tests run: 2 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 0, 
failed 2, canceled 0, ignored 0, pending 0 [info] *** 2 TESTS FAILED ***

Best, 

-- 
Nan Zhu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: branch-1.1 will be cut on Friday

2014-07-27 Thread Nan Zhu
Good news, we will see an official version containing JDBC very soon! 

Also, I have several pending PRs; can anyone continue the review process 
this week?

Avoid overwriting already-set SPARK_HOME in spark-submit: 
https://github.com/apache/spark/pull/1331

fix locality inversion bug in TaskSetManager: 
https://github.com/apache/spark/pull/1313 (Matei and Mridulm are working on it)

Allow multiple executor per worker in Standalone mode: 
https://github.com/apache/spark/pull/731 

Ensure actor is self-contained  in DAGScheduler: 
https://github.com/apache/spark/pull/637

Best, 

-- 
Nan Zhu


On Sunday, July 27, 2014 at 2:31 PM, Patrick Wendell wrote:

> Hey All,
> 
> Just a heads up, we'll cut branch-1.1 on this Friday, August 1st. Once
> the release branch is cut we'll start community QA and go into the
> normal triage process for merging patches into that branch.
> 
> For Spark core, we'll be conservative in merging things past the
> freeze date (e.g. high priority fixes) to ensure a healthy amount of
> time for testing. A key focus of this release in core is improving
> overall stability and resilience of Spark core.
> 
> As always, I'll encourage of committers/contributors to help review
> patches this week to so we can get as many things in as possible.
> People have been quite active recently, which is great!
> 
> Good luck!
> - Patrick
> 
> 




spark.executor.memory is not applicable when running unit test in Jenkins?

2014-07-21 Thread Nan Zhu
Hi, all  

I’m running some unit tests for my Spark applications in Jenkins.

It seems that even though I set spark.executor.memory to 5g, the value I get from 
Runtime.getRuntime.maxMemory is still around 1G.

Is it saying that Jenkins limits the process to no more than 1G (by 
default)? How can I change that?
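For what it's worth: in local mode the executor runs inside the same JVM as the test, so 
spark.executor.memory does not grow the heap; what Runtime.getRuntime.maxMemory reports is the 
-Xmx of the test JVM itself. A minimal build.sbt sketch, assuming the tests run under sbt with a 
local master and forked test JVMs (the 5g value is illustrative):

  // build.sbt -- a sketch only
  fork in Test := true                 // run tests in a separate JVM so javaOptions take effect
  javaOptions in Test += "-Xmx5g"      // this, not spark.executor.memory, determines maxMemory in local mode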

Thanks,


--  
Nan Zhu



Re: Pull requests will be automatically linked to JIRA when submitted

2014-07-20 Thread Nan Zhu
Awesome!

On Saturday, July 19, 2014, Patrick Wendell  wrote:

> Just a small note, today I committed a tool that will automatically
> mirror pull requests to JIRA issues, so contributors will no longer
> have to manually post a pull request on the JIRA when they make one.
>
> It will create a "link" on the JIRA and also make a comment to trigger
> an e-mail to people watching.
>
> This should make some things easier, such as avoiding accidental
> duplicate effort on the same JIRA.
>
> - Patrick
>


Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
I resolved the issue by setting up an internal maven repository to host the 
Spark-1.0.1 jar compiled from branch-0.1-jdbc and replacing the dependency on 
the central repository with our own repository.

I believe there should be a more lightweight way.
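Roughly, that workaround looks like this in build.sbt (the repository URL and version string are 
placeholders for whatever the internal repo hosting the branch-0.1-jdbc build uses):

  // build.sbt -- sketch of pointing the build at an internal repository
  resolvers += "internal-spark-repo" at "https://repo.example.com/maven/"          // placeholder URL
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1-SNAPSHOT"     // placeholder version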

Best, 

-- 
Nan Zhu


On Monday, July 14, 2014 at 6:36 AM, Nan Zhu wrote:

> Ah, sorry, sorry
> 
> It's executorState under deploy package
> 
> On Monday, July 14, 2014, Patrick Wendell  (mailto:pwend...@gmail.com)> wrote:
> > > 1. The first error I met is the different SerializationVersionUID in 
> > > ExecuterStatus
> > >
> > > I resolved by explicitly declare SerializationVersionUID in 
> > > ExecuterStatus.scala and recompile branch-0.1-jdbc
> > >
> > 
> > I don't think there is a class in Spark named ExecuterStatus (sic) ...
> > or ExecutorStatus. Is this a class you made?



Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
Ah, sorry, sorry

It's executorState under deploy package

On Monday, July 14, 2014, Patrick Wendell  wrote:

> > 1. The first error I met is the different SerializationVersionUID in
> ExecuterStatus
> >
> > I resolved by explicitly declare SerializationVersionUID in
> ExecuterStatus.scala and recompile branch-0.1-jdbc
> >
>
> I don't think there is a class in Spark named ExecuterStatus (sic) ...
> or ExecutorStatus. Is this a class you made?
>


how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-13 Thread Nan Zhu
Hi, all  

I’m trying the JDBC server, so the cluster is running the version compiled from 
branch-0.1-jdbc  

Unfortunately (and as expected), it cannot run programs compiled with a 
dependency on spark 1.0 (i.e. downloaded from maven)

1. The first error I met is the different SerializationVersionUID in 
ExecuterStatus  

I resolved it by explicitly declaring SerializationVersionUID in 
ExecuterStatus.scala and recompiling branch-0.1-jdbc

2. Then I started the program compiled with spark-1.0, and what I got is:  

14/07/13 05:08:11 WARN AppClient$ClientActor: Could not connect to 
akka.tcp://sparkMaster@172.31.*.*:*: java.util.NoSuchElementException: key not 
found: 6  
14/07/13 05:08:11 WARN AppClient$ClientActor: Connection to 
akka.tcp://sparkMaster@172.31.*.*:* failed; waiting for master to reconnect...



I don’t understand where "key not found: 6” comes from



Also, I tried to start the JDBC server against a spark-1.0 cluster. After resolving 
the different SerializationVersionUID, what I found is that when I use beeline to run 
"show tables;", some executors get lost and tasks fail for an unknown 
reason.

Can anyone give some suggestions on how to make a spark-1.0 cluster work with 
JDBC?  

(Maybe I need to have an internal maven repo and change all the spark dependencies to 
point at it?)

Best,

--  
Nan Zhu



assign SPARK-2126 to me?

2014-06-19 Thread Nan Zhu
Hi, all

Can any admin assign this issue 
https://issues.apache.org/jira/browse/SPARK-2126 to me?

I have started working on this

Thanks,

-- 
Nan Zhu



anyone can mark this issue as resolved?

2014-06-17 Thread Nan Zhu
Hi, 

Just came across it by chance 

https://issues.apache.org/jira/browse/SPARK-1471 

Best, 

-- 
Nan Zhu



review of two PRs

2014-06-14 Thread Nan Zhu
Hi, all 

Would any admin like to review these two PRs?

https://github.com/apache/spark/pull/637 (to make DAGScheduler self-contained)

https://github.com/apache/spark/pull/731 (enable multiple executors per Worker)

Thanks 

-- 
Nan Zhu



Re: Add my JIRA username (hsaputra) to Spark's contributor's list

2014-06-03 Thread Nan Zhu
I think I lost that permission too.  

Patrick once helped me recover the permission, but it seems I have lost it again.

My username is CodingCat, or Nan Zhu (I’m not sure which one you use when doing 
this).

Best,  

--  
Nan Zhu


On Tuesday, June 3, 2014 at 2:39 PM, Henry Saputra wrote:

> Thanks Matei!
>  
> - Henry
>  
> On Tue, Jun 3, 2014 at 11:36 AM, Matei Zaharia  (mailto:matei.zaha...@gmail.com)> wrote:
> > Done. Looks like this was lost in the JIRA import.
> >  
> > Matei
> >  
> > On Jun 3, 2014, at 11:33 AM, Henry Saputra  > (mailto:henry.sapu...@gmail.com)> wrote:
> >  
> > > Hi,
> > >  
> > > Could someone with right karma kindly add my username (hsaputra) to
> > > Spark's contributor list?
> > >  
> > > I was added before but somehow now I can no longer assign ticket to
> > > myself nor update tickets I am working on.
> > >  
> > >  
> > > Thanks,
> > >  
> > > - Henry  



Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Nan Zhu
If local[2] is expected, then the streaming doc is actually misleading? 

as the given example is 

import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
// Create a StreamingContext with a local master
val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1))

http://spark.apache.org/docs/latest/streaming-programming-guide.html

I created a JIRA and a PR 

https://github.com/apache/spark/pull/924 
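For comparison, a minimal sketch of the example with at least two local threads (one for the 
receiver, one for processing):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // "local[2]" gives the receiver its own thread plus one for processing;
  // plain "local" starves the processing side, which matches the symptom above.
  val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.print()
  ssc.start()
  ssc.awaitTermination()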

-- 
Nan Zhu


On Friday, May 30, 2014 at 1:53 PM, Patrick Wendell wrote:

> Yeah - Spark streaming needs at least two threads to run. I actually
> thought we warned the user if they only use one (@tdas?) but the
> warning might not be working correctly - or I'm misremembering.
> 
> On Fri, May 30, 2014 at 6:38 AM, Sean Owen  (mailto:so...@cloudera.com)> wrote:
> > Thanks Nan, that does appear to fix it. I was using "local". Can
> > anyone say whether that's to be expected or whether it could be a bug
> > somewhere?
> > 
> > On Fri, May 30, 2014 at 2:42 PM, Nan Zhu  > (mailto:zhunanmcg...@gmail.com)> wrote:
> > > Hi, Sean
> > > 
> > > I was in the same problem
> > > 
> > > but when I changed MASTER="local" to MASTER="local[2]"
> > > 
> > > everything back to the normal
> > > 
> > > Hasn't get a chance to ask here
> > > 
> > > Best,
> > > 
> > > --
> > > Nan Zhu
> > > 
> > 
> > 
> 
> 
> 




Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Nan Zhu
Hi, Sean   

I was in the same problem

but when I changed MASTER=“local” to MASTER=“local[2]”

everything went back to normal

I hadn’t gotten a chance to ask about it here

Best,  

--  
Nan Zhu


On Friday, May 30, 2014 at 9:09 AM, Sean Owen wrote:

> Guys I'm struggling to debug some strange behavior in a simple
> Streaming + Java + Kafka example -- in fact, a simplified version of
> JavaKafkaWordcount, that is just calling print() on a sequence of
> messages.
>  
> Data is flowing, but it only appears to work for a few periods --
> sometimes 0 -- before ceasing to call any actions. Sorry for lots of
> log posting but it may illustrate to someone who knows this better
> what is happening:
>  
>  
>  
> Key action in the logs seems to be as follows -- it works a few times:
>  
> ...
> 2014-05-30 13:53:50 INFO ReceiverTracker:58 - Stream 0 received 0 blocks
> 2014-05-30 13:53:50 INFO JobScheduler:58 - Added jobs for time 140145443 
> ms
> ---
> Time: 140145443 ms
> ---
>  
> 2014-05-30 13:53:50 INFO JobScheduler:58 - Starting job streaming job
> 140145443 ms.0 from job set of time 140145443 ms
> 2014-05-30 13:53:50 INFO JobScheduler:58 - Finished job streaming job
> 140145443 ms.0 from job set of time 140145443 ms
> 2014-05-30 13:53:50 INFO JobScheduler:58 - Total delay: 0.004 s for
> time 140145443 ms (execution: 0.000 s)
> 2014-05-30 13:53:50 INFO MappedRDD:58 - Removing RDD 2 from persistence list
> 2014-05-30 13:53:50 INFO BlockManager:58 - Removing RDD 2
> 2014-05-30 13:53:50 INFO BlockRDD:58 - Removing RDD 1 from persistence list
> 2014-05-30 13:53:50 INFO BlockManager:58 - Removing RDD 1
> 2014-05-30 13:53:50 INFO KafkaInputDStream:58 - Removing blocks of
> RDD BlockRDD[1] at BlockRDD at ReceiverInputDStream.scala:69 of time
> 140145443 ms
> 2014-05-30 13:54:00 INFO ReceiverTracker:58 - Stream 0 received 0 blocks
> 2014-05-30 13:54:00 INFO JobScheduler:58 - Added jobs for time 140145444 
> ms
> ...
>  
>  
> Then works with some additional, different output in the logs -- here
> you see output is flowing too:
>  
> ...
> 2014-05-30 13:54:20 INFO ReceiverTracker:58 - Stream 0 received 2 blocks
> 2014-05-30 13:54:20 INFO JobScheduler:58 - Added jobs for time 140145446 
> ms
> 2014-05-30 13:54:20 INFO JobScheduler:58 - Starting job streaming job
> 140145446 ms.0 from job set of time 140145446 ms
> 2014-05-30 13:54:20 INFO SparkContext:58 - Starting job: take at
> DStream.scala:593
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Got job 1 (take at
> DStream.scala:593) with 1 output partitions (allowLocal=true)
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Final stage: Stage 1(take
> at DStream.scala:593)
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Parents of final stage: List()
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Missing parents: List()
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Computing the requested
> partition locally
> 2014-05-30 13:54:20 INFO BlockManager:58 - Found block
> input-0-1401454458400 locally
> 2014-05-30 13:54:20 INFO SparkContext:58 - Job finished: take at
> DStream.scala:593, took 0.007007 s
> 2014-05-30 13:54:20 INFO SparkContext:58 - Starting job: take at
> DStream.scala:593
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Got job 2 (take at
> DStream.scala:593) with 1 output partitions (allowLocal=true)
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Final stage: Stage 2(take
> at DStream.scala:593)
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Parents of final stage: List()
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Missing parents: List()
> 2014-05-30 13:54:20 INFO DAGScheduler:58 - Computing the requested
> partition locally
> 2014-05-30 13:54:20 INFO BlockManager:58 - Found block
> input-0-1401454459400 locally
> 2014-05-30 13:54:20 INFO SparkContext:58 - Job finished: take at
> DStream.scala:593, took 0.002217 s
> ---
> Time: 140145446 ms
> ---
> 99,true,-0.11342268416043325
> 17,false,1.6732879882133793
> ...
>  
>  
> Then keeps repeating the following with no more evidence that the
> print() action is being called:
>  
> ...
> 2014-05-30 13:54:20 INFO JobScheduler:58 - Finished job streaming job
> 140145446 ms.0 from job set of time 140145446 ms
> 2014-05-30 13:54:20 INFO MappedRDD:58 - Removing RDD 8 from persistence list
> 2014-05-30 13:54:20 INFO JobScheduler:58 - Total delay: 0.019 s for
> time 140145446 ms (execution: 0.015 s)
> 2014-05-30 13:54:20 INFO BlockManager:58 - Removing RDD 8
> 2014-05-30 13:54:20 INFO BlockRDD:58 - Removing RDD 7 from persistence list
&

Re: spark 1.0 standalone application

2014-05-19 Thread Nan Zhu
This is the first time I’ve heard there is a temporary maven repository…

--  
Nan Zhu


On Monday, May 19, 2014 at 10:10 PM, Patrick Wendell wrote:

> Whenever we publish a release candidate, we create a temporary maven
> repository that host the artifacts. We do this precisely for the case
> you are running into (where a user wants to build an application
> against it to test).
>  
> You can build against the release candidate by just adding that
> repository in your sbt build, then linking against "spark-core"
> version "1.0.0". For rc9 the repository is in the vote e-mail:
>  
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
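A minimal build.sbt sketch of the setup Patrick describes (the staging repository URL below is a 
placeholder; the real one is listed in the RC vote e-mail, and the Scala version is illustrative):

  // build.sbt -- sketch for compiling an application against a Spark release candidate
  name := "my-spark-app"
  scalaVersion := "2.10.4"
  resolvers += "Spark RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"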
>  
> On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra  (mailto:m...@clearstorydata.com)> wrote:
> > That's the crude way to do it. If you run `sbt/sbt publishLocal`, then you
> > can resolve the artifact from your local cache in the same way that you
> > would resolve it if it were deployed to a remote cache. That's just the
> > build step. Actually running the application will require the necessary
> > jars to be accessible by the cluster nodes.
> >  
> >  
> > On Mon, May 19, 2014 at 7:04 PM, Nan Zhu  > (mailto:zhunanmcg...@gmail.com)> wrote:
> >  
> > > en, you have to put spark-assembly-*.jar to the lib directory of your
> > > application
> > >  
> > > Best,
> > >  
> > > --
> > > Nan Zhu
> > >  
> > >  
> > > On Monday, May 19, 2014 at 9:48 PM, nit wrote:
> > >  
> > > > I am not much comfortable with sbt. I want to build a standalone
> > > application
> > > > using spark 1.0 RC9. I can build sbt assembly for my application with
> > >  
> > > Spark
> > > > 0.9.1, and I think in that case spark is pulled from Aka Repository?
> > > >  
> > > > Now if I want to use 1.0 RC9 for my application; what is the process ?
> > > > (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> > > > sbt-assembly jar; and I think I will have to copy my jar somewhere? and
> > > > update build.sbt?)
> > > >  
> > > > PS: I am not sure if this is the right place for this question; but 
> > > > since
> > > > 1.0 is still RC, I felt that this may be appropriate forum.
> > > >  
> > > > thank!
> > > >  
> > > >  
> > > >  
> > > > --
> > > > View this message in context:
> > > >  
> > >  
> > > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> > > > Sent from the Apache Spark Developers List mailing list archive at
> > >  
> > > Nabble.com (http://Nabble.com).
> > > >  
> > >  
> > >  
> >  
> >  
>  
>  
>  




Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-19 Thread Nan Zhu
I just reran my test on rc5.

Everything works.

I built applications with sbt against the spark-*.jar compiled with Hadoop 
2.3.

+1 

-- 
Nan Zhu


On Sunday, May 18, 2014 at 11:07 PM, witgo wrote:

> How to reproduce this bug?
> 
> 
> -- Original --
> From: "Patrick Wendell";mailto:pwend...@gmail.com)>;
> Date: Mon, May 19, 2014 10:08 AM
> To: "dev@spark.apache.org (mailto:dev@spark.apache.org)" (mailto:dev@spark.apache.org)>; 
> Cc: "Tom Graves"mailto:tgraves...@yahoo.com)>; 
> Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
> 
> 
> 
> Hey Matei - the issue you found is not related to security. This patch
> a few days ago broke builds for Hadoop 1 with YARN support enabled.
> The patch directly altered the way we deal with commons-lang
> dependency, which is what is at the base of this stack trace.
> 
> https://github.com/apache/spark/pull/754
> 
> - Patrick
> 
> On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia  (mailto:matei.zaha...@gmail.com)> wrote:
> > Alright, I've opened https://github.com/apache/spark/pull/819 with the 
> > Windows fixes. I also found one other likely bug, 
> > https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages 
> > for Hadoop1 built in this RC. I think this is due to Hadoop 1's security 
> > code depending on a different version of org.apache.commons than Hadoop 2, 
> > but it needs investigation. Tom, any thoughts on this?
> > 
> > Matei
> > 
> > On May 18, 2014, at 12:33 PM, Matei Zaharia  > (mailto:matei.zaha...@gmail.com)> wrote:
> > 
> > > I took the always fun task of testing it on Windows, and unfortunately, I 
> > > found some small problems with the prebuilt packages due to recent 
> > > changes to the launch scripts: bin/spark-class2.cmd looks in ./jars 
> > > instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't 
> > > quite match the master-setting behavior of the Unix based one. I'll send 
> > > a pull request to fix them soon.
> > > 
> > > Matei
> > > 
> > > 
> > > On May 17, 2014, at 11:32 AM, Sandy Ryza  > > (mailto:sandy.r...@cloudera.com)> wrote:
> > > 
> > > > +1
> > > > 
> > > > Reran my tests from rc5:
> > > > 
> > > > * Built the release from source.
> > > > * Compiled Java and Scala apps that interact with HDFS against it.
> > > > * Ran them in local mode.
> > > > * Ran them against a pseudo-distributed YARN cluster in both yarn-client
> > > > mode and yarn-cluster mode.
> > > > 
> > > > 
> > > > On Sat, May 17, 2014 at 10:08 AM, Andrew Or  > > > (mailto:and...@databricks.com)> wrote:
> > > > 
> > > > > +1
> > > > > 
> > > > > 
> > > > > 2014-05-17 8:53 GMT-07:00 Mark Hamstra  > > > > (mailto:m...@clearstorydata.com)>:
> > > > > 
> > > > > > +1
> > > > > > 
> > > > > > 
> > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
> > > > > > mailto:pwend...@gmail.com)
> > > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > > I'll start the voting with a +1.
> > > > > > > 
> > > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
> > > > > > > mailto:pwend...@gmail.com)>
> > > > > > > wrote:
> > > > > > > > Please vote on releasing the following candidate as Apache Spark
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > version
> > > > > > > 1.0.0!
> > > > > > > > This has one bug fix and one minor feature on top of rc8:
> > > > > > > > SPARK-1864: https://github.com/apache/spark/pull/808
> > > > > > > > SPARK-1808: https://github.com/apache/spark/pull/799
> > > > > > > > 
> > > > > > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947):
> > > > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
> > > > > > > > 
> > > > > > > > The release files, including signatures, digests, etc. can be 
> > > > > > > > found
> > > > > at:
> > > > &g

Re: spark 1.0 standalone application

2014-05-19 Thread Nan Zhu
Well, you have to put the spark-assembly-*.jar into the lib directory of your 
application.
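A minimal sketch of that layout with sbt (paths are illustrative; sbt already treats jars under 
lib/ as unmanaged dependencies, so this only makes the default explicit):

  // build.sbt
  unmanagedBase := baseDirectory.value / "lib"
  // then copy the assembly in place, e.g.:
  //   cp $SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar lib/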

Best, 

-- 
Nan Zhu


On Monday, May 19, 2014 at 9:48 PM, nit wrote:

> I am not much comfortable with sbt. I want to build a standalone application
> using spark 1.0 RC9. I can build sbt assembly for my application with Spark
> 0.9.1, and I think in that case spark is pulled from Aka Repository?
> 
> Now if I want to use 1.0 RC9 for my application; what is the process ?
> (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
> sbt-assembly jar; and I think I will have to copy my jar somewhere? and
> update build.sbt?)
> 
> PS: I am not sure if this is the right place for this question; but since
> 1.0 is still RC, I felt that this may be appropriate forum.
> 
> thank! 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
+1, replaced rc3 with rc5, all applications are working fine

Best, 

-- 
Nan Zhu


On Tuesday, May 13, 2014 at 8:03 PM, Madhu wrote:

> I built rc5 using sbt/sbt assembly on Linux without any problems.
> There used to be an sbt.cmd for Windows build, has that been deprecated?
> If so, I can document the Windows build steps that worked for me.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6558.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
Ah, I see, thanks 

-- 
Nan Zhu


On Tuesday, May 13, 2014 at 12:59 PM, Mark Hamstra wrote:

> There were a few early/test RCs this cycle that were never put to a vote.
> 
> 
> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> 
> > just curious, where is rc4 VOTE?
> > 
> > I searched my gmail but didn't find that?
> > 
> > 
> > 
> > 
> > On Tue, May 13, 2014 at 9:49 AM, Sean Owen  > (mailto:so...@cloudera.com)> wrote:
> > 
> > > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell  > > (mailto:pwend...@gmail.com)>
> > > wrote:
> > > > The release files, including signatures, digests, etc. can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
> > > > 
> > > 
> > > 
> > > Good news is that the sigs, MD5 and SHA are all correct.
> > > 
> > > Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> > > use SHA512, which took me a bit of head-scratching to figure out.
> > > 
> > > If another RC comes out, I might suggest making it SHA1 everywhere?
> > > But there is nothing wrong with these signatures and checksums.
> > > 
> > > Now to look at the contents... 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
Just curious, where is the rc4 VOTE?

I searched my gmail but didn't find it.




On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:

> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
> wrote:
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Good news is that the sigs, MD5 and SHA are all correct.
>
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
>
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
>
> Now to look at the contents...
>


Re: Spark 1.0.0 rc3

2014-05-03 Thread Nan Zhu
I ran SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly 

and copied the generated jar to the lib/ directory of my application, 

but it seems that sbt cannot find the dependencies in the jar.

Everything works with the pre-built jar files downloaded from the link 
provided by Patrick, though.

Best, 

-- 
Nan Zhu


On Thursday, May 1, 2014 at 11:16 PM, Madhu wrote:

> I'm guessing EC2 support is not there yet?
> 
> I was able to build using the binary download on both Windows 7 and RHEL 6
> without issues.
> I tried to create an EC2 cluster, but saw this:
> 
> ~/spark-ec2
> Initializing spark
> ~ ~/spark-ec2
> ERROR: Unknown Spark version
> Initializing shark
> ~ ~/spark-ec2 ~/spark-ec2
> ERROR: Unknown Shark version
> 
> The spark dir on the EC2 master has only a conf dir, so it didn't deploy
> properly.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-0-0-rc3-tp6427p6456.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Fw: Is there any way to make a quick test on some pre-commit code?

2014-04-23 Thread Nan Zhu
I was just asked the same question by someone else.  

I think Reynold gave a pretty helpful tip on this.  

Shall we put it on the Contributing-to-Spark wiki?  

--  
Nan Zhu


Forwarded message:

> From: Reynold Xin 
> Reply To: d...@spark.incubator.apache.org
> To: d...@spark.incubator.apache.org 
> Date: Thursday, February 6, 2014 at 7:50:57 PM
> Subject: Re: Is there any way to make a quick test on some pre-commit code?
>  
> You can do
>  
> sbt/sbt assemble-deps
>  
>  
> and then just run
>  
> sbt/sbt package
>  
> each time.
>  
>  
> You can even do
>  
> sbt/sbt ~package
>  
> for automatic incremental compilation.
>  
>  
>  
> On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
>  
> > Hi, all
> >  
> > Is it always necessary to run sbt assembly when you want to test some code,
> >  
> > Sometimes you just repeatedly change one or two lines for some failed test
> > case, it is really time-consuming to sbt assembly every time
> >  
> > any faster way?
> >  
> > Best,
> >  
> > --
> > Nan Zhu
> >  
>  
>  
>  
>  




Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nan Zhu
I thought those were the files for spark.apache.org? 

-- 
Nan Zhu


On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:

> The markdown files are under spark/docs. You can submit a PR for
> changes. -Xiangrui
> 
> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza  (mailto:sandy.r...@cloudera.com)> wrote:
> > How do I get permissions to edit the wiki?
> > 
> > 
> > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng  > (mailto:men...@gmail.com)> wrote:
> > 
> > > Cannot agree more with your words. Could you add one section about
> > > "how and what to contribute" to MLlib's guide? -Xiangrui
> > > 
> > > On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> > > mailto:nick.pentre...@gmail.com)> wrote:
> > > > I'd say a section in the "how to contribute" page would be a good place
> > > 
> > > to put this.
> > > > 
> > > > In general I'd say that the criteria for inclusion of an algorithm is it
> > > should be high quality, widely known, used and accepted (citations and
> > > concrete use cases as examples of this), scalable and parallelizable, well
> > > documented and with reasonable expectation of dev support
> > > > 
> > > > Sent from my iPhone
> > > > 
> > > > > On 21 Apr 2014, at 19:59, Sandy Ryza  > > > > (mailto:sandy.r...@cloudera.com)> wrote:
> > > > > 
> > > > > If it's not done already, would it make sense to codify this 
> > > > > philosophy
> > > > > somewhere? I imagine this won't be the first time this discussion 
> > > > > comes
> > > > > up, and it would be nice to have a doc to point to. I'd be happy to
> > > > > 
> > > > 
> > > > 
> > > 
> > > take a
> > > > > stab at this.
> > > > > 
> > > > > 
> > > > > > On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng  > > > > > (mailto:men...@gmail.com)>
> > > wrote:
> > > > > > 
> > > > > > +1 on Sean's comment. MLlib covers the basic algorithms but we
> > > > > > definitely need to spend more time on how to make the design 
> > > > > > scalable.
> > > > > > For example, think about current "ProblemWithAlgorithm" naming 
> > > > > > scheme.
> > > > > > That being said, new algorithms are welcomed. I wish they are
> > > > > > well-established and well-understood by users. They shouldn't be
> > > > > > research algorithms tuned to work well with a particular dataset but
> > > > > > not tested widely. You see the change log from Mahout:
> > > > > > 
> > > > > > ===
> > > > > > The following algorithms that were marked deprecated in 0.8 have 
> > > > > > been
> > > > > > removed in 0.9:
> > > > > > 
> > > > > > From Clustering:
> > > > > > Switched LDA implementation from using Gibbs Sampling to Collapsed
> > > > > > Variational Bayes (CVB)
> > > > > > Meanshift
> > > > > > MinHash - removed due to poor performance, lack of support and lack 
> > > > > > of
> > > > > > usage
> > > > > > 
> > > > > > From Classification (both are sequential implementations)
> > > > > > Winnow - lack of actual usage and support
> > > > > > Perceptron - lack of actual usage and support
> > > > > > 
> > > > > > Collaborative Filtering
> > > > > > SlopeOne implementations in
> > > > > > org.apache.mahout.cf.taste.hadoop.slopeone and
> > > > > > org.apache.mahout.cf.taste.impl.recommender.slopeone
> > > > > > Distributed pseudo recommender in
> > > > > > org.apache.mahout.cf.taste.hadoop.pseudo
> > > > > > TreeClusteringRecommender in
> > > > > > org.apache.mahout.cf.taste.impl.recommender
> > > > > > 
> > > > > > Mahout Math
> > > > > > Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> > > > > > ===
> > > > > > 
> > > > > > In MLlib, we should include the algorithms users know how to use and
> > > > > > we can provide support rather than letting algorithms come and

Re: It seems that jenkins for PR is not working

2014-04-14 Thread Nan Zhu
+1….  

--  
Nan Zhu


On Friday, April 11, 2014 at 5:35 PM, DB Tsai wrote:

> I always got
> =
>  
> Could not find Apache license headers in the following files:
> !? /root/workspace/SparkPullRequestBuilder/python/metastore/db.lck
> !? 
> /root/workspace/SparkPullRequestBuilder/python/metastore/service.properties
>  
>  
> Sincerely,
>  
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>  
>  




Re: Flaky streaming tests

2014-04-07 Thread Nan Zhu
I have hit this issue when Jenkins seems to be very busy

On Monday, April 7, 2014, Kay Ousterhout  wrote:

> Hi all,
>
> The InputStreamsSuite seems to have some serious flakiness issues -- I've
> seen the file input stream fail many times and now I'm seeing some actor
> input stream test failures (
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
> )
> on what I think is an unrelated change.  Does anyone know anything about
> these?  Should we just remove some of these tests since they seem to be
> constantly failing?
>
> -Kay
>


Re: The difference between driver and master in Spark

2014-03-31 Thread Nan Zhu
The master manages the resources in the cluster, e.g. ensuring that all the components 
(master/worker/driver) can work together. 

For example, you submit your application along the path: driver -> master -> 
worker.

Then the driver takes most of the responsibility for running your application, e.g. 
scheduling jobs/tasks.

The driver is more of a user-facing component, while the master is more 
transparent to the user.

Best, 

-- 
Nan Zhu


On Monday, March 31, 2014 at 10:48 AM, Dan wrote:

> Hi,
> 
> I've been recently reading spark code and confused about driver and
> master. What's the difference between them?
> 
> When I run spark in standalone cluster, from the log it seems that the
> driver has not been launched.
> 
> Thanks,
> Dan
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/The-difference-between-driver-and-master-in-Spark-tp6158.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Re: Migration to the new Spark JIRA

2014-03-29 Thread Nan Zhu
That’s great!  

Andy, thank you for all your contributions to the community!

Best,  

--  
Nan Zhu


On Saturday, March 29, 2014 at 11:40 PM, Patrick Wendell wrote:

> Hey All,
>  
> We've successfully migrated the Spark JIRA to the Apache infrastructure.
> This turned out to be a huge effort, lead by Andy Konwinski, who deserves
> all of our deepest appreciation for managing this complex migration
>  
> Since Apache runs the same JIRA version as Spark's existing JIRA, there is
> no new software to learn. A few things to note though:
>  
> - The issue tracker for Spark is now at:
> https://issues.apache.org/jira/browse/SPARK
>  
> - You can sign up to receive an e-mail feed of JIRA updates by e-mailing:
> issues-subscr...@spark.apache.org (mailto:issues-subscr...@spark.apache.org)
>  
> - DO NOT create issues on the old JIRA. I'll try to disable this so that it
> is read-only.
>  
> - You'll need to create an account at the new site if you don't have one
> already.
>  
> - We've imported all the old JIRA's. In some cases the import tool can't
> correctly guess the assignee for the JIRA, so we may have to do some manual
> assignment.
>  
> - If you feel like you don't have sufficient permissions on the new JIRA,
> please send me an e-mail. I tried to add all of the committers as
> administrators but I may have missed some.
>  
> Thanks,
> Patrick
>  
>  




Re: Travis CI

2014-03-29 Thread Nan Zhu
Hi, Michael  

Thank you so much for your reply

Here is an example on hive/test, 
https://travis-ci.org/apache/spark/builds/21834835

So from your reply, the aborted hive/test was due to the failure in BagelSuite 
(though that’s also a weird one)?

Best,  

--  
Nan Zhu


On Saturday, March 29, 2014 at 8:21 PM, Michael Armbrust wrote:

> >  
> > Is the migration from Jenkins to Travis finished?
>  
> It is not finished and really at this point it is only something we are
> considering, not something that will happen for sure. We turned it on in
> addition to Jenkins so that we could start finding issues exactly like the
> ones you described below to determine if Travis is going to be a viable
> option.
>  
> Basically it seems to me that the Travis environment is a little less
> predictable (probably because of virtualization) and this is pointing out
> some existing flakey-ness in the tests
>  
> If there are tests that are regularly flakey we should probably file JIRAs
> so they can be fixed or switched off. If you have seen a test fail 2-3
> times and then pass with no changes, I'd say go ahead and file an issue for
> it (others should feel free to chime in if we want some other process here)
>  
> A few more specific comments inline below.
>  
>  
> > 2. hive/test usually aborted because it doesn't output anything within 10
> > minutes
> >  
>  
>  
> Hmm, this is a little confusing. Do you have a pointer to this one? Was
> there any other error?
>  
>  
> > 4. hive/test didn't finish in 50 minutes, and was aborted
>  
> Here I think the right thing to do is probably break the hive tests in two
> and run them in parallel. There is already machinery for doing this, we
> just need to flip the options on in the travis.yml to make it happen. This
> is only going to get more critical as we whitelist more hive tests. We
> also talked about checking the PR and skipping the hive tests when there
> have been no changes in catalyst/sql/hive. I'm okay with this plan, just
> need to find someone with time to implement it
>  
>  




Re: Travis CI

2014-03-29 Thread Nan Zhu
Hi,   

Is the migration from Jenkins to Travis finished?

I think Travis is actually not stable, based on observations over the last few days 
(and Jenkins has become unstable too……  :-(  ). I’m actively working on two PRs 
related to DAGScheduler, and I have seen:

Problems on Travis:  

1. the test "large number of iterations" in BagelSuite sometimes fails because 
it doesn’t output anything within 10 seconds

2. hive/test is usually aborted because it doesn’t output anything within 10 
minutes

3. a test case in Streaming.CheckpointSuite failed  

4. hive/test didn’t finish in 50 minutes and was aborted

Problems on Jenkins:

1. a build didn’t finish in 90 minutes and the process was aborted

2. the same as item 3 in the Travis list

Some of these problems appeared in Jenkins months ago, but not as often.

I’m not complaining; I know that the admins are working hard to keep the 
community running well in every respect.  

I’m just reporting what I saw and hope it can help you identify the problem

Thank you  

--  
Nan Zhu


On Tuesday, March 25, 2014 at 10:11 PM, Patrick Wendell wrote:

> Ya It's been a little bit slow lately because of a high error rate in
> interactions with the git-hub API. Unfortunately we are pretty slammed
> for the release and haven't had a ton of time to do further debugging.
>  
> - Patrick
>  
> On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > I just found that the Jenkins is not working from this afternoon
> >  
> > for one PR, the first time build failed after 90 minutes, the second time it
> > has run for more than 2 hours, no result is returned
> >  
> > Best,
> >  
> > --
> > Nan Zhu
> >  
> >  
> > On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:
> >  
> > That's not correct - like Michael said the Jenkins build remains the
> > reference build for now.
> >  
> > On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu  > (mailto:zhunanmcg...@gmail.com)> wrote:
> >  
> > I assume the Jenkins is not working now?
> >  
> > Best,
> >  
> > --
> > Nan Zhu
> >  
> >  
> > On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
> >  
> > Just a quick note to everyone that Patrick and I are playing around with
> > Travis CI on the Spark github repository. For now, travis does not run all
> > of the test cases, so will only be turned on experimentally. Long term it
> > looks like Travis might give better integration with github, so we are
> > going to see if it is feasible to get all of our tests running on it.
> >  
> > *Jenkins remains the reference CI and should be consulted before merging
> > pull requests, independent of what Travis says.*
> >  
> > If you have any questions or want to help out with the investigation, let
> > me know!
> >  
> > Michael  



a weird test case in Streaming

2014-03-29 Thread Nan Zhu
Hi, all  

The "recovery with file input stream” in the Streaming.CheckpointSuite 
sometimes failed even you are working on a totally irrelevant part, I met this 
problem for 3+ times.

I assume this test case is likely to fail when the testing servers are very 
busy?

Two cases from others:

Sean: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/

Mark: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13531/


Best,  

--  
Nan Zhu




Re: Mailbomb from amplabs jenkins ?

2014-03-27 Thread Nan Zhu
Yes, it sends one for every PR you were involved in.

I think Patrick is doing something on Jenkins; he just stopped some testing 
jobs manually.

Best, 

-- 
Nan Zhu


On Thursday, March 27, 2014 at 11:07 PM, Mridul Muralidharan wrote:

> Got some 100 odd mails from jenkins (?) with "Can one of the admins
> verify this patch?"
> Part of upgrade or some other issue ?
> Significantly reduced the snr of my inbox !
> 
> Regards,
> Mridul
> 
> 




Re: Travis CI

2014-03-25 Thread Nan Zhu
I just found that Jenkins has not been working since this afternoon.

For one PR, the first build failed after 90 minutes; the second has been running for 
more than 2 hours and no result has been returned.

Best, 

-- 
Nan Zhu



On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:

> That's not correct - like Michael said the Jenkins build remains the
> reference build for now.
> 
> On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > I assume the Jenkins is not working now?
> > 
> > Best,
> > 
> > --
> > Nan Zhu
> > 
> > 
> > On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
> > 
> > Just a quick note to everyone that Patrick and I are playing around with
> > Travis CI on the Spark github repository. For now, travis does not run all
> > of the test cases, so will only be turned on experimentally. Long term it
> > looks like Travis might give better integration with github, so we are
> > going to see if it is feasible to get all of our tests running on it.
> > 
> > *Jenkins remains the reference CI and should be consulted before merging
> > pull requests, independent of what Travis says.*
> > 
> > If you have any questions or want to help out with the investigation, let
> > me know!
> > 
> > Michael 



Re: Travis CI

2014-03-25 Thread Nan Zhu
I assume the Jenkins is not working now? 

Best, 

-- 
Nan Zhu



On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:

> Just a quick note to everyone that Patrick and I are playing around with
> Travis CI on the Spark github repository. For now, travis does not run all
> of the test cases, so will only be turned on experimentally. Long term it
> looks like Travis might give better integration with github, so we are
> going to see if it is feasible to get all of our tests running on it.
> 
> *Jenkins remains the reference CI and should be consulted before merging
> pull requests, independent of what Travis says.*
> 
> If you have any questions or want to help out with the investigation, let
> me know!
> 
> Michael 



How the scala style checker works?

2014-03-19 Thread Nan Zhu
Hi, all  

I’m just curious about how the scala style checker works.

While working on a PR, I found that the following line contains 101 chars, 
violating the 100-character limit:  

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L515

but the current scala style checker passes this line?

Best,  

--  
Nan Zhu




Re: test cases stuck on "local-cluster mode" of ReplSuite?

2014-03-14 Thread Nan Zhu
Yeah,  

I tested that; I had my SPARK_HOME pointing to a very old location, and after I fixed 
that, everything works fine.

Thank you so much for pointing this out

Best, 

-- 
Nan Zhu


On Friday, March 14, 2014 at 6:41 PM, Michael Armbrust wrote:

> Sorry to revive an old thread, but I just ran into this issue myself. It
> is likely that you do not have the assembly jar built, or that you have
> SPARK_HOME set incorrectly (it does not need to be set).
> 
> Michael
> 
> 
> On Thu, Feb 27, 2014 at 8:13 AM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> 
> > Hi, all
> > 
> > Actually this problem exists for months in my side, when I run the test
> > cases, it will stop (actually pause?) at the ReplSuite
> > 
> > [info] ReplSuite:
> > 2014-02-27 10:57:37.220 java[3911:1303] Unable to load realm info from
> > SCDynamicStore
> > [info] - propagation of local properties (7 seconds, 646 milliseconds)
> > [info] - simple foreach with accumulator (6 seconds, 204 milliseconds)
> > [info] - external vars (4 seconds, 271 milliseconds)
> > [info] - external classes (3 seconds, 186 milliseconds)
> > [info] - external functions (4 seconds, 843 milliseconds)
> > [info] - external functions that access vars (3 seconds, 503 milliseconds)
> > [info] - broadcast vars (4 seconds, 313 milliseconds)
> > [info] - interacting with files (2 seconds, 492 milliseconds)
> > 
> > 
> > 
> > The next test case should be
> > 
> > test("local-cluster mode") {
> > val output = runInterpreter("local-cluster[1,1,512]",
> > """
> > |var v = 7
> > |def getV() = v
> > |sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
> > |v = 10
> > |sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
> > |var array = new Array[Int](5)
> > |val broadcastArray = sc.broadcast(array)
> > |sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
> > |array(0) = 5
> > |sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
> > """.stripMargin)
> > assertDoesNotContain("error:", output)
> > assertDoesNotContain("Exception", output)
> > assertContains("res0: Int = 70", output)
> > assertContains("res1: Int = 100", output)
> > assertContains("res2: Array[Int] = Array(0, 0, 0, 0, 0)", output)
> > assertContains("res4: Array[Int] = Array(0, 0, 0, 0, 0)", output)
> > }
> > 
> > 
> > 
> > I didn't see any reason for it spending so much time on it
> > 
> > Any idea? I'm using mbp, OS X 10.9.1, Intel Core i7 2.9 GHz, Memory 8GB
> > 1600 MHz DDR3
> > 
> > Best,
> > 
> > --
> > Nan Zhu
> > 
> 
> 
> 




ping of PR #12

2014-03-10 Thread Nan Zhu
Hi, all

I understand that you are very busy, but it seems that this PR has been open for 
a long while, and there has already been some discussion on its incubator-spark version: 
https://github.com/apache/incubator-spark/pull/636

The current URL:

https://github.com/apache/spark/pull/12

Thank you very much! 

-- 
Nan Zhu




Undocumented configuration parameters

2014-03-05 Thread Nan Zhu
Hi, all  

Just out of curiosity, I grepped the source code of the core component yesterday and 
found that there are about 30 configuration parameters being used but 
undocumented.

I have been reading the source code and writing documentation for them; it is nearly 
finished…

But I would like to ask before I make the PR: what is the reason for the missing 
documentation? Did the contributors forget to update the docs, or are the parameters 
intended to be hidden because some of them are not expected to be changed by the user?

Best,  

--  
Nan Zhu



[SUGGESTION] suggest contributors to run sbt scalastyle before run sbt test

2014-03-03 Thread Nan Zhu
Hi, all

I noticed this because… my two PRs failed on the style check (exceeding the limit by 3 - 
5 chars) yesterday.

Maybe we can explicitly suggest that contributors run sbt scalastyle before they 
run the test cases.  

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Just adding one sentence after item 4 would be good enough, I think.  

Best,  

--  
Nan Zhu



Re: Spark JIRA

2014-02-28 Thread Nan Zhu
I think they are working on it? https://issues.apache.org/jira/browse/SPARK 

Best, 

-- 
Nan Zhu


On Friday, February 28, 2014 at 2:29 PM, Evan Chan wrote:

> Hey guys,
> 
> There is no plan to move the Spark JIRA from the current
> https://spark-project.atlassian.net/
> 
> right?
> 
> -- 
> --
> Evan Chan
> Staff Engineer
> e...@ooyala.com (mailto:e...@ooyala.com) |
> 
> 




Re: Discussion on SPARK-1139

2014-02-27 Thread Nan Zhu
Any discussion on this?  

I would like to hear more advice from the community before I create the PR.

An example is how a NewHadoopRDD is created:


we get a configuration from JobContext

val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf)


then we create a jobContext based on this configuration object

NewHadoopRDD.scala (L74)
val jobContext = newJobContext(conf, jobId)
val rawSplits = inputFormat.getSplits(jobContext).toArray


because inputFormat is from the mapreduce package, its methods only accept a JobContext 
as the parameter


I think we should avoid introducing Configuration as the parameter but, as 
before, this will change the APIs
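For illustration, a minimal sketch (not Spark source code, and not a concrete API proposal) of the 
new-API style the links in the original mail describe, where the Configuration ends up wrapped in a 
mapreduce.Job while Spark's entry point still receives the bare Configuration; paths and names are 
illustrative:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("new-api-example").setMaster("local[2]"))

  // New Hadoop API style: the Job wraps the Configuration, and the Job/JobContext
  // is what InputFormat.getSplits(JobContext) etc. expect.
  val job = new Job(new Configuration(), "example")
  FileInputFormat.addInputPath(job, new Path("hdfs:///some/input"))

  // Spark's new-API entry points still take the unwrapped Configuration:
  val rdd = sc.newAPIHadoopFile(
    "hdfs:///some/input",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    job.getConfiguration)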


Best,  

--  
Nan Zhu


On Wednesday, February 26, 2014 at 8:23 AM, Nan Zhu wrote:

> Hi, all  
>  
> I just created a JIRA https://spark-project.atlassian.net/browse/SPARK-1139 . 
> The issue discusses that:
>  
> the new Hadoop API based Spark APIs are actually a mixture of old and new 
> Hadoop API.
>  
> Spark APIs are still using JobConf (or Configuration) as one of the 
> parameters, but actually Configuration has been replaced by mapreduce.Job in 
> the new Hadoop API
>  
> for example : 
> http://codesfusion.blogspot.ca/2013/10/hadoop-wordcount-with-new-map-reduce-api.html
>   
>  
> &  
>  
> http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api (p10)
>  
> Personally I think it’s better to fix this design, but it will introduce some 
> compatibility issue  
>  
> Just bring it here for your advices
>  
> Best,  
>  
> --  
> Nan Zhu
>  



test cases stuck on "local-cluster mode" of ReplSuite?

2014-02-27 Thread Nan Zhu
Hi, all  

Actually this problem has existed for months on my side: when I run the test cases, 
they stop (actually pause?) at ReplSuite

[info] ReplSuite:  
2014-02-27 10:57:37.220 java[3911:1303] Unable to load realm info from 
SCDynamicStore
[info] - propagation of local properties (7 seconds, 646 milliseconds)
[info] - simple foreach with accumulator (6 seconds, 204 milliseconds)
[info] - external vars (4 seconds, 271 milliseconds)
[info] - external classes (3 seconds, 186 milliseconds)
[info] - external functions (4 seconds, 843 milliseconds)
[info] - external functions that access vars (3 seconds, 503 milliseconds)
[info] - broadcast vars (4 seconds, 313 milliseconds)
[info] - interacting with files (2 seconds, 492 milliseconds)



The next test case should be  

test("local-cluster mode") {
val output = runInterpreter("local-cluster[1,1,512]",
  """
|var v = 7
|def getV() = v
|sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
|v = 10
|sc.parallelize(1 to 10).map(x => getV()).collect.reduceLeft(_+_)
|var array = new Array[Int](5)
|val broadcastArray = sc.broadcast(array)
|sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
|array(0) = 5
|sc.parallelize(0 to 4).map(x => broadcastArray.value(x)).collect
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains("res0: Int = 70", output)
assertContains("res1: Int = 100", output)
assertContains("res2: Array[Int] = Array(0, 0, 0, 0, 0)", output)
assertContains("res4: Array[Int] = Array(0, 0, 0, 0, 0)", output)
  }



I didn’t see any reason for it to spend so much time there…

Any idea? I’m using mbp, OS X 10.9.1, Intel Core i7 2.9 GHz, Memory 8GB 1600 
MHz DDR3

Best,

--  
Nan Zhu



Re: [IMPORTANT] Github/jenkins migration

2014-02-26 Thread Nan Zhu
 Hi, Patrick,

How should we deal with the active pull requests in the old repository?

Do the contributors have to do anything?

Best,

-- 
Nan Zhu

On Wednesday, February 26, 2014 at 5:37 PM, Patrick Wendell wrote:

Hey All,

The github incubator-spark mirror has been migrated to [1] by Apache
infra and we've migrated Jenkins to reflect the new changes. This
means the existing "incubator-spark" mirror is becoming outdated and
no longer correctly displays pull request diff's.

We've asked apache infra to see if they can migrate existing pull
requests to incubator-spark. However since this relies on coordinating
with github, I'm not entirely sure whether they can do this or what
the timeline would be.

In the mean time it would be good for people to open new pull requests
against [1]. For pull requests that were *just* about to be merged, we
can go manually merge them, but ones that require feedback and more
rounds of testing will need to be done on the new one since
incubator-spark is now out of date.

Sorry about this inconvenience, it is a one-time transition and we
won't ever have to do it again.

[1] https://github.com/apache/spark

- Patrick


Discussion on SPARK-1139

2014-02-26 Thread Nan Zhu
Hi, all  

I just created a JIRA https://spark-project.atlassian.net/browse/SPARK-1139 . 
The issue discusses that:

the new Hadoop API based Spark APIs are actually a mixture of old and new 
Hadoop API.

Spark APIs are still using JobConf (or Configuration) as one of the parameters, 
but actually Configuration has been replaced by mapreduce.Job in the new Hadoop 
API

for example : 
http://codesfusion.blogspot.ca/2013/10/hadoop-wordcount-with-new-map-reduce-api.html
  

&  

http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api (p10)

Personally I think it’s better to fix this design, but it will introduce some 
compatibility issues  

Just bring it here for your advices

Best,  

--  
Nan Zhu