Re: Downloading Hadoop from s3://spark-related-packages/

2015-12-24 Thread Steve Loughran

On 24 Dec 2015, at 05:59, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

FYI: I opened an INFRA ticket with questions about how best to use the Apache 
mirror network.

https://issues.apache.org/jira/browse/INFRA-10999

Nick


Not that likely to get an answer, as it's really a support call, not a bug/task.
You never know, though.

There's another way to get at binaries, which is check them out direct from SVN

https://dist.apache.org/repos/dist/release/

This is a direct view into how you release things in the ASF: you just create a 
new dir under your project, copy the files in and then do an svn commit. I 
believe the replicated servers may just do an svn update on their local cache.

There's no mirroring, so if you install to lots of machines your download time 
will be slow. You could automate it, though: download once, upload the artifact 
to your own bucket, and have every machine do an s3 GET.
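
A rough Scala sketch of that download-once-then-republish approach, using the
AWS SDK for Java; the artifact URL, local path and bucket/key below are purely
illustrative, so substitute your own:

    import java.net.URL
    import java.nio.file.{Files, Paths, StandardCopyOption}

    import com.amazonaws.services.s3.AmazonS3Client

    object MirrorToS3 {
      def main(args: Array[String]): Unit = {
        // Illustrative artifact; point this at whatever you actually deploy.
        val artifactUrl = "https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz"
        val localPath   = Paths.get("/tmp/hadoop-2.7.1.tar.gz")
        val bucket      = "my-own-bucket"              // your own S3 bucket
        val key         = "mirror/hadoop-2.7.1.tar.gz"

        // Download once from the ASF dist server (or any mirror).
        val in = new URL(artifactUrl).openStream()
        try Files.copy(in, localPath, StandardCopyOption.REPLACE_EXISTING)
        finally in.close()

        // Re-publish to your own bucket using the default AWS credential chain;
        // every machine you provision can then do a plain s3 GET against it.
        val s3 = new AmazonS3Client()
        s3.putObject(bucket, key, localPath.toFile)
      }
    }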


[DAGScheduler] resubmitFailedStages, failedStages.clear() and submitStage

2015-12-24 Thread Jacek Laskowski
Hi,

While reviewing DAGScheduler, and in particular where the failedStages
internal collection of failed stages ready for resubmission is used, I
came across a question I'm looking for an answer to. Any hints would
be greatly appreciated.

When resubmitFailedStages [1] is executed and there are any failed
stages, they are resubmitted using submitStage [2]. But before that
happens, failedStages is cleared [3], so when submitStage is called
(which will ultimately call submitMissingTasks for the stage), it
checks whether the stage is in failedStages (among the other sets for
waiting and running stages) [4].

My naive understanding is that the call to submitStage is a no-op in
this case, i.e. nothing really happens and the if expression will
silently finish without doing anything useful until some other event
happens that changes the status of the failed stages into waiting
ones.

Is my understanding incorrect? If so, where? Could the call to submitStage be
superfluous? Please point me in the right direction. Thanks.

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L734
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L743
[3] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L741
[4] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L919

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski




Re: [DAGScheduler] resubmitFailedStages, failedStages.clear() and submitStage

2015-12-24 Thread Ted Yu
getMissingParentStages(stage) would be called for the stage being
re-submitted.

If there are no missing parents, submitMissingTasks() would be called.
If there are missing parents, each parent would go through the same flow.

I don't see an issue in this part of the code.
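
To make the flow concrete, here is a toy, self-contained Scala model of the
logic being discussed; it is a simplified paraphrase for illustration only, not
the actual DAGScheduler code, whose bookkeeping is considerably more involved:

    import scala.collection.mutable

    object DagFlowSketch {
      case class Stage(id: Int, parents: Seq[Stage] = Nil)

      val failedStages  = mutable.HashSet[Stage]()
      val waitingStages = mutable.HashSet[Stage]()
      val runningStages = mutable.HashSet[Stage]()

      // Stand-ins for the real lookups and task submission.
      def getMissingParentStages(stage: Stage): Seq[Stage] = stage.parents
      def submitMissingTasks(stage: Stage): Unit =
        println(s"submitting tasks for stage ${stage.id}")

      def resubmitFailedStages(): Unit = {
        if (failedStages.nonEmpty) {
          val toResubmit = failedStages.toArray.sortBy(_.id)
          failedStages.clear()            // [3] cleared first...
          toResubmit.foreach(submitStage) // [2] ...then each stage is resubmitted
        }
      }

      def submitStage(stage: Stage): Unit = {
        // [4] the stage was just removed from failedStages, so it is not skipped here
        if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
          val missing = getMissingParentStages(stage)
          if (missing.isEmpty) {
            submitMissingTasks(stage)     // no missing parents: submit its tasks now
          } else {
            missing.foreach(submitStage)  // otherwise submit the parents first
            waitingStages += stage        // and park this stage as waiting
          }
        }
      }
    }

Because failedStages is cleared before the resubmission loop, the guard in
submitStage does not short-circuit: each resubmitted stage either has its tasks
submitted right away or is parked in waitingStages behind its missing parents.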

Cheers

On Thu, Dec 24, 2015 at 5:19 AM, Jacek Laskowski  wrote:

> Hi,
>
> While reviewing DAGScheduler, and where failedStages internal
> collection of failed staged ready for resubmission is used, I came
> across a question for which I'm looking an answer to. Any hints would
> be greatly appreciated.
>
> When resubmitFailedStages [1] is executed, and there are any failed
> stages, they are resubmitted using submitStage [2], but before it
> happens, failedStages is cleared [3] so when submitStage is called
> that will ultimately call submitMissingTasks for the stage, it checks
> whether the stage is in failedStages (among the other sets for waiting
> and running stages) [4].
>
> My naive understanding is that the call to submitStage is a no-op in
> this case, i.e. nothing really happens and the if expression will
> silently finish without doing anything useful until some other event
> happens that changes the status of the failed stages into waiting
> ones.
>
> Is my understanding incorrect? Where? Could the call to submitStage be
> superfluous? Please guide in the right direction. Thanks.
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L734
> [2]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L743
> [3]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L741
> [4]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L919
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-24 Thread Vinay Shukla
+1
Tested on HDP 2.3, YARN cluster mode, spark-shell

On Wed, Dec 23, 2015 at 6:14 AM, Allen Zhang  wrote:

>
> +1 (non-binding)
>
> I have just built a new binary tarball and tested am.nodelabelexpression and
> executor.nodelabelexpression manually; the result is as expected.
>
>
>
>
> At 2015-12-23 21:44:08, "Iulian Dragoș" 
> wrote:
>
> +1 (non-binding)
>
> Tested Mesos deployments (client and cluster-mode, fine-grained and
> coarse-grained). Things look good.
>
> iulian
>
> On Wed, Dec 23, 2015 at 2:35 PM, Sean Owen  wrote:
>
>> Docker integration tests still fail for Mark and me, and should
>> probably be disabled:
>> https://issues.apache.org/jira/browse/SPARK-12426
>>
>> ... but if anyone else successfully runs these (and I assume Jenkins
>> does) then not a blocker.
>>
>> I'm having intermittent trouble with other tests passing, but nothing
>> unusual.
>> Sigs and hashes are OK.
>>
>> We have 30 issues fixed for 1.6.1. All but those resolved in the last
>> 24 hours or so should be fixed for 1.6.0 right? I can touch that up.
>>
>>
>>
>>
>>
>> On Tue, Dec 22, 2015 at 8:10 PM, Michael Armbrust
>>  wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.6.0!
>> >
>> > The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>> passes if
>> > a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.6.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v1.6.0-rc4
>> > (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1176/
>> >
>> > The test repository (versioned as v1.6.0-rc4) for this release can be
>> found
>> > at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1175/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>> >
>> > ===
>> > == How can I help test this release? ==
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > 
>> > == What justifies a -1 vote for this release? ==
>> > 
>> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> > should only occur for significant regressions from 1.5. Bugs already
>> present
>> > in 1.5, minor regressions, or bugs related to new features will not
>> block
>> > this release.
>> >
>> > ===
>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>> > ===
>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>> > branch-1.6, since documentations will be published separately from the
>> > release.
>> > 2. New features for non-alpha-modules should target 1.7+.
>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>> target
>> > version.
>> >
>> >
>> > ==
>> > == Major changes to help you focus your testing ==
>> > ==
>> >
>> > Notable changes since 1.6 RC3
>> >
>> >
>> >   - SPARK-12404 - Fix serialization error for Datasets with
>> > Timestamps/Arrays/Decimal
>> >   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>> >   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>> >   - SPARK-12413 - Fix mesos HA
>> >
>> >
>> > Notable changes since 1.6 RC2
>> >
>> >
>> > - SPARK_VERSION has been set correctly
>> > - SPARK-12199 ML Docs are publishing correctly
>> > - SPARK-12345 Mesos cluster mode has been fixed
>> >
>> > Notable changes since 1.6 RC1
>> >
>> > Spark Streaming
>> >
>> > SPARK-2629  trackStateByKey has been renamed to mapWithState
>> >
>> > Spark SQL
>> >
>> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
>> execution.
>> > SPARK-12258 correct passing null into ScalaUDF
>> >
>> > Notable Features Since 1.5
>> >
>> > Spark SQL
>> >
>> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
>> > using flat schemas.
>> > SPARK-10810 Session Management - Isolated default database (i.e. USE
>> mydb)
>>

Shuffle Write Size

2015-12-24 Thread gsvic
Is there any formula with which I could determine the shuffle write size before
execution?

For example, in a sort-merge join, in the stage in which the first table is
being loaded, the shuffle write is 429.2 MB. The table is 5.5 GB in HDFS
with a block size of 128 MB, so it is loaded in 45 tasks/partitions.
How does this 5.5 GB result in 429.2 MB? Could I determine it before execution?

Environment:
#Workers = 2
#Cores/Worker = 4
#Assigned Memory / Worker = 512M

spark.shuffle.partitions=200
spark.shuffle.compress=false
spark.shuffle.memoryFraction=0.1
spark.shuffle.spill=true
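
As a purely arithmetical aside on the task count (it says nothing about the
shuffle write itself, which is the actual question):

    5.5 GB  ≈ 5.5 * 1024 MB = 5632 MB
    5632 MB / 128 MB per block = 44 full blocks

so with 5.5 GB being a rounded figure, one partially filled trailing block
brings the count to the 45 tasks/partitions observed. The 429.2 MB shuffle
write, by contrast, reflects what the map side actually serializes and emits,
not the raw HDFS size of the input.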






Re: Downloading Hadoop from s3://spark-related-packages/

2015-12-24 Thread Nicholas Chammas
not that likely to get an answer as it’s really a support call, not a
bug/task.

The first question is about proper documentation of all the stuff we’ve
been discussing in this thread, so one would think that’s a valid task. It
doesn’t seem right that closer.lua, for example, is undocumented. Either
it’s not meant for public use (and I am not an intended user), or there
should be something out there that explains how to use it.

I’m not looking for much; just some basic info that covers the various
things I’ve had to piece together from mailing lists and Google.

there’s no mirroring, if you install to lots of machines your download time
will be slow. You could automate it though, do something like D/L, upload
to your own bucket, do an s3 GET.

Yeah, this is what I’m probably going to do eventually—just use my own S3
bucket.

It’s disappointing that, at least as far as I can tell, the Apache
foundation doesn’t have a fast CDN or something like that to serve its
files. So users like me are left needing to come up with their own solution
if they regularly download Apache software to many machines in an automated
fashion.

Now, perhaps Apache mirrors are not meant to be used in this way. Perhaps
they’re just meant for people to do the one-off download to their personal
machines and that’s it. That’s totally fine! But that goes back to my first
question from the ticket—there should be a simple doc that spells this out
for us if that’s the case: “Don’t use the mirror network for automated
provisioning/deployments.” That would suffice. But as things stand now, I
have to guess and wonder at this stuff.

Nick

On Thu, Dec 24, 2015 at 5:43 AM Steve Loughran wrote:

>
> On 24 Dec 2015, at 05:59, Nicholas Chammas 
> wrote:
>
> FYI: I opened an INFRA ticket with questions about how best to use the
> Apache mirror network.
>
> https://issues.apache.org/jira/browse/INFRA-10999
>
> Nick
>
>
>
> not that likely to get an answer as it's really a support call, not a
> bug/task. You never know though.
>
> There's another way to get at binaries, which is check them out direct
> from SVN
>
> https://dist.apache.org/repos/dist/release/
>
> This is a direct view into how you release things in the ASF (you just
> create a new dir under your project, copy the files and then do an svn
> commit; I believe the replicated servers may just do svn update on their
> local cache.
>
> there's no mirroring, if you install to lots of machines your download
> time will be slow. You could automate it though, do something like D/L,
> upload to your own bucket, do an s3 GET.
>


Re: Shuffle Write Size

2015-12-24 Thread Xingchi Wang
I think the shuffle write size depends not on the size of your data but on the
join operation. Maybe your join doesn't need to shuffle all of the data, because
the table's data is already in the right partitions, so less shuffle write is
needed. Is that possible?

2015-12-25 0:53 GMT+08:00 gsvic :

> Is there any formula with which I could determine Shuffle Write before
> execution?
>
> For example, in Sort Merge join in the stage in which the first table is
> being loaded the shuffle write is 429.2 MB. The table is 5.5G in the HDFS
> with block size 128 MB. Consequently is being loaded in 45
> tasks/partitions.
> How this 5.5 GB results in 429 MB? Could I determine it before execution?
>
> Environment:
> #Workers = 2
> #Cores/Worker = 4
> #Assigned Memory / Worker = 512M
>
> spark.shuffle.partitions=200
> spark.shuffle.compress=false
> spark.shuffle.memoryFraction=0.1
> spark.shuffle.spill=true
>
>
>
>
>


latest Spark build error

2015-12-24 Thread salexln
 Hi all,

I'm getting a build error when trying to build a clean version of the latest
Spark. I did the following:

1) git clone https://github.com/apache/spark.git
2) build/mvn -DskipTests clean package

But I get the following error:

Spark Project Parent POM .. FAILURE [2.338s]
...
BUILD FAILURE
...
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
(enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
failed. Look above for specific messages explaining why the rule failed. ->
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please
read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException



I'm running Lubuntu 14.04 with the following:

java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
Apache Maven 3.0.5 






Re: latest Spark build error

2015-12-24 Thread Kazuaki Ishizaki
This is because building Spark requires Maven 3.3.3 or later:
http://spark.apache.org/docs/latest/building-spark.html

Regards,
Kazuaki Ishizaki



From:   salexln 
To: dev@spark.apache.org
Date:   2015/12/25 15:52
Subject: latest Spark build error



 Hi all,

I'm getting build error when trying to build a clean version of latest
Spark. I did the following

1) git clone https://github.com/apache/spark.git
2) build/mvn -DskipTests clean package

But I get the following error:

Spark Project Parent POM .. FAILURE [2.338s]
...
BUILD FAILURE
...
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
(enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
failed. Look above for specific messages explaining why the rule failed. 
->
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the 
-e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, 
please
read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException



I'm running Lubuntu 14.04 with the following:

java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
Apache Maven 3.0.5 









How can I get the column data based on a specific column name and then store the data in an array or list?

2015-12-24 Thread zml张明磊

Hi,

   I am new to Scala and Spark and am trying to find the relevant API in 
DataFrame to solve the problem described in the title. However, I can only find 
the API DataFrame.col(colName : String) : Column, which returns an object of 
Column, not the content. If DataFrame supported an API like Column.toArray : 
Type, that would be enough for me. But it doesn't. How can I achieve this?

Thanks,
Minglei.


Re: How can I get the column data based on a specific column name and then store the data in an array or list?

2015-12-24 Thread Jeff Zhang
Not sure what you mean. Do you want to choose some columns of a Row and
convert them to an Array?

On Fri, Dec 25, 2015 at 3:35 PM, zml张明磊  wrote:

>
>
> Hi,
>
>
>
>I am a new to Scala and Spark and trying to find relative API in 
> DataFrame
> to solve my problem as title described. However, I just only find this API 
> *DataFrame.col(colName
> : String) : Column * which returns an object of Column. Not the content.
> If only DataFrame support such API which like *Column.toArray : Type* is
> enough for me. But now, it doesn’t. How can I do can achieve this function
> ?
>
>
>
> Thanks,
>
> Minglei.
>



-- 
Best Regards

Jeff Zhang


Re: latest Spark build error

2015-12-24 Thread salexln
Updating Maven to version 3.3.9 solved the issue.

Thanks everyone!







Re: How can I get the column data based on a specific column name and then store the data in an array or list?

2015-12-24 Thread zml张明磊
Thanks, Jeff. It's not about choosing some columns of a Row. It's about choosing 
all of the data in one column and converting it to an Array. Do you see what I mean?

In Chinese: I want to select all of the data in the column with that name and put it into an array.
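
A minimal sketch of one way to do this with the 1.6-era DataFrame API; the
column name and element type are illustrative, so adjust them to your schema:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ColumnToArray {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("column-to-array").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy DataFrame; in practice this would come from your own data source.
        val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("name", "value")

        // Select the single column, pull it back to the driver, and unwrap the Rows.
        // collect() brings the whole column into driver memory, so this is only
        // sensible when the column fits there.
        val names: Array[String] = df.select("name").collect().map(_.getString(0))
        names.foreach(println)

        sc.stop()
      }
    }

If the column is too large to collect, df.select("name").rdd.map(_.getString(0))
keeps it distributed as an RDD instead of materializing an Array.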


From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: 25 December 2015 15:39
To: zml张明磊
Cc: dev@spark.apache.org
Subject: Re: How can I get the column data based on a specific column name and 
then store the data in an array or list?

Not sure what you mean. Do you want to choose some columns of a Row and convert 
it to an Arrray ?

On Fri, Dec 25, 2015 at 3:35 PM, zml张明磊 <mingleizh...@ctrip.com> wrote:

Hi,

   I am a new to Scala and Spark and trying to find relative API in 
DataFrame to solve my problem as title described. However, I just only find 
this API DataFrame.col(colName : String) : Column which returns an object of 
Column. Not the content. If only DataFrame support such API which like 
Column.toArray : Type is enough for me. But now, it doesn’t. How can I do can 
achieve this function ?

Thanks,
Minglei.



--
Best Regards

Jeff Zhang