Re: Connect to remote YARN cluster

2015-04-09 Thread Steve Loughran

> On 9 Apr 2015, at 17:42, Marcelo Vanzin  wrote:
> 
> If YARN is authenticating users it's probably running on kerberos, so
> you need to log in with your kerberos credentials (kinit) before
> submitting an application.

also: make sure that you have the full-strength JCE policy files and not the
crippled default crypto; every time you upgrade the JDK you are likely to have
to re-install them. Java gives no useful error messages on this or any other
Kerberos problem.
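
As an aside, a quick way to probe for that from code -- a minimal sketch using
only the plain JDK API, nothing Spark-specific:

    import javax.crypto.Cipher

    object JceCheck {
      def main(args: Array[String]): Unit = {
        // With only the default export policy installed, AES is capped at
        // 128 bits, which is what breaks AES256 Kerberos tickets.
        val maxAes = Cipher.getMaxAllowedKeyLength("AES")
        if (maxAes < 256) {
          println(s"AES limited to $maxAes bits: unlimited JCE policy missing")
        } else {
          println(s"AES allows up to $maxAes bits: unlimited JCE policy installed")
        }
      }
    }

Run it with the same JDK that submits the job; if it reports the 128-bit cap,
re-install the policy files.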




Re: finding free ports for tests

2015-04-09 Thread Steve Loughran

On 8 Apr 2015, at 20:19, Hari Shreedharan <hshreedha...@cloudera.com> wrote:

One good way to guarantee your tests will work is to have your server bind to 
an ephemeral port and then query it to find the port it is running on. This 
ensures that race conditions don’t cause test failures.


yes, that's what I'm doing; the classic tactic. I find the tests fail if the
laptop doesn't know its own name, but so do others.
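
For anyone looking for the pattern, a minimal sketch of that tactic with plain
java.net (not Spark's own Utils helper): bind to port 0 so the OS picks a free
ephemeral port, then ask the socket which port it actually got.

    import java.net.ServerSocket

    object EphemeralPortExample {
      def main(args: Array[String]): Unit = {
        // In a real test the server under test is the thing that binds to
        // port 0; querying it afterwards avoids the find-then-bind race.
        val socket = new ServerSocket(0)
        try {
          val port = socket.getLocalPort
          println(s"listening on ephemeral port $port")
          // ... point the test client at `port` here ...
        } finally {
          socket.close()
        }
      }
    }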


Thanks,
Hari



On Wed, Apr 8, 2015 at 3:24 AM, Sean Owen <so...@cloudera.com> wrote:

Utils.startServiceOnPort?

On Wed, Apr 8, 2015 at 6:16 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> I'm writing some functional tests for the SPARK-1537 JIRA, Yarn timeline 
> service integration, for which I need to allocate some free ports.
>
> I don't want to hard code them in as that can lead to unreliable tests, 
> especially on Jenkins.
>
> Before I implement the logic myself - is there a utility class/trait for
> finding ports for tests?
>





Re: enum-like types in Spark

2015-04-09 Thread Imran Rashid
any update here?  This is relevant for a currently open PR of mine -- I've
got a bunch of new public constants defined w/ format #4, but I'd gladly
switch to java enums.  (Even if we are just going to postpone this
decision, I'm still inclined to switch to java enums ...)

just to be clear about the existing problem with enums & scaladoc: right
now, the scaladoc knows about the enum class, and generates a page for it,
but it does not display the enum constants.  It is at least labeled as a
java enum, though, so a savvy user could switch to the javadocs to see the
constants.
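
For readers without the PR handy, a rough sketch of the "format #4" style being
compared against a java enum -- my own illustration, not the actual code from
the PR (the real constants are behind the links quoted below):

    // Sealed-object style: each constant name ends up repeated several
    // times, and the `values` sequence is maintained by hand.
    sealed abstract class StageStatus(val name: String) {
      override def toString: String = name
    }

    object StageStatus {
      case object Active   extends StageStatus("ACTIVE")
      case object Complete extends StageStatus("COMPLETE")
      case object Failed   extends StageStatus("FAILED")

      // The compiler will not keep this in sync with the case objects above.
      val values: Seq[StageStatus] = Seq(Active, Complete, Failed)

      // Hand-rolled stand-in for Enum.valueOf.
      def fromString(s: String): StageStatus =
        values.find(_.name == s).getOrElse(
          throw new IllegalArgumentException(s"unknown stage status: $s"))
    }

The equivalent java enum declares each constant once and gets values(),
valueOf(), EnumMap and EnumSet for free, which is the trade-off discussed
below.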



On Mon, Mar 23, 2015 at 4:50 PM, Imran Rashid  wrote:

> well, perhaps I overstated things a little, I wouldn't call it the
> "official" solution, just a recommendation in the never-ending debate (and
> the recommendation from folks with their hands on scala itself).
>
> Even if we do get this fixed in scaladoc eventually -- as it's not in the
> current versions, where does that leave this proposal?  personally I'd
> *still* prefer java enums, even if it doesn't get into scaladoc.  btw, even
> with sealed traits, the scaladoc still isn't great -- you don't see the
> values from the class, you only see them listed from the companion object.
>  (though, that is somewhat standard for scaladoc, so maybe I'm reaching a
> little)
>
>
>
> On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell 
> wrote:
>
>> If the official solution from the Scala community is to use Java
>> enums, then it seems strange they aren't generated in scaladoc? Maybe
>> we can just fix that w/ Typesafe's help and then we can use them.
>>
>> On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen  wrote:
>> > Yeah the fully realized #4, which gets back the ability to use it in
>> > switch statements (? in Scala but not Java?) does end up being kind of
>> > huge.
>> >
>> > I confess I'm swayed a bit back to Java enums, seeing what it
>> > involves. The hashCode() issue can be 'solved' with the hash of the
>> > String representation.
>> >
>> > On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid 
>> > wrote:
>> >> I've just switched some of my code over to the new format, and I just
>> >> want to make sure everyone realizes what we are getting into.  I went
>> >> from 10 lines as java enums
>> >>
>> >> https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
>> >>
>> >> to 30 lines with the new format:
>> >>
>> >> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
>> >>
>> >> It's not just that it's verbose.  Each name has to be repeated 4 times,
>> >> with potential typos in some locations that won't be caught by the
>> >> compiler.  Also, you have to manually maintain the "values" as you
>> >> update the set of enums; the compiler won't do it for you.
>> >>
>> >> The only downside I've heard for java enums is enum.hashCode().  OTOH,
>> >> the downsides for this version are: maintainability / verbosity, no
>> >> values(), more cumbersome to use from java, no enum map / enumset.
>> >>
>> >> I did put together a little util to at least get back the equivalent of
>> >> enum.valueOf() with this format
>> >>
>> >> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
>> >>
>> >> I'm not trying to prevent us from moving forward on this, it's fine if
>> >> this is still what everyone wants, but I feel pretty strongly java enums
>> >> make more sense.
>> >>
>> >> thanks,
>> >> Imran
>> >
>>
>
>


Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-09 Thread shane knapp
ok, we're looking good.  i'll keep an eye on this for the rest of the day,
and if you happen to notice any infrastructure failures before i do (i
updated a LOT), please let me know immediately!  :)

On Thu, Apr 9, 2015 at 8:38 AM, shane knapp  wrote:

> things are looking pretty good and i expect to be done within an hour.
>  i've got some test builds running right now, and will give the green light
> when they successfully complete.
>
> On Thu, Apr 9, 2015 at 7:29 AM, shane knapp  wrote:
>
>> and this is now happening.
>>
>> On Tue, Apr 7, 2015 at 4:38 PM, shane knapp  wrote:
>>
>>> reminder!  this is happening thursday morning.
>>>
>>> On Fri, Apr 3, 2015 at 9:59 AM, shane knapp  wrote:
>>>
 welcome to python2.7+, java 8 and more!  :)

 i'll be doing a major upgrade to our build system next thursday
 morning.  here's a quick list of what's going on:

 * installation of anaconda python on all worker nodes

 * installation of pypy 2.5.1 (python 2.7) on all nodes

 * matching installation of python modules for the current system python
 (2.6), and anaconda python (2.6, 2.7 and 3.4)
   - anaconda python 2.7 will be the default for all workers (this has
 stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
 and i've noticed no test failures)
   - you can now use anaconda environments to specify which version of
 python to use in your tests:  http://www.continuum.io/blog/conda

 * installation of new python 2.7 modules:  pymongo, requests, six, python-crontab

 * bare-bones mongodb installation on all workers

 * installation of java 1.6 and 1.8 internal to jenkins
   - jobs will default to the system java, which is 1.7.0_75
   - if you want to run your tests w/java 6 or 8, you can select the JDK
 version of your choice in the job configuration page (it'll be towards the
 top)

 these changes have actually all been tested against a variety of builds
 (yay staging!) and while i'm certain that i have all of the kinks worked
 out, i'm going to schedule a longer downtime so that i have a chance to
 identify and squash any problems that surface.

 thanks to josh rosen, k. shankari and davies liu for helping me test
 all of this and get it working.

 shane

>>>
>>>
>>
>


Re: Spark remote communication pattern

2015-04-09 Thread Reynold Xin
For torrent broadcast, data are read directly through the block manager:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L167
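
To see that path exercised from user code, a minimal sketch (my own
illustration; the broadcast contents are placeholders): the driver stores the
broadcast as blocks, and the first read of .value on each executor pulls those
blocks through its block manager.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("broadcast-example").setMaster("local[2]"))

        // Stored as broadcast blocks in the driver's block manager.
        val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

        // Tasks read lookup.value; on a real cluster the first read on each
        // executor fetches the blocks over the network via the block manager.
        val result = sc.parallelize(1 to 3).map(i => lookup.value(i)).collect()
        println(result.mkString(", "))

        sc.stop()
      }
    }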



On Thu, Apr 9, 2015 at 7:27 AM, Zoltán Zvara  wrote:

> Thanks! I've found the fetcher! Are there any other places or cases where
> blocks travel over the network?
>
> Zvara Zoltán
>
>
>
> mail, hangout, skype: zoltan.zv...@gmail.com
>
> mobile, viber: +36203129543
>
> bank: 10918001-0021-50480008
>
> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>
> elte: HSKSJZ (ZVZOAAI.ELTE)
>
> 2015-04-09 10:24 GMT+02:00 Reynold Xin :
>
>> Take a look at the following two files:
>>
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
>>
>> and
>>
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
>>
>> On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara 
>> wrote:
>>
>>> Dear Developers,
>>>
>>> I'm trying to investigate the communication pattern regarding data-flow
>>> during execution of a Spark program defined by an RDD chain. I'm
>>> investigating from the Task point of view, and found out that the task
>>> type
>>> ResultTask (as retrieving the iterator for its RDD for a given
>>> partition),
>>> effectively asks the BlockManager to get the block from local or remote
>>> location. What I do there is to include actual location data in
>>> BlockResult
>>> so the task can tell where it retrieved the data from. I've found out
>>> that
>>> ResultTask can issue a data-flow only in this case.
>>>
>>> What's the case with the ShuffleMapTask? What happens there? I'm trying
>>> to
>>> log locations which are included in the shuffle process. I would be happy
>>> to receive a few hints regarding where remote communication is managed in
>>> case of ShuffleMapTask.
>>>
>>> Thanks!
>>>
>>> Zoltán
>>>
>>
>>
>


Re: Connect to remote YARN cluster

2015-04-09 Thread Marcelo Vanzin
If YARN is authenticating users it's probably running on kerberos, so
you need to log in with your kerberos credentials (kinit) before
submitting an application.
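
If you would rather authenticate from inside the driver process than rely on
kinit -- e.g. when launching from an IDE -- one possible approach (not from
this thread) is Hadoop's UserGroupInformation; a hedged sketch, assuming you
have a keytab, with placeholder principal and paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    object KeytabLogin {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Tell Hadoop's security layer the cluster uses Kerberos.
        conf.set("hadoop.security.authentication", "kerberos")
        UserGroupInformation.setConfiguration(conf)

        // Placeholder principal and keytab path -- substitute your own.
        UserGroupInformation.loginUserFromKeytab(
          "zoltan@EXAMPLE.COM", "/path/to/zoltan.keytab")

        println("logged in as: " + UserGroupInformation.getCurrentUser)
        // ... create the SparkContext / submit to YARN after this point ...
      }
    }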

On Thu, Apr 9, 2015 at 4:57 AM, Zoltán Zvara  wrote:
> I'm trying to debug Spark in yarn-client mode. On my local, single node
> cluster everything works fine, but the remote YARN resource manager throws
> away my request because of authentication error. I'm running IntelliJ 14 on
> Ubuntu and the driver tries to connect to YARN with my local user name. How
> can I force IntelliJ to run my code with a different user? Or how can I set
> up the connection to YARN RM with auth data?
>
> Thanks!
>
> Zvara Zoltán
>
>
>
> mail, hangout, skype: zoltan.zv...@gmail.com
>
> mobile, viber: +36203129543
>
> bank: 10918001-0021-50480008
>
> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>
> elte: HSKSJZ (ZVZOAAI.ELTE)



-- 
Marcelo




Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-09 Thread shane knapp
things are looking pretty good and i expect to be done within an hour.
 i've got some test builds running right now, and will give the green light
when they successfully complete.

On Thu, Apr 9, 2015 at 7:29 AM, shane knapp  wrote:

> and this is now happening.
>
> On Tue, Apr 7, 2015 at 4:38 PM, shane knapp  wrote:
>
>> reminder!  this is happening thursday morning.
>>
>> On Fri, Apr 3, 2015 at 9:59 AM, shane knapp  wrote:
>>
>>> welcome to python2.7+, java 8 and more!  :)
>>>
>>> i'll be doing a major upgrade to our build system next thursday morning.
>>>  here's a quick list of what's going on:
>>>
>>> * installation of anaconda python on all worker nodes
>>>
>>> * installation of pypy 2.5.1 (python 2.7) on all nodes
>>>
>>> * matching installation of python modules for the current system python
>>> (2.6), and anaconda python (2.6, 2.7 and 3.4)
>>>   - anaconda python 2.7 will be the default for all workers (this has
>>> stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
>>> and i've noticed no test failures)
>>>   - you can now use anaconda environments to specify which version of
>>> python to use in your tests:  http://www.continuum.io/blog/conda
>>>
>>> * installation of new python 2.7 modules:  pymongo, requests, six, python-crontab
>>>
>>> * bare-bones mongodb installation on all workers
>>>
>>> * installation of java 1.6 and 1.8 internal to jenkins
>>>   - jobs will default to the system java, which is 1.7.0_75
>>>   - if you want to run your tests w/java 6 or 8, you can select the JDK
>>> version of your choice in the job configuration page (it'll be towards the
>>> top)
>>>
>>> these changes have actually all been tested against a variety of builds
>>> (yay staging!) and while i'm certain that i have all of the kinks worked
>>> out, i'm going to schedule a longer downtime so that i have a chance to
>>> identify and squash any problems that surface.
>>>
>>> thanks to josh rosen, k. shankari and davies liu for helping me test all
>>> of this and get it working.
>>>
>>> shane
>>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-09 Thread Sean McNamara
+1 tested on OS X

Sean

> On Apr 7, 2015, at 11:46 PM, Patrick Wendell  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 1.3.1!
> 
> The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
> 
> The list of fixes present in this release can be found at:
> http://bit.ly/1C2nVPY
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc2/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1083/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
> 
> The patches on top of RC1 are:
> 
> [SPARK-6737] Fix memory leak in OutputCommitCoordinator
> https://github.com/apache/spark/pull/5397
> 
> [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
> https://github.com/apache/spark/pull/5302
> 
> [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
> NoClassDefFoundError
> https://github.com/apache/spark/pull/4933
> 
> Please vote on releasing this package as Apache Spark 1.3.1!
> 
> The vote is open until Saturday, April 11, at 07:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.3.1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 





Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-09 Thread shane knapp
and this is now happening.

On Tue, Apr 7, 2015 at 4:38 PM, shane knapp  wrote:

> reminder!  this is happening thursday morning.
>
> On Fri, Apr 3, 2015 at 9:59 AM, shane knapp  wrote:
>
>> welcome to python2.7+, java 8 and more!  :)
>>
>> i'll be doing a major upgrade to our build system next thursday morning.
>>  here's a quick list of what's going on:
>>
>> * installation of anaconda python on all worker nodes
>>
>> * installation of pypy 2.5.1 (python 2.7) on all nodes
>>
>> * matching installation of python modules for the current system python
>> (2.6), and anaconda python (2.6, 2.7 and 3.4)
>>   - anaconda python 2.7 will be the default for all workers (this has
>> stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
>> and i've noticed no test failures)
>>   - you can now use anaconda environments to specify which version of
>> python to use in your tests:  http://www.continuum.io/blog/conda
>>
>> * installation of new python 2.7 modules:  pymongo, requests, six, python-crontab
>>
>> * bare-bones mongodb installation on all workers
>>
>> * installation of java 1.6 and 1.8 internal to jenkins
>>   - jobs will default to the system java, which is 1.7.0_75
>>   - if you want to run your tests w/java 6 or 8, you can select the JDK
>> version of your choice in the job configuration page (it'll be towards the
>> top)
>>
>> these changes have actually all been tested against a variety of builds
>> (yay staging!) and while i'm certain that i have all of the kinks worked
>> out, i'm going to schedule a longer downtime so that i have a chance to
>> identify and squash any problems that surface.
>>
>> thanks to josh rosen, k. shankari and davies liu for helping me test all
>> of this and get it working.
>>
>> shane
>>
>
>


Re: Spark remote communication pattern

2015-04-09 Thread Zoltán Zvara
Thanks! I've found the fetcher! Are there any other places or cases where
blocks travel over the network?

Zvara Zoltán



mail, hangout, skype: zoltan.zv...@gmail.com

mobile, viber: +36203129543

bank: 10918001-0021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)

2015-04-09 10:24 GMT+02:00 Reynold Xin :

> Take a look at the following two files:
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
>
> and
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
>
> On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara 
> wrote:
>
>> Dear Developers,
>>
>> I'm trying to investigate the communication pattern regarding data-flow
>> during execution of a Spark program defined by an RDD chain. I'm
>> investigating from the Task point of view, and found out that the task
>> type
>> ResultTask (as retrieving the iterator for its RDD for a given partition),
>> effectively asks the BlockManager to get the block from local or remote
>> location. What I do there is to include actual location data in
>> BlockResult
>> so the task can tell where it retrieved the data from. I've found out that
>> ResultTask can issue a data-flow only in this case.
>>
>> What's the case with the ShuffleMapTask? What happens there? I'm trying to
>> log locations which are included in the shuffle process. I would be happy
>> to receive a few hints regarding where remote communication is managed in
>> case of ShuffleMapTask.
>>
>> Thanks!
>>
>> Zoltán
>>
>
>


Connect to remote YARN cluster

2015-04-09 Thread Zoltán Zvara
I'm trying to debug Spark in yarn-client mode. On my local, single node
cluster everything works fine, but the remote YARN resource manager throws
away my request because of authentication error. I'm running IntelliJ 14 on
Ubuntu and the driver tries to connect to YARN with my local user name. How
can I force IntelliJ to run my code with a different user? Or how can I set
up the connection to YARN RM with auth data?

Thanks!

Zvara Zoltán



mail, hangout, skype: zoltan.zv...@gmail.com

mobile, viber: +36203129543

bank: 10918001-0021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)


Re: Spark remote communication pattern

2015-04-09 Thread Reynold Xin
Take a look at the following two files:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala

and

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
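
For a concrete trace -- my own illustration, not code from this thread -- the
map side of reduceByKey runs as ShuffleMapTasks that write shuffle output via
the block manager, and the final collect stage runs as ResultTasks whose input
is pulled (remotely, on a real cluster) by the fetcher classes above:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleTrace {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shuffle-trace").setMaster("local[2]"))

        val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)

        // Stage 1: ShuffleMapTasks -- each partition writes its map output
        // (shuffle blocks) locally through the block manager.
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // Stage 2: ResultTasks -- each reducer fetches the shuffle blocks it
        // needs (ShuffleBlockFetcherIterator), locally or over the network.
        println(counts.collect().mkString(", "))

        sc.stop()
      }
    }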

On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara  wrote:

> Dear Developers,
>
> I'm trying to investigate the communication pattern regarding data-flow
> during execution of a Spark program defined by an RDD chain. I'm
> investigating from the Task point of view, and found out that the task type
> ResultTask (as retrieving the iterator for its RDD for a given partition),
> effectively asks the BlockManager to get the block from local or remote
> location. What I do there is to include actual location data in BlockResult
> so the task can tell where it retrieved the data from. I've found out that
> ResultTask can issue a data-flow only in this case.
>
> What's the case with the ShuffleMapTask? What happens there? I'm trying to
> log locations which are included in the shuffle process. I would be happy
> to receive a few hints regarding where remote communication is managed in
> case of ShuffleMapTask.
>
> Thanks!
>
> Zoltán
>


Spark remote communication pattern

2015-04-09 Thread Zoltán Zvara
Dear Developers,

I'm trying to investigate the communication pattern regarding data flow
during execution of a Spark program defined by an RDD chain. I'm
investigating from the Task point of view, and found that the task type
ResultTask (when retrieving the iterator for its RDD for a given partition)
effectively asks the BlockManager to get the block from a local or remote
location. What I do there is include the actual location data in BlockResult
so the task can tell where it retrieved the data from. I've found that
ResultTask can only issue a data flow in this case.

What's the case with ShuffleMapTask? What happens there? I'm trying to
log the locations involved in the shuffle process. I would be happy
to receive a few hints on where remote communication is managed in
the case of ShuffleMapTask.

Thanks!

Zoltán