toLocalIterator creates as many jobs as # of partitions, and it ends up spamming Spark UI

2015-03-13 Thread Mingyu Kim
Hi all,

RDD.toLocalIterator() creates as many jobs as there are partitions, and it spams 
the Spark UI, especially when the method is used on an RDD with hundreds or 
thousands of partitions.

Does anyone have a way to work around this issue? What do people think about 
introducing a SparkContext local property (analogous to “spark.scheduler.pool” 
set as a thread-local property) that determines if the job info should be shown 
on the Spark UI?

Thanks,
Mingyu
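
For context, a minimal sketch of the behaviour described above, plus the style of
thread-local workaround being proposed (the property name "spark.ui.showJobs" is
hypothetical, not an existing Spark option):

import org.apache.spark.{SparkConf, SparkContext}

object LocalIteratorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("toLocalIterator-demo").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 100000, numSlices = 200)

    // Analogous to how "spark.scheduler.pool" is set per thread today; the
    // property name below is purely illustrative and has no effect in Spark.
    sc.setLocalProperty("spark.ui.showJobs", "false")

    // toLocalIterator fetches one partition per job, so this loop schedules
    // roughly 200 jobs -- which is what clutters the UI.
    rdd.toLocalIterator.foreach(_ => ())

    sc.setLocalProperty("spark.ui.showJobs", null) // clear the thread-local property
    sc.stop()
  }
}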


Re: Spilling when not expected

2015-03-13 Thread Reynold Xin
How did you run the Spark command? Maybe the memory setting didn't actually
apply? How much memory does the web ui say is available?

BTW - I don't think any JVM can actually handle 700G heap ... (maybe Zing).

On Thu, Mar 12, 2015 at 4:09 PM, Tom Hubregtsen 
wrote:

> Hi all,
>
> I'm running the teraSort benchmark with a relative small input set: 5GB.
> During profiling, I can see I am using a total of 68GB. I've got a terabyte
> of memory in my system, and set
> spark.executor.memory 900g
> spark.driver.memory 900g
> I use the default for
> spark.shuffle.memoryFraction
> spark.storage.memoryFraction
> I believe that I now have 0.2*900=180GB for shuffle and 0.6*900=540GB for
> storage.
>
> I noticed a lot of variation in runtime (under the same load), and tracked
> this down to this function in
> core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
>   private def spillToPartitionFiles(collection:
> SizeTrackingPairCollection[(Int, K), C]): Unit = {
> spillToPartitionFiles(collection.iterator)
>   }
> In a slow run, it would loop through this function 12000 times, in a fast
> run only 700 times, even though the settings in both runs are the same and
> there are no other users on the system. When I look at the function calling
> this (insertAll, also in ExternalSorter), I see that spillToPartitionFiles
> is only called 700 times in both fast and slow runs, meaning that the
> function recursively calls itself very often. Because of the function name,
> I assume the system is spilling to disk. As I have sufficient memory, I
> assume that I forgot to set a certain memory setting. Anybody any idea
> which
> other setting I have to set, in order to not spill data in this scenario?
>
> Thanks,
>
> Tom
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spilling-when-not-expected-tp11017.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
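
For reference, a small sketch mirroring the arithmetic Tom describes, using the
Spark 1.x defaults of 0.2 for spark.shuffle.memoryFraction and 0.6 for
spark.storage.memoryFraction (the additional safety fractions Spark applies
internally are ignored here):

import org.apache.spark.SparkConf

object MemoryFractions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.executor.memory", "900g")
      .set("spark.shuffle.memoryFraction", "0.2")  // Spark 1.x default
      .set("spark.storage.memoryFraction", "0.6")  // Spark 1.x default

    val executorGb = conf.get("spark.executor.memory").stripSuffix("g").toDouble
    val shuffleGb  = executorGb * conf.get("spark.shuffle.memoryFraction").toDouble
    val storageGb  = executorGb * conf.get("spark.storage.memoryFraction").toDouble
    println(f"shuffle: $shuffleGb%.0f GB, storage: $storageGb%.0f GB")  // 180 GB / 540 GB
  }
}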


May we merge into branch-1.3 at this point?

2015-03-13 Thread Sean Owen
Is the release certain enough that we can resume merging into
branch-1.3 at this point? I have a number of back-ports queued up and
didn't want to merge in case another last RC was needed. I see a few
commits to the branch though.




Spark config option 'expression language' feedback request

2015-03-13 Thread Dale Richardson

PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to allow for 
Spark configuration options (whether on command line, environment variable or a 
configuration file) to be specified via a simple expression language.


Such a feature has the following end-user benefits:
- Allows flexibility in specifying time intervals or byte quantities in 
appropriate and easy-to-follow units, e.g. 1 week rather than 604800 
seconds

- Allows a configuration option to be scaled in relation to system 
attributes, e.g.

SPARK_WORKER_CORES = numCores - 1

SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB

- Gives the ability to scale multiple configuration options together eg:

spark.driver.memory = 0.75 * physicalMemoryBytes

spark.driver.maxResultSize = spark.driver.memory * 0.8


The following functions are currently supported by this PR:
NumCores: Number of cores assigned to the JVM (usually == Physical 
machine cores)
PhysicalMemoryBytes:  Memory size of hosting machine

JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM

JVMMaxMemoryBytes:Maximum number of bytes of memory available to the JVM

JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes


I was wondering if anybody on the mailing list has any further ideas on other 
functions that could be useful to have when specifying Spark configuration 
options?
Regards,
Dale.
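
To make the proposal concrete, here is an illustrative sketch only (not the code
in PR#4937) of how such arithmetic expressions could be evaluated with Scala's
parser combinators; unit suffixes are omitted and the variable set simply mirrors
some of the functions listed above:

import scala.util.parsing.combinator.JavaTokenParsers

object ConfigExpr extends JavaTokenParsers {
  private val vars: Map[String, Double] = Map(
    "numcores"            -> Runtime.getRuntime.availableProcessors.toDouble,
    "jvmmaxmemorybytes"   -> Runtime.getRuntime.maxMemory.toDouble,
    "jvmtotalmemorybytes" -> Runtime.getRuntime.totalMemory.toDouble
  )

  private def factor: Parser[Double] =
    floatingPointNumber ^^ (_.toDouble) |
    ident ^? { case name if vars.contains(name.toLowerCase) => vars(name.toLowerCase) } |
    "(" ~> expr <~ ")"

  private def term: Parser[Double] =
    factor ~ rep(("*" | "/") ~ factor) ^^ { case head ~ tail =>
      tail.foldLeft(head) {
        case (acc, "*" ~ v) => acc * v
        case (acc, "/" ~ v) => acc / v
      }
    }

  private def expr: Parser[Double] =
    term ~ rep(("+" | "-") ~ term) ^^ { case head ~ tail =>
      tail.foldLeft(head) {
        case (acc, "+" ~ v) => acc + v
        case (acc, "-" ~ v) => acc - v
      }
    }

  // Returns None if the whole string does not parse, e.g.
  // ConfigExpr.eval("numCores - 1") or ConfigExpr.eval("0.75 * JVMMaxMemoryBytes")
  def eval(s: String): Option[Double] = parseAll(expr, s) match {
    case Success(v, _) => Some(v)
    case _             => None
  }
}

A real implementation would add unit suffixes (kb/mb/gb, time units) and better
error reporting on top of something like this.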
  

Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread Chester At Work
Reynold, 

Prof. Canny gave me the slides yesterday. I will post the link to the 
slides to both the SF Big Analytics and SF Machine Learning meetups.

Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin  wrote:

> Thanks for chiming in, John. I missed your meetup last night - do you have
> any writeups or slides about roofline design? In particular, I'm curious
> about what optimizations are available for power-law dense * sparse? (I
> don't have any background in optimizations)
> 
> 
> 
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny  wrote:
> 
>> If you're contemplating GPU acceleration in Spark, its important to look
>> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
>> datasets we've tested in BIDMach, and we've tried to make them
>> representative of industry machine learning workloads. Unless you're
>> crunching images or audio, the majority of data will be very sparse and
>> power law distributed. You need a good sparse BLAS, and in practice it
>> seems
>> like you need a sparse BLAS tailored for power-law data. We had to write
>> our
>> own since the NVIDIA libraries didnt perform well on typical power-law
>> data.
>> Intel MKL sparse BLAS also have issues and we only use some of them.
>> 
>> You also need 2D reductions, scan operations, slicing, element-wise
>> transcendental functions and operators, many kinds of sort, random number
>> generators etc, and some kind of memory management strategy. Some of this
>> was layered on top of Thrust in BIDMat, but most had to be written from
>> scratch. Its all been rooflined, typically to memory throughput of current
>> GPUs (around 200 GB/s).
>> 
>> When you have all this you can write Learning Algorithms in the same
>> high-level primitives available in Breeze or Numpy/Scipy. Its literally the
>> same in BIDMat, since the generic matrix operations are implemented on both
>> CPU and GPU, so the same code runs on either platform.
>> 
>> A lesser known fact is that GPUs are around 10x faster for *all* those
>> operations, not just dense BLAS. Its mostly due to faster streaming memory
>> speeds, but some kernels (random number generation and transcendentals) are
>> more than an order of magnitude thanks to some specialized hardware for
>> power series on the GPU chip.
>> 
>> When you have all this there is no need to move data back and forth across
>> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
>> and feed them to the available GPUs. Most models fit comfortably in GPU
>> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
>> data through the GPU this way.
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> 




Re: Spark config option 'expression language' feedback request

2015-03-13 Thread Mridul Muralidharan
I am curious how you are going to support these over Mesos and YARN.
Any configuration change like this should be applicable to all of them, not
just the local and standalone modes.

Regards
Mridul

On Friday, March 13, 2015, Dale Richardson  wrote:

>
>
>
>
>
>
>
>
>
>
>
> PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
> allow for Spark configuration options (whether on command line, environment
> variable or a configuration file) to be specified via a simple expression
> language.
>
>
> Such a feature has the following end-user benefits:
> - Allows for the flexibility in specifying time intervals or byte
> quantities in appropriate and easy to follow units e.g. 1 week rather
> rather then 604800 seconds
>
> - Allows for the scaling of a configuration option in relation to a system
> attributes. e.g.
>
> SPARK_WORKER_CORES = numCores - 1
>
> SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
>
> - Gives the ability to scale multiple configuration options together eg:
>
> spark.driver.memory = 0.75 * physicalMemoryBytes
>
> spark.driver.maxResultSize = spark.driver.memory * 0.8
>
>
> The following functions are currently supported by this PR:
> NumCores: Number of cores assigned to the JVM (usually ==
> Physical machine cores)
> PhysicalMemoryBytes:  Memory size of hosting machine
>
> JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
>
> JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
> JVM
>
> JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
>
>
> I was wondering if anybody on the mailing list has any further ideas on
> other functions that could be useful to have when specifying spark
> configuration options?
> Regards,Dale.
>


Re: May we merge into branch-1.3 at this point?

2015-03-13 Thread Nicholas Chammas
Looks like the release is out:
http://spark.apache.org/releases/spark-release-1-3-0.html

Though, interestingly, I think we are missing the appropriate v1.3.0 tag:
https://github.com/apache/spark/releases

Nick

On Fri, Mar 13, 2015 at 6:07 AM Sean Owen  wrote:

> Is the release certain enough that we can resume merging into
> branch-1.3 at this point? I have a number of back-ports queued up and
> didn't want to merge in case another last RC was needed. I see a few
> commits to the branch though.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: May we merge into branch-1.3 at this point?

2015-03-13 Thread Sean Owen
Yeah, I'm guessing that is all happening quite literally as we speak.
The Apache git tag is the one of reference:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc

Open season on 1.3 branch then...

On Fri, Mar 13, 2015 at 4:20 PM, Nicholas Chammas
 wrote:
> Looks like the release is out:
> http://spark.apache.org/releases/spark-release-1-3-0.html
>
> Though, interestingly, I think we are missing the appropriate v1.3.0 tag:
> https://github.com/apache/spark/releases
>
> Nick
>
> On Fri, Mar 13, 2015 at 6:07 AM Sean Owen  wrote:
>>
>> Is the release certain enough that we can resume merging into
>> branch-1.3 at this point? I have a number of back-ports queued up and
>> didn't want to merge in case another last RC was needed. I see a few
>> commits to the branch though.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: May we merge into branch-1.3 at this point?

2015-03-13 Thread Mridul Muralidharan
Who is managing the 1.3 release? You might want to coordinate with them before
porting changes to the branch.

Regards
Mridul

On Friday, March 13, 2015, Sean Owen  wrote:

> Yeah, I'm guessing that is all happening quite literally as we speak.
> The Apache git tag is the one of reference:
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>
> Open season on 1.3 branch then...
>
> On Fri, Mar 13, 2015 at 4:20 PM, Nicholas Chammas
> > wrote:
> > Looks like the release is out:
> > http://spark.apache.org/releases/spark-release-1-3-0.html
> >
> > Though, interestingly, I think we are missing the appropriate v1.3.0 tag:
> > https://github.com/apache/spark/releases
> >
> > Nick
> >
> > On Fri, Mar 13, 2015 at 6:07 AM Sean Owen  > wrote:
> >>
> >> Is the release certain enough that we can resume merging into
> >> branch-1.3 at this point? I have a number of back-ports queued up and
> >> didn't want to merge in case another last RC was needed. I see a few
> >> commits to the branch though.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >>
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>


Re: Spilling when not expected

2015-03-13 Thread Tom Hubregtsen
I use the spark-submit script and the config files in a conf directory. I
see the memory settings reflected in stdout, as well as in the web UI
(it prints all variables from spark-defaults.conf, and mentions I have 540GB
of free memory available when trying to store a broadcast variable or RDD). I
also ran "ps -aux | grep java | grep th", which shows me that I called java
with "-Xms1000g -Xmx1000g".

I also tested whether these numbers are realistic for the J9 JVM. Outside of
Spark, when setting just the initial heap size (Xms), it gives an error, but
if I also define the maximum option with it (Xmx), it appears to accept it.
Also, in IBM's J9 Health Center, I see it reserve the 900g and use up to 68g.

Thanks,

Tom
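
A complementary check (sketch only) from inside the application itself, printing
the memory-related settings the driver actually sees, as a cross-check against
the ps output and the web UI:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("conf-check"))
sc.getConf.getAll
  .filter { case (k, _) => k.toLowerCase.contains("memory") || k.contains("Fraction") }
  .sorted
  .foreach { case (k, v) => println(s"$k = $v") }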

On 13 March 2015 at 02:05, Reynold Xin  wrote:

> How did you run the Spark command? Maybe the memory setting didn't
> actually apply? How much memory does the web ui say is available?
>
> BTW - I don't think any JVM can actually handle 700G heap ... (maybe Zing).
>
> On Thu, Mar 12, 2015 at 4:09 PM, Tom Hubregtsen 
> wrote:
>
>> Hi all,
>>
>> I'm running the teraSort benchmark with a relative small input set: 5GB.
>> During profiling, I can see I am using a total of 68GB. I've got a
>> terabyte
>> of memory in my system, and set
>> spark.executor.memory 900g
>> spark.driver.memory 900g
>> I use the default for
>> spark.shuffle.memoryFraction
>> spark.storage.memoryFraction
>> I believe that I now have 0.2*900=180GB for shuffle and 0.6*900=540GB for
>> storage.
>>
>> I noticed a lot of variation in runtime (under the same load), and tracked
>> this down to this function in
>> core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
>>   private def spillToPartitionFiles(collection:
>> SizeTrackingPairCollection[(Int, K), C]): Unit = {
>> spillToPartitionFiles(collection.iterator)
>>   }
>> In a slow run, it would loop through this function 12000 times, in a fast
>> run only 700 times, even though the settings in both runs are the same and
>> there are no other users on the system. When I look at the function
>> calling
>> this (insertAll, also in ExternalSorter), I see that spillToPartitionFiles
>> is only called 700 times in both fast and slow runs, meaning that the
>> function recursively calls itself very often. Because of the function
>> name,
>> I assume the system is spilling to disk. As I have sufficient memory, I
>> assume that I forgot to set a certain memory setting. Anybody any idea
>> which
>> other setting I have to set, in order to not spill data in this scenario?
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Spilling-when-not-expected-tp11017.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread jfcanny
Hi Reynold,
I left Chester with a copy of the slides, so I assume they'll be posted 
on the SF ML or Big Data sites. We have a draft paper under review. I 
can ask the co-authors about arxiv'ing it.

We have a few heuristics for power-law data. One of them is to keep the 
feature set sorted by frequency. Power-law data has roughly the same 
mass in each power-of-two range of feature frequency. By keeping the 
most frequent features together, you get a lot more value out of the 
caches on the device (even GPUs have them, albeit smaller ones). e.g. 
with 100 million features, 1/2 of the feature instances will be in the 
range 1,...,10,000. If they're consecutive, they will all hit a fast 
cache. Another 1/4 will be in 1,...,1,000,000 hitting the next cache etc.

Another is to subdivide sparse matrices using the vector of elements 
rather than rows or columns. Splitting power-law matrices by either rows 
or columns gives very uneven splits. That means we store sparse matrices 
in coordinate form rather than compressed row or column format.
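
As a concrete illustration of the coordinate-form idea (purely a sketch, not
BIDMat's actual data structure):

// Store a sparse matrix as a flat array of (row, col, value) entries so it can
// be split evenly by element count rather than by rows or columns, which is
// very uneven for power-law data.
final case class CooMatrix(rows: Int, cols: Int, entries: Array[(Int, Int, Double)]) {
  /** Split into at most `n` chunks with (nearly) equal numbers of non-zeros. */
  def splitByElements(n: Int): Seq[CooMatrix] =
    entries.grouped(math.max(1, (entries.length + n - 1) / n))
           .map(chunk => CooMatrix(rows, cols, chunk))
           .toSeq
}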

Other than that, rooflining gives you a goal that you should be able to 
reach. If you aren't at the limit, just knowing that gives you a target 
to aim at. You can try profiling the kernel to figure out why it's slower 
than it should be. There are a few common reasons (low occupancy, 
imbalanced thread blocks, thread divergence) that you can discover with 
the profiler. Then hopefully you can solve them.

-John


On 3/12/2015 10:56 PM, rxin [via Apache Spark Developers List] wrote:
> Thanks for chiming in, John. I missed your meetup last night - do you 
> have
> any writeups or slides about roofline design? In particular, I'm curious
> about what optimizations are available for power-law dense * sparse? (I
> don't have any background in optimizations)
>
>
>
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <[hidden email] 
> > wrote:
>
> > If you're contemplating GPU acceleration in Spark, its important to 
> look
> > beyond BLAS. Dense BLAS probably account for only 10% of the cycles 
> in the
> > datasets we've tested in BIDMach, and we've tried to make them
> > representative of industry machine learning workloads. Unless you're
> > crunching images or audio, the majority of data will be very sparse and
> > power law distributed. You need a good sparse BLAS, and in practice it
> > seems
> > like you need a sparse BLAS tailored for power-law data. We had to 
> write
> > our
> > own since the NVIDIA libraries didnt perform well on typical power-law
> > data.
> > Intel MKL sparse BLAS also have issues and we only use some of them.
> >
> > You also need 2D reductions, scan operations, slicing, element-wise
> > transcendental functions and operators, many kinds of sort, random 
> number
> > generators etc, and some kind of memory management strategy. Some of 
> this
> > was layered on top of Thrust in BIDMat, but most had to be written from
> > scratch. Its all been rooflined, typically to memory throughput of 
> current
> > GPUs (around 200 GB/s).
> >
> > When you have all this you can write Learning Algorithms in the same
> > high-level primitives available in Breeze or Numpy/Scipy. Its 
> literally the
> > same in BIDMat, since the generic matrix operations are implemented 
> on both
> > CPU and GPU, so the same code runs on either platform.
> >
> > A lesser known fact is that GPUs are around 10x faster for *all* those
> > operations, not just dense BLAS. Its mostly due to faster streaming 
> memory
> > speeds, but some kernels (random number generation and 
> transcendentals) are
> > more than an order of magnitude thanks to some specialized hardware for
> > power series on the GPU chip.
> >
> > When you have all this there is no need to move data back and forth 
> across
> > the PCI bus. The CPU only has to pull chunks of data off disk, 
> unpack them,
> > and feed them to the available GPUs. Most models fit comfortably in GPU
> > memory these days (4-12 GB). With minibatch algorithms you can push 
> TBs of
> > data through the GPU this way.
> >
> >
> >
> > --
> > View this message in context:
> > 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: [hidden email] 
> 
> > For additional commands, e-mail: [hidden email] 
> 
> >
> >
>
>

[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All,

I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
the fourth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-3-0.html
[2] http://spark.apache.org/downloads.html




Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Kushal Datta
Kudos to the whole team for such a significant achievement!

On Fri, Mar 13, 2015 at 10:00 AM, Patrick Wendell 
wrote:

> Hi All,
>
> I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
> the fourth release on the API-compatible 1.X line. It is Spark's
> largest release ever, with contributions from 172 developers and more
> than 1,000 commits!
>
> Visit the release notes [1] to read about the new features, or
> download [2] the release today.
>
> For errata in the contributions or release notes, please e-mail me
> *directly* (not on-list).
>
> Thanks to everyone who helped work on this release!
>
> [1] http://spark.apache.org/releases/spark-release-1-3-0.html
> [2] http://spark.apache.org/downloads.html
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


extended jenkins downtime monday, march 16th, plus some hints at the future

2015-03-13 Thread shane knapp
i'll be taking jenkins down for some much-needed plugin updates, as well as
potentially upgrading jenkins itself.

this will start at 730am PDT, and i'm hoping to have everything up by noon.

the move to the anaconda python will take place in the next couple of weeks
as i'm in the process of rebuilding my staging environment (much needed) to
better reflect production, and allow me to better test the change.

and finally, some teasers for what's coming up in the next month or so:

* move to a fully puppetized environment (yay no more shell script
deployments!)
* virtualized workers (including multiple OSes -- OS X, ubuntu, ...,
profit?)

more details as they come.

happy friday!

shane


Re: May we merge into branch-1.3 at this point?

2015-03-13 Thread Patrick Wendell
Hey Sean,

Yes, go crazy. Once we close the release vote, it's open season to
merge backports into that release.

- Patrick

On Fri, Mar 13, 2015 at 9:31 AM, Mridul Muralidharan  wrote:
> Who is managing 1.3 release ? You might want to coordinate with them before
> porting changes to branch.
>
> Regards
> Mridul
>
> On Friday, March 13, 2015, Sean Owen  wrote:
>
>> Yeah, I'm guessing that is all happening quite literally as we speak.
>> The Apache git tag is the one of reference:
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> Open season on 1.3 branch then...
>>
>> On Fri, Mar 13, 2015 at 4:20 PM, Nicholas Chammas
>> > wrote:
>> > Looks like the release is out:
>> > http://spark.apache.org/releases/spark-release-1-3-0.html
>> >
>> > Though, interestingly, I think we are missing the appropriate v1.3.0 tag:
>> > https://github.com/apache/spark/releases
>> >
>> > Nick
>> >
>> > On Fri, Mar 13, 2015 at 6:07 AM Sean Owen > > wrote:
>> >>
>> >> Is the release certain enough that we can resume merging into
>> >> branch-1.3 at this point? I have a number of back-ports queued up and
>> >> didn't want to merge in case another last RC was needed. I see a few
>> >> commits to the branch though.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>
>>




Re: Spark config option 'expression language' feedback request

2015-03-13 Thread Reynold Xin
This is an interesting idea.

Are there well-known libraries for doing this? Config is the one place
where it would be great to have something ridiculously simple, so it is
more or less bug-free. I'm concerned about the complexity in this patch and
subtle bugs that it might introduce to config options for which users will have
no workarounds. Also, I believe it is fairly hard to propagate nice error
messages when using Scala's parser combinators.


On Fri, Mar 13, 2015 at 3:07 AM, Dale Richardson 
wrote:

>
> PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
> allow for Spark configuration options (whether on command line, environment
> variable or a configuration file) to be specified via a simple expression
> language.
>
>
> Such a feature has the following end-user benefits:
> - Allows for the flexibility in specifying time intervals or byte
> quantities in appropriate and easy to follow units e.g. 1 week rather
> rather then 604800 seconds
>
> - Allows for the scaling of a configuration option in relation to a system
> attributes. e.g.
>
> SPARK_WORKER_CORES = numCores - 1
>
> SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
>
> - Gives the ability to scale multiple configuration options together eg:
>
> spark.driver.memory = 0.75 * physicalMemoryBytes
>
> spark.driver.maxResultSize = spark.driver.memory * 0.8
>
>
> The following functions are currently supported by this PR:
> NumCores: Number of cores assigned to the JVM (usually ==
> Physical machine cores)
> PhysicalMemoryBytes:  Memory size of hosting machine
>
> JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
>
> JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
> JVM
>
> JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
>
>
> I was wondering if anybody on the mailing list has any further ideas on
> other functions that could be useful to have when specifying spark
> configuration options?
> Regards,Dale.
>


PR Builder timing out due to ivy cache lock

2015-03-13 Thread Hari Shreedharan
Looks like something is causing the PR Builder to timeout since this
morning with the ivy cache being locked.

Any idea what is happening?


Re: PR Builder timing out due to ivy cache lock

2015-03-13 Thread shane knapp
link to a build, please?

On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan <
hshreedha...@cloudera.com> wrote:

> Looks like something is causing the PR Builder to timeout since this
> morning with the ivy cache being locked.
>
> Any idea what is happening?
>


jenkins httpd being flaky

2015-03-13 Thread shane knapp
we just started having issues when visiting jenkins and getting 503 service
unavailable errors.

i'm on it and will report back with an all-clear.


Re: PR Builder timing out due to ivy cache lock

2015-03-13 Thread Hari Shreedharan
Here you are:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28571/consoleFull

On Fri, Mar 13, 2015 at 11:58 AM, shane knapp  wrote:

> link to a build, please?
>
> On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan <
> hshreedha...@cloudera.com> wrote:
>
>> Looks like something is causing the PR Builder to timeout since this
>> morning with the ivy cache being locked.
>>
>> Any idea what is happening?
>>
>
>


Re: SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1

2015-03-13 Thread Michael Armbrust
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-6315

On Thu, Mar 12, 2015 at 11:00 PM, Michael Armbrust 
wrote:

> We are looking at the issue and will likely fix it for Spark 1.3.1.
>
> On Thu, Mar 12, 2015 at 8:25 PM, giive chen  wrote:
>
>> Hi all
>>
>> My team has the same issue. It looks like Spark 1.3's sparkSQL cannot read
>> parquet file generated by Spark 1.1. It will cost a lot of migration work
>> when we wanna to upgrade Spark 1.3.
>>
>> Is there  anyone can help me?
>>
>>
>> Thanks
>>
>> Wisely Chen
>>
>>
>> On Tue, Mar 10, 2015 at 5:06 PM, Pei-Lun Lee  wrote:
>>
>> > Hi,
>> >
>> > I found that if I try to read parquet file generated by spark 1.1.1
>> using
>> > 1.3.0-rc3 by default settings, I got this error:
>> >
>> > com.fasterxml.jackson.core.JsonParseException: Unrecognized token
>> > 'StructType': was expecting ('true', 'false' or 'null')
>> >  at [Source: StructType(List(StructField(a,IntegerType,false))); line:
>> 1,
>> > column: 11]
>> > at
>> >
>> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
>> > at
>> >
>> >
>> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508)
>> > at
>> >
>> >
>> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2300)
>> > at
>> >
>> >
>> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1459)
>> > at
>> >
>> >
>> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
>> > at
>> >
>> >
>> com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3105)
>> > at
>> >
>> >
>> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)
>> > at
>> >
>> >
>> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
>> > at
>> org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
>> > at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
>> > at
>> > org.apache.spark.sql.types.DataType$.fromJson(dataTypes.scala:41)
>> > at
>> >
>> >
>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
>> > at
>> >
>> >
>> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
>> >
>> >
>> >
>> > this is how I save parquet file with 1.1.1:
>> >
>> > sql("select 1 as a").saveAsParquetFile("/tmp/foo")
>> >
>> >
>> >
>> > and this is the meta data of the 1.1.1 parquet file:
>> >
>> > creator: parquet-mr version 1.4.3
>> > extra:   org.apache.spark.sql.parquet.row.metadata =
>> > StructType(List(StructField(a,IntegerType,false)))
>> >
>> >
>> >
>> > by comparison, this is 1.3.0 meta:
>> >
>> > creator: parquet-mr version 1.6.0rc3
>> > extra:   org.apache.spark.sql.parquet.row.metadata =
>> > {"type":"struct","fields":[{"name":"a","type":"integer","nullable":t
>> > [more]...
>> >
>> >
>> >
>> > It looks like now ParquetRelation2 is used to load parquet file by
>> default
>> > and it only recognizes JSON format schema but 1.1.1 schema was case
>> class
>> > string format.
>> >
>> > Setting spark.sql.parquet.useDataSourceApi to false will fix it, but I
>> > don't know the differences.
>> > Is this considered a bug? We have a lot of parquet files from 1.1.1,
>> should
>> > we disable data source api in order to read them if we want to upgrade
>> to
>> > 1.3?
>> >
>> > Thanks,
>> > --
>> > Pei-Lun
>> >
>>
>
>
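
For anyone hitting the same issue, here is a minimal sketch of the workaround
mentioned above, using the Spark 1.3 API (the path matches the /tmp/foo example
earlier in the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadOldParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-old-parquet"))
    val sqlContext = new SQLContext(sc)

    // Disable the new data source API so the legacy parquet code path (which
    // understands the old case-class-string schema) is used instead.
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

    val df = sqlContext.parquetFile("/tmp/foo")  // parquet written by Spark 1.1.x
    df.printSchema()
    sc.stop()
  }
}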


Re: jenkins httpd being flaky

2015-03-13 Thread shane knapp
ok we have a few different things happening:

1) httpd on the jenkins master is randomly (though not currently) flaking
out and causing visits to the site to return a 503.  nothing in the logs
shows any problems.

2) there are some github timeouts, which i tracked down and think it's a
problem with github themselves (see:  https://status.github.com/ and scroll
down to 'mean hook delivery time')

3) we have one spark job w/a strange ivy lock issue, that i just
retriggered (https://github.com/apache/spark/pull/4964)

4) there's an errant, unkillable pull request builder job (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console
)

more updates forthcoming.

On Fri, Mar 13, 2015 at 12:04 PM, shane knapp  wrote:

> we just started having issues when visiting jenkins and getting 503
> service unavailable errors.
>
> i'm on it and will report back with an all-clear.
>


Spark ThriftServer encounter java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-03-13 Thread Andrew Lee
When Kerberos is enabled, I get the following exceptions (Spark 1.2.1 git commit 
b6eaf77d4332bfb0a698849b1f5f917d20d70e97, Hive 0.13.1, Apache Hadoop 2.4.1) 
when starting Spark ThriftServer.
Command to start the thrift server:
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client
Error message in spark.log

2015-03-13 18:26:05,363 ERROR 
org.apache.hive.service.cli.thrift.ThriftCLIService 
(ThriftBinaryCLIService.java:run(93)) - Error: 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
at 
org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
at 
org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
at 
org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
at java.lang.Thread.run(Thread.java:744)

I'm wondering if this is due to the same problem described in HIVE-8154 and 
HIVE-7620, due to an older code base for the Spark ThriftServer?
Any insights are appreciated. Currently, I can't get Spark ThriftServer to run 
against a Kerberos cluster (Apache Hadoop 2.4.1).

My hive-site.xml looks like the following for spark/conf.

<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.execute.setugi</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.stats.autogather</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.session.history.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/home/hive/log/${user.name}</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive/scratch/${user.name}</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://somehostname:9083</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>***</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>***</value>
  </property>
  <property>
    <name>hive.server2.thrift.sasl.qop</name>
    <value>auth</value>
    <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
  </property>
  <property>
    <name>hive.server2.enable.impersonation</name>
    <description>Enable user impersonation for HiveServer2</description>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>***</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>***</value>
  </property>
  <property>
    <name>hive.metastore.cache.pinobjtypes</name>
    <value>Table,Database,Type,FieldSchema,Order</value>
  </property>
  <property>
    <name>hdfs_sentinel_file</name>
    <value>***</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/hive</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>600</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
</configuration>

Re: jenkins httpd being flaky

2015-03-13 Thread shane knapp
i tried a couple of things, but will also be doing a jenkins reboot as soon
as the current batch of builds finish.



On Fri, Mar 13, 2015 at 12:40 PM, shane knapp  wrote:

> ok we have a few different things happening:
>
> 1) httpd on the jenkins master is randomly (though not currently) flaking
> out and causing visits to the site to return a 503.  nothing in the logs
> shows any problems.
>
> 2) there are some github timeouts, which i tracked down and think it's a
> problem with github themselves (see:  https://status.github.com/ and
> scroll down to 'mean hook delivery time')
>
> 3) we have one spark job w/a strange ivy lock issue, that i just
> retriggered (https://github.com/apache/spark/pull/4964)
>
> 4) there's an errant, unkillable pull request builder job (
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console
> )
>
> more updates forthcoming.
>
> On Fri, Mar 13, 2015 at 12:04 PM, shane knapp  wrote:
>
>> we just started having issues when visiting jenkins and getting 503
>> service unavailable errors.
>>
>> i'm on it and will report back with an all-clear.
>>
>
>


Re: jenkins httpd being flaky

2015-03-13 Thread shane knapp
ok, things seem to have stabilized...  httpd hasn't flaked since ~noon, the
hanging PRB job on amp-jenkins-worker-06 was removed w/the restart and
things are now building.

i cancelled and retriggered a bunch of PRB builds, btw:
4848 (https://github.com/apache/spark/pull/3699)
5922 (https://github.com/apache/spark/pull/4733)
5987 (https://github.com/apache/spark/pull/4986)
6222 (https://github.com/apache/spark/pull/4964)
6325 (https://github.com/apache/spark/pull/5018)

as well as:
spark-master-maven-with-yarn

sorry for the inconvenience...  i'm still a little stumped as to what
happened, but i think it was a confluence of events (httpd flaking,
problems at github, mercury in retrograde, friday thinking it's monday).

shane

On Fri, Mar 13, 2015 at 1:08 PM, shane knapp  wrote:

> i tried a couple of things, but will also be doing a jenkins reboot as
> soon as the current batch of builds finish.
>
>
>
> On Fri, Mar 13, 2015 at 12:40 PM, shane knapp  wrote:
>
>> ok we have a few different things happening:
>>
>> 1) httpd on the jenkins master is randomly (though not currently) flaking
>> out and causing visits to the site to return a 503.  nothing in the logs
>> shows any problems.
>>
>> 2) there are some github timeouts, which i tracked down and think it's a
>> problem with github themselves (see:  https://status.github.com/ and
>> scroll down to 'mean hook delivery time')
>>
>> 3) we have one spark job w/a strange ivy lock issue, that i just
>> retriggered (https://github.com/apache/spark/pull/4964)
>>
>> 4) there's an errant, unkillable pull request builder job (
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28574/console
>> )
>>
>> more updates forthcoming.
>>
>> On Fri, Mar 13, 2015 at 12:04 PM, shane knapp 
>> wrote:
>>
>>> we just started having issues when visiting jenkins and getting 503
>>> service unavailable errors.
>>>
>>> i'm on it and will report back with an all-clear.
>>>
>>
>>
>


Re: PR Builder timing out due to ivy cache lock

2015-03-13 Thread shane knapp
i'm thinking that this was something transient, and hopefully won't happen
again.  a ton of weird stuff happened around the time of this failure (see
my flaky httpd email), and this was the only build exhibiting this behavior.

i'll keep an eye out for this failure over the weekend...



On Fri, Mar 13, 2015 at 12:03 PM, Hari Shreedharan <
hshreedha...@cloudera.com> wrote:

> Here you are:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28571/consoleFull
>
> On Fri, Mar 13, 2015 at 11:58 AM, shane knapp  wrote:
>
>> link to a build, please?
>>
>> On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan <
>> hshreedha...@cloudera.com> wrote:
>>
>>> Looks like something is causing the PR Builder to timeout since this
>>> morning with the ivy cache being locked.
>>>
>>> Any idea what is happening?
>>>
>>
>>
>


RE: Spark config option 'expression language' feedback request

2015-03-13 Thread Dale Richardson



Hi Reynold,

They are some very good questions.

Re: Known libraries
There are a number of well-known libraries that we could use to implement this 
feature, including MVEL, OGNL and JBoss EL, or even Spring's EL. I looked at 
using them to prototype this feature in the beginning, but they all ended up 
bringing in a lot of code to service a pretty small functional requirement. The 
prime requirement I was trying to meet was:
1. Be able to specify quantities in kb, mb, gb etc. transparently.
2. Be able to specify some options as fractions of system attributes, e.g. cpuCores * 0.8
By just implementing this functionality and nothing else, I figured I was 
constraining things enough that end-users got useful functionality, but not 
enough functionality to shoot themselves in the foot in new and interesting 
ways. I couldn't see a nice way of limiting the expressiveness of 3rd-party 
libraries to this extent.
I'd be happy to re-look at the feasibility of pulling in one of the 3rd-party 
libraries if you think this approach has more merit, but I do caution that we 
may be opening a Pandora's box of potential functionality. Those 3rd-party 
libraries have a lot of (potentially excess) functionality in them.

Re: Code complexity
I wrote the bare minimum code I could come up with to service the 
above-mentioned functionality, and then refactored it to use a stacked-traits 
pattern, which increased the code size by about a further 30%. The expression 
code as it stands is pretty minimal, and has more than 120 unit tests proving 
its functionality. More than half the code that is there is taken up by utility 
classes to allow easy reference to byte quantities and time units. The design 
was deliberately limited to meeting the above requirements and not much more, 
to reduce the chance for other subtleties to raise their heads.

Re: Workarounds
It would be pretty simple to implement fallback functionality to disable 
expression parsing by:
1. Globally having a configuration option to disable all expression parsing and 
fall back to simple Java property parsing.
2. Locally having a known prefix that disables expression parsing for that 
option (sketched below).
This should give enough workarounds to keep things running in the unlikely 
event that something crops up, no matter what happens.

Re: Error messages
In regards to your comment about nice error messages, I would have to agree 
with you, it would have been nice. In the end I just return an Option[Double] 
to the calling code for the parsed expression if the entire string is parsed 
correctly. Given the additional complexity adding error messages involved, I 
retrospectively justify this by asking how much info you need to debug an 
expression like 'cpuCores * 0.8'. :)

Thanks for the feedback.

Regards,
Dale.
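
Below is a tiny sketch (names hypothetical, not from PR#4937) of the "known
prefix" fallback described under Re: Workarounds above: a value carrying a
"literal:" prefix bypasses expression evaluation entirely.

object ConfigFallback {
  // Stand-in for the real expression evaluator; returns None on any failure.
  private def evalExpression(s: String): Option[Double] =
    scala.util.Try(s.toDouble).toOption

  // A value prefixed with "literal:" (hypothetical prefix) is passed through
  // untouched; anything else is evaluated, falling back to the raw string.
  def resolve(raw: String): String =
    if (raw.startsWith("literal:")) raw.stripPrefix("literal:")
    else evalExpression(raw).map(_.toString).getOrElse(raw)
}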
> From: r...@databricks.com
> Date: Fri, 13 Mar 2015 11:26:44 -0700
> Subject: Re: Spark config option 'expression language' feedback request
> To: dale...@hotmail.com
> CC: dev@spark.apache.org
> 
> This is an interesting idea.
> 
> Are there well known libraries for doing this? Config is the one place
> where it would be great to have something ridiculously simple, so it is
> more or less bug free. I'm concerned about the complexity in this patch and
> subtle bugs that it might introduce to config options that users will have
> no workarounds. Also I believe it is fairly hard for nice error messages to
> propagate when using Scala's parser combinator.
> 
> 
> On Fri, Mar 13, 2015 at 3:07 AM, Dale Richardson 
> wrote:
> 
> >
> > PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
> > allow for Spark configuration options (whether on command line, environment
> > variable or a configuration file) to be specified via a simple expression
> > language.
> >
> >
> > Such a feature has the following end-user benefits:
> > - Allows for the flexibility in specifying time intervals or byte
> > quantities in appropriate and easy to follow units e.g. 1 week rather
> > rather then 604800 seconds
> >
> > - Allows for the scaling of a configuration option in relation to a system
> > attributes. e.g.
> >
> > SPARK_WORKER_CORES = numCores - 1
> >
> > SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
> >
> > - Gives the ability to scale multiple configuration options together eg:
> >
> > spark.driver.memory = 0.75 * physicalMemoryBytes
> >
> > spark.driver.maxResultSize = spark.driver.memory * 0.8
> >
> >
> > The following functions are currently supported by this PR:
> > NumCores: Number of cores assigned to the JVM (usually ==
> > Physical machine cores)
> > PhysicalMemoryBytes:  Memory size of hosting machine
> >
> > JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
> >
> > JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
> > JVM
> >
> > JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
> >
> >
> > I was wondering if anybody on the mailing list has any further ideas on
> > other functions that could be useful to have when speci

RE: Spark config option 'expression language' feedback request

2015-03-13 Thread Dale Richardson



Thanks for your questions, Mridul.

I assume you are referring to how the functionality to query system state works 
in YARN and Mesos? The APIs used are the standard JVM APIs, so the functionality 
will work without change. There is no real use case for using 
'physicalMemoryBytes' in these cases though, as the JVM size has already been 
limited by the resource manager.

Regards,
Dale.
> Date: Fri, 13 Mar 2015 08:20:33 -0700
> Subject: Re: Spark config option 'expression language' feedback request
> From: mri...@gmail.com
> To: dale...@hotmail.com
> CC: dev@spark.apache.org
> 
> I am curious how you are going to support these over mesos and yarn.
> Any configure change like this should be applicable to all of them, not
> just local and standalone modes.
> 
> Regards
> Mridul
> 
> On Friday, March 13, 2015, Dale Richardson  wrote:
> 
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
> > allow for Spark configuration options (whether on command line, environment
> > variable or a configuration file) to be specified via a simple expression
> > language.
> >
> >
> > Such a feature has the following end-user benefits:
> > - Allows for the flexibility in specifying time intervals or byte
> > quantities in appropriate and easy to follow units e.g. 1 week rather
> > rather then 604800 seconds
> >
> > - Allows for the scaling of a configuration option in relation to a system
> > attributes. e.g.
> >
> > SPARK_WORKER_CORES = numCores - 1
> >
> > SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
> >
> > - Gives the ability to scale multiple configuration options together eg:
> >
> > spark.driver.memory = 0.75 * physicalMemoryBytes
> >
> > spark.driver.maxResultSize = spark.driver.memory * 0.8
> >
> >
> > The following functions are currently supported by this PR:
> > NumCores: Number of cores assigned to the JVM (usually ==
> > Physical machine cores)
> > PhysicalMemoryBytes:  Memory size of hosting machine
> >
> > JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
> >
> > JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
> > JVM
> >
> > JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
> >
> >
> > I was wondering if anybody on the mailing list has any further ideas on
> > other functions that could be useful to have when specifying spark
> > configuration options?
> > Regards,Dale.
> >

  

Re: Spark config option 'expression language' feedback request

2015-03-13 Thread Mridul Muralidharan
Let me try to rephrase my query.
How can a user specify, for example, what the executor memory or the
number of cores should be?

I don't want a situation where some variables can be specified using
one set of idioms (from this PR, for example) and another set cannot
be.


Regards,
Mridul




On Fri, Mar 13, 2015 at 4:06 PM, Dale Richardson  wrote:
>
>
>
> Thanks for your questions Mridul.
> I assume you are referring to how the functionality to query system state 
> works in Yarn and Mesos?
> The API's used are the standard JVM API's so the functionality will work 
> without change. There is no real use case for using 'physicalMemoryBytes' in 
> these cases though, as the JVM size has already been limited by the resource 
> manager.
> Regards,Dale.
>> Date: Fri, 13 Mar 2015 08:20:33 -0700
>> Subject: Re: Spark config option 'expression language' feedback request
>> From: mri...@gmail.com
>> To: dale...@hotmail.com
>> CC: dev@spark.apache.org
>>
>> I am curious how you are going to support these over mesos and yarn.
>> Any configure change like this should be applicable to all of them, not
>> just local and standalone modes.
>>
>> Regards
>> Mridul
>>
>> On Friday, March 13, 2015, Dale Richardson  wrote:
>>
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
>> > allow for Spark configuration options (whether on command line, environment
>> > variable or a configuration file) to be specified via a simple expression
>> > language.
>> >
>> >
>> > Such a feature has the following end-user benefits:
>> > - Allows for the flexibility in specifying time intervals or byte
>> > quantities in appropriate and easy to follow units e.g. 1 week rather
>> > rather then 604800 seconds
>> >
>> > - Allows for the scaling of a configuration option in relation to a system
>> > attributes. e.g.
>> >
>> > SPARK_WORKER_CORES = numCores - 1
>> >
>> > SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
>> >
>> > - Gives the ability to scale multiple configuration options together eg:
>> >
>> > spark.driver.memory = 0.75 * physicalMemoryBytes
>> >
>> > spark.driver.maxResultSize = spark.driver.memory * 0.8
>> >
>> >
>> > The following functions are currently supported by this PR:
>> > NumCores: Number of cores assigned to the JVM (usually ==
>> > Physical machine cores)
>> > PhysicalMemoryBytes:  Memory size of hosting machine
>> >
>> > JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
>> >
>> > JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
>> > JVM
>> >
>> > JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
>> >
>> >
>> > I was wondering if anybody on the mailing list has any further ideas on
>> > other functions that could be useful to have when specifying spark
>> > configuration options?
>> > Regards,Dale.
>> >
>
>




RE: Spark config option 'expression language' feedback request

2015-03-13 Thread Dale Richardson
Mridul,
I may have added some confusion by giving examples in completely different 
areas. For example, the number of cores available for tasking on each worker 
machine is a resource-controller-level configuration variable. In standalone 
mode (i.e. using Spark's home-grown resource manager) the configuration variable 
SPARK_WORKER_CORES is an item that Spark admins can set (and we can use 
expressions for). The equivalent variable for YARN 
(yarn.nodemanager.resource.cpu-vcores) is only used by YARN's node manager setup 
and is set by YARN administrators, outside of the control of Spark (and most 
users). If you are not a cluster administrator then both variables are 
irrelevant to you. The same goes for SPARK_WORKER_MEMORY.

As for spark.executor.memory: as there is no way to know the attributes of a 
machine before a task is allocated to it, we cannot use any of the JVM info 
functions. For options like that, the expression parser can easily be limited 
to supporting different byte units of scale (kb/mb/gb etc.) and other 
configuration variables only.
Regards,
Dale.
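
A sketch of what that restricted byte-unit parsing could look like (suffix set
and semantics are illustrative only):

object ByteUnits {
  private val units = Map(
    "kb" -> (1L << 10), "mb" -> (1L << 20), "gb" -> (1L << 30), "tb" -> (1L << 40))

  // "1.5 gb" -> Some(1610612736); a plain number is taken as bytes.
  def parse(s: String): Option[Long] = scala.util.Try {
    val trimmed = s.trim.toLowerCase
    units.collectFirst {
      case (suffix, mult) if trimmed.endsWith(suffix) =>
        (trimmed.dropRight(suffix.length).trim.toDouble * mult).toLong
    }.getOrElse(trimmed.toLong)
  }.toOption
}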




> Date: Fri, 13 Mar 2015 17:30:51 -0700
> Subject: Re: Spark config option 'expression language' feedback request
> From: mri...@gmail.com
> To: dale...@hotmail.com
> CC: dev@spark.apache.org
> 
> Let me try to rephrase my query.
> How can a user specify, for example, what the executor memory should
> be or number of cores should be.
> 
> I dont want a situation where some variables can be specified using
> one set of idioms (from this PR for example) and another set cannot
> be.
> 
> 
> Regards,
> Mridul
> 
> 
> 
> 
> On Fri, Mar 13, 2015 at 4:06 PM, Dale Richardson  wrote:
> >
> >
> >
> > Thanks for your questions Mridul.
> > I assume you are referring to how the functionality to query system state 
> > works in Yarn and Mesos?
> > The API's used are the standard JVM API's so the functionality will work 
> > without change. There is no real use case for using 'physicalMemoryBytes' 
> > in these cases though, as the JVM size has already been limited by the 
> > resource manager.
> > Regards,Dale.
> >> Date: Fri, 13 Mar 2015 08:20:33 -0700
> >> Subject: Re: Spark config option 'expression language' feedback request
> >> From: mri...@gmail.com
> >> To: dale...@hotmail.com
> >> CC: dev@spark.apache.org
> >>
> >> I am curious how you are going to support these over mesos and yarn.
> >> Any configure change like this should be applicable to all of them, not
> >> just local and standalone modes.
> >>
> >> Regards
> >> Mridul
> >>
> >> On Friday, March 13, 2015, Dale Richardson  wrote:
> >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
> >> > allow for Spark configuration options (whether on command line, 
> >> > environment
> >> > variable or a configuration file) to be specified via a simple expression
> >> > language.
> >> >
> >> >
> >> > Such a feature has the following end-user benefits:
> >> > - Allows for the flexibility in specifying time intervals or byte
> >> > quantities in appropriate and easy to follow units e.g. 1 week rather
> >> > rather then 604800 seconds
> >> >
> >> > - Allows for the scaling of a configuration option in relation to a 
> >> > system
> >> > attributes. e.g.
> >> >
> >> > SPARK_WORKER_CORES = numCores - 1
> >> >
> >> > SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
> >> >
> >> > - Gives the ability to scale multiple configuration options together eg:
> >> >
> >> > spark.driver.memory = 0.75 * physicalMemoryBytes
> >> >
> >> > spark.driver.maxResultSize = spark.driver.memory * 0.8
> >> >
> >> >
> >> > The following functions are currently supported by this PR:
> >> > NumCores: Number of cores assigned to the JVM (usually ==
> >> > Physical machine cores)
> >> > PhysicalMemoryBytes:  Memory size of hosting machine
> >> >
> >> > JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
> >> >
> >> > JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
> >> > JVM
> >> >
> >> > JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
> >> >
> >> >
> >> > I was wondering if anybody on the mailing list has any further ideas on
> >> > other functions that could be useful to have when specifying spark
> >> > configuration options?
> >> > Regards,Dale.
> >> >
> >
> >
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>