Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
ok so let me try again ;-)

I don't think that the page size calculation matters apart from hitting the
allocation limit earlier if the page size is too large.

If a task is going to need X bytes, it is going to need X bytes. In this
case, for at least one of the tasks, X > maxmemory/no_active_tasks at some
point during execution. A smaller page size may use the memory more
efficiently but would not necessarily avoid this issue.

The next question would be: is the memory limit per task of
max_memory/no_active_tasks reasonable? It seems fair, but currently an
exception is thrown when this limit is reached; perhaps the task could instead
wait for no_active_tasks to decrease?
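
As a purely illustrative sketch (not Spark's ShuffleMemoryManager, and every
name here is made up), a pool that enforces the 1/N cap but makes a task wait
for others to release memory, instead of throwing, could look like:

class PerTaskMemoryPool(maxMemory: Long) {
  private val granted = scala.collection.mutable.Map.empty[Long, Long] // taskId -> bytes granted

  // Grant up to numBytes, never exceeding maxMemory / numActiveTasks per task;
  // block instead of failing when the task is already at its share.
  def acquire(taskId: Long, numBytes: Long): Long = synchronized {
    granted.getOrElseUpdate(taskId, 0L)
    var grantedNow = 0L
    while (grantedNow == 0L) {
      val cap = maxMemory / granted.size // 1/N of the pool per active task
      grantedNow = math.max(0L, math.min(numBytes, cap - granted(taskId)))
      if (grantedNow == 0L) wait()       // wait for another task to release its memory
    }
    granted(taskId) += grantedNow
    grantedNow
  }

  def releaseAll(taskId: Long): Unit = synchronized {
    granted.remove(taskId)
    notifyAll()                          // wake up any waiting tasks
  }
}

Whether blocking like this is acceptable (it trades the exception for possible
long waits, or even deadlock if every active task is waiting) is really the
question above.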

I think what causes my test issue is that the 32 tasks don't execute as
quickly on my 8-core box, so more are active at any one time.

I will experiment with the page size calculation to see what effect it has.

Cheers,



On 16 September 2015 at 06:53, Reynold Xin  wrote:

> It is exactly the issue here, isn't it?
>
> We are using memory / N, where N should be the maximum number of active
> tasks. In the current master, we use the number of cores to approximate the
> number of tasks -- but it turned out to be a bad approximation in tests
> because it is set to 32 to increase concurrency.
>
>
> On Tue, Sep 15, 2015 at 10:47 PM, Pete Robbins 
> wrote:
>
>> Oops... I meant to say "The page size calculation is NOT the issue here"
>>
>> On 16 September 2015 at 06:46, Pete Robbins  wrote:
>>
>>> The page size calculation is the issue here as there is plenty of free
>>> memory, although there is maybe a fair bit of wasted space in some pages.
>>> It is that when we have a lot of tasks, each is only allowed to reach 1/n of
>>> the available memory, and several of the tasks bump into that limit. With
>>> 4 times as many tasks as cores there will be some contention, so they
>>> remain active for longer.
>>>
>>> So I think this is a test-case issue: the number of executor cores is
>>> configured too high.
>>>
>>> On 15 September 2015 at 18:54, Reynold Xin  wrote:
>>>
 Maybe we can change the heuristics in memory calculation to use
 SparkContext.defaultParallelism if it is local mode.


 On Tue, Sep 15, 2015 at 10:28 AM, Pete Robbins 
 wrote:

> Yes, and at least there is an override: setting
> spark.sql.test.master to local[8]; in fact local[16] worked on my 8-core
> box.
>
> I'm happy to use this as a workaround, but the hard-coded 32 will fail the
> build/tests on a clean checkout if you only have 8 cores.
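
A minimal sketch of sizing that override from the machine instead of
hard-coding it (illustrative only; it assumes the property is read via
System.getProperty as in the TestHive snippet quoted below, would have to run
before TestHive is first initialized, and the x2 factor is arbitrary):

  System.setProperty("spark.sql.test.master",
    s"local[${Runtime.getRuntime.availableProcessors() * 2}]")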
>
> On 15 September 2015 at 17:40, Marcelo Vanzin 
> wrote:
>
>> That test explicitly sets the number of executor cores to 32.
>>
>> object TestHive
>>   extends TestHiveContext(
>> new SparkContext(
>>   System.getProperty("spark.sql.test.master", "local[32]"),
>>
>>
>> On Mon, Sep 14, 2015 at 11:22 PM, Reynold Xin 
>> wrote:
>> > Yea I think this is where the heuristic is failing -- it uses 8 cores to
>> > approximate the number of active tasks, but the tests somehow are using 32
>> > (maybe because it explicitly sets it to that, or you set it yourself? I'm
>> > not sure which one)
>> >
>> > On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins 
>> wrote:
>> >>
>> >> Reynold, thanks for replying.
>> >>
>> >> getPageSize parameters: maxMemory=515396075, numCores=0
>> >> Calculated values: cores=8, default=4194304
>> >>
>> >> So am I getting a large page size as I only have 8 cores?
>> >>
>> >> On 15 September 2015 at 00:40, Reynold Xin 
>> wrote:
>> >>>
>> >>> Pete - can you do me a favor?
>> >>>
>> >>>
>> >>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>> >>>
>> >>> Print the parameters that are passed into the getPageSize
>> function, and
>> >>> check their values.
>> >>>
>> >>> On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin 
>> wrote:
>> 
>>  Is this on latest master / branch-1.5?
>> 
>>  out of the box we reserve only 16% (0.2 * 0.8) of the memory for
>>  execution (e.g. aggregate, join) / shuffle sorting. With a 3GB
>> heap, that's
>>  480MB. So each task gets 480MB / 32 = 15MB, and each operator
>> reserves at
>>  least one page for execution. If your page size is 4MB, it only
>> takes 3
>>  operators to use up its memory.
>> 
>>  The thing is page size is dynamically determined -- and in your
>> case it
>>  should be smaller than 4MB.
>> 
>> 

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
I see what you are saying. Full stack trace:

java.io.IOException: Unable to acquire 4194304 bytes of memory
  at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
  at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:349)
  at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:478)
  at
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
  at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
  at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
  at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
  at
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
  at
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.lang.Thread.run(Thread.java:785)

On 16 September 2015 at 09:30, Reynold Xin  wrote:

> Can you paste the entire stacktrace of the error? In your original email
> you only included the last function call.
>
> Maybe I'm missing something here, but I still think the bad heuristic is
> the issue.
>
> Some operators pre-reserve memory before running anything in order to
> avoid starvation. For example, imagine we have an aggregate followed by a
> sort. If the aggregate is very high cardinality, and uses up all the memory
> and even starts spilling (falling back to sort-based aggregate), there
> isn't memory available at all for the sort operator to use. To work around
> this, each operator reserves a page of memory before they process any data.
>
> Page size is computed by Spark using:
>
> the total amount of execution memory / (maximum number of active tasks *
> 16)
>
> and then rounded up to the next power of 2, and capped between 1MB and 64MB.
>
> That is to say, in the worst case, we should be able to reserve at least 8
> pages per task (the factor of 16, halved at worst by rounding the page size
> up to the next power of 2).
>
> However, in your case, the max number of active tasks is 32 (set by test
> env), while the page size is determined using # cores (8 in your case). So
> it is off by a factor of 4. As a 
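
A rough sketch of the heuristic described above (the constants and the
power-of-2 rounding are restated from this email, not copied from Spark's
source):

def pageSizeFor(maxExecutionMemory: Long, maxActiveTasks: Int): Long = {
  val minPage = 1L * 1024 * 1024   // 1MB floor
  val maxPage = 64L * 1024 * 1024  // 64MB ceiling
  val target = maxExecutionMemory / (maxActiveTasks * 16L)
  // round up to the next power of 2
  val rounded = java.lang.Long.highestOneBit(math.max(target, 1L) * 2 - 1)
  math.min(maxPage, math.max(minPage, rounded))
}

With 480MB of execution memory this gives 1MB pages for 32 active tasks but
4MB pages for the 8 assumed cores, matching the numbers reported earlier in
the thread.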

JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-16 Thread Juan Rodríguez Hortalá
Hi,

Sorry to insist, but does anyone have any thoughts on this? Or could someone at
least point me to documentation for DStream.compute() so I can understand when
I should return None for a batch?
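
For reference, a minimal, illustrative sketch of a queue-backed InputDStream
whose compute() returns None when a batch has no data (this is not Spark's
documented contract, and depending on member visibility the class may need to
live in the org.apache.spark.streaming.dstream package, as the linked
DynSeqQueueInputDStream does):

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Hand out one queued batch per interval; None when the queue is empty.
class SimpleQueueInputDStream[A: ClassTag](
    ssc: StreamingContext,
    private var batches: List[Seq[A]] = Nil)
  extends InputDStream[A](ssc) {

  def addBatch(batch: Seq[A]): Unit = this.synchronized { batches = batches :+ batch }

  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[A]] = this.synchronized {
    batches match {
      case batch :: rest =>
        batches = rest
        Some(ssc.sparkContext.parallelize(batch)) // data available for this interval
      case Nil =>
        None                                      // no data for this batch interval
    }
  }
}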

Thanks

Juan


2015-09-14 20:51 GMT+02:00 Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com>:

> Hi,
>
> I sent this message to the user list a few weeks ago with no luck, so I'm
> forwarding it to the dev list in case someone can lend a hand with this.
> Thanks a lot in advance.
>
>
> I've developed a ScalaCheck property for testing Spark Streaming
> transformations. To do that I had to develop a custom InputDStream, which
> is very similar to QueueInputDStream but has a method for adding new test
> cases for dstreams (objects of type Seq[Seq[A]]) to the DStream.
> You can see the code at
> https://github.com/juanrh/sscheck/blob/32c2bff66aa5500182e0162a24ecca6d47707c42/src/main/scala/org/apache/spark/streaming/dstream/DynSeqQueueInputDStream.scala.
> I have developed a few properties that run in local mode
> https://github.com/juanrh/sscheck/blob/32c2bff66aa5500182e0162a24ecca6d47707c42/src/test/scala/es/ucm/fdi/sscheck/spark/streaming/ScalaCheckStreamingTest.scala.
> The problem is that when the batch interval is too small, and the machine
> cannot complete the batches fast enough, I get the following exceptions in
> the Spark log
>
> 15/08/26 11:22:02 ERROR JobScheduler: Error generating jobs for time
> 1440580922500 ms
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$count$1$$anonfun$apply$18.apply(DStream.scala:587)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$count$1$$anonfun$apply$18.apply(DStream.scala:587)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:654)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:654)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:668)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:666)
> at
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:41)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
> at
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
> at
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> 

Re: RDD API patterns

2015-09-16 Thread robineast
I'm not sure the problem is quite as bad as you state. Both sampleByKey and
sampleByKeyExact are implemented using a function from
StratifiedSamplingUtils which does one of two things depending on whether
the exact implementation is needed. The exact version requires double the
number of lines of code (17) compared to the non-exact version, and has to do
extra passes over the data to get, for example, the counts per key.

As far as I can see your problem 2 and sampleByKeyExact are very similar and
could be solved the same way. It has been decided that sampleByKeyExact is a
widely useful function and so is provided out of the box as part of the
PairRDD API. I don't see any reason why your problem 2 couldn't be provided
in the same way as part of the API if there were demand for it.

An alternative design would perhaps be something like an extension to
PairRDD, let's call it TwoPassPairRDD, where certain per-key information
could be provided along with the Iterable, e.g. the counts for the key. Both
sampleByKeyExact and your problem 2 could then be implemented in a few fewer
lines of code.
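
As a rough sketch (names are illustrative, not an existing Spark API), such a
two-pass helper could compute per-key information in a first pass and hand it
to the second:

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

object TwoPassOps {
  // Pass 1: per-key counts, broadcast to the executors.
  // Pass 2: attach the count for its key to every record.
  def withKeyCounts[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): RDD[(K, (V, Long))] = {
    val counts = rdd.sparkContext.broadcast(rdd.countByKey().toMap)
    rdd.map { case (k, v) => (k, (v, counts.value(k))) }
  }
}

Something like sampleByKeyExact, or problem 2, could then read the exact
per-key count in its second pass instead of recomputing it.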




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread shane knapp
good morning, denizens of the aether!

your hard working build system (and some associated infrastructure)
has been in need of some updates and housecleaning for quite a while
now.  we will be splitting the maintenance over two mornings to
minimize impact.

here's the plan:

7am-9am wednesday, 9-23-15  (or 23-9-15 for those not in amurrica):
* firewall taken offline for system and firewall updates
* expected downtime:  maybe an hour, but we'll say two just in case
* this will be done by jkuroda (CCed on this message)

630am-10am thursday, 9-24-15:
* jenkins update to 1.629 (we're a few months behind in versions, and
some big bugs have been fixed)
* jenkins master and worker system package updates
* all systems get a reboot (lots of hanging java processes have been
building up over the months)
* builds will stop being accepted ~630am, and i'll kill any hangers-on
at 730am, and retrigger once we're done
* expected downtime:  3.5 hours or so
* i will also be testing out some of my shiny new ansible playbooks
for the system updates!


please let me know if you have any questions, or requests to postpone
this maintenance.  thanks in advance!

shane & jon

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
So forcing the ShuffleMemoryManager to assume 32 cores, and therefore
calculate a page size of 1MB, passes the tests.

How can we determine the correct value to use in getPageSize rather than
Runtime.getRuntime.availableProcessors()?

On 16 September 2015 at 10:17, Pete Robbins  wrote:

> I see what you are saying. Full stack trace:
>
> java.io.IOException: Unable to acquire 4194304 bytes of memory
>   at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:349)
>   at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:478)
>   at
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
>   at
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
>   at
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
>   at
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>   at
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>   at
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
>
> On 16 September 2015 at 09:30, Reynold Xin  wrote:
>
>> Can you paste the entire stacktrace of the error? In your original email
>> you only included the last function call.
>>
>> Maybe I'm missing something here, but I still think the bad heuristic is
>> the issue.
>>
>> Some operators pre-reserve memory before running anything in order to
>> avoid starvation. For example, imagine we have an aggregate followed by a
>> sort. If the aggregate is very high cardinality, and uses up all the memory
>> and even starts spilling (falling back to sort-based aggregate), there
>> isn't memory available at all for the sort operator to use. To work around
>> this, each operator reserves a page of memory before they process any data.
>>
>> Page size is computed by Spark using:
>>
>> the total amount 

Re: SparkR streaming source code

2015-09-16 Thread Reynold Xin
You should reach out to the speakers directly.


On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong  wrote:

> SparkR streaming is mentioned around page 17 of the PDF below; can anyone
> share the source code? (I could not find it on GitHub.)
>
>
>
> https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
>
>
> Thanks,
>
> Renyi.
>


Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread Reynold Xin
Thanks Shane and Jon for the heads up.

On Wednesday, September 16, 2015, shane knapp  wrote:

> good morning, denizens of the aether!
>
> your hard working build system (and some associated infrastructure)
> has been in need of some updates and housecleaning for quite a while
> now.  we will be splitting the maintenance over two mornings to
> minimize impact.
>
> here's the plan:
>
> 7am-9am wednesday, 9-23-15  (or 23-9-15 for those not in amurrica):
> * firewall taken offline for system and firewall updates
> * expected downtime:  maybe an hour, but we'll say two just in case
> * this will be done by jkuroda (CCed on this message)
>
> 630am-10am thursday, 9-24-15:
> * jenkins update to 1.629 (we're a few months behind in versions, and
> some big bugs have been fixed)
> * jenkins master and worker system package updates
> * all systems get a reboot (lots of hanging java processes have been
> building up over the months)
> * builds will stop being accepted ~630am, and i'll kill any hangers-on
> at 730am, and retrigger once we're done
> * expected downtime:  3.5 hours or so
> * i will also be testing out some of my shiny new ansible playbooks
> for the system updates!
>
>
> please let me know if you have any questions, or requests to postpone
> this maintenance.  thanks in advance!
>
> shane & jon
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>


Communication between executors and drivers

2015-09-16 Thread Muhammad Haseeb Javed
How do executors communicate with the driver in Spark? I understand that
it is done using Akka actors and that messages are exchanged as
CoarseGrainedSchedulerMessage, but I'd really appreciate it if someone could
explain the entire process in a bit more detail.


Spark streaming DStream state on worker

2015-09-16 Thread Renyi Xiong
Hi,

I want to do a temporal join operation on a DStream across RDDs. My question
is: are RDDs from the same DStream always computed on the same worker (except
in the case of failover)?

thanks,
Renyi.


Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread shane knapp
> 630am-10am thursday, 9-24-15:
> * jenkins update to 1.629 (we're a few months behind in versions, and
> some big bugs have been fixed)
> * jenkins master and worker system package updates
> * all systems get a reboot (lots of hanging java processes have been
> building up over the months)
> * builds will stop being accepted ~630am, and i'll kill any hangers-on
> at 730am, and retrigger once we're done
> * expected downtime:  3.5 hours or so
> * i will also be testing out some of my shiny new ansible playbooks
> for the system updates!
>
i forgot one thing:

* moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR streaming source code

2015-09-16 Thread Renyi Xiong
got it, thanks a lot!

On Wed, Sep 16, 2015 at 10:14 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think Hao posted a link to the source code in the description of
> https://issues.apache.org/jira/browse/SPARK-6803
>
> On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin  wrote:
> > You should reach out to the speakers directly.
> >
> >
> > On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong 
> wrote:
> >>
> >> SparkR streaming is mentioned around page 17 of the PDF below; can anyone
> >> share the source code? (I could not find it on GitHub.)
> >>
> >>
> >>
> >>
> https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
> >>
> >>
> >> Thanks,
> >>
> >> Renyi.
> >
> >
>


Re: Enum parameter in ML

2015-09-16 Thread Stephen Boesch
There was a long thread about enums initiated by Xiangrui several months
back in which the final consensus was to use Java enums.  Is that
discussion (/decision) applicable here?

2015-09-16 17:43 GMT-07:00 Ulanov, Alexander :

> Hi Joseph,
>
>
>
> Strings sounds reasonable. However, there is no StringParam (only
> StringArrayParam). Should I create a new param type? Also, how can the user
> get all possible values of String parameter?
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Joseph Bradley [mailto:jos...@databricks.com]
> *Sent:* Wednesday, September 16, 2015 5:35 PM
> *To:* Feynman Liang
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Enum parameter in ML
>
>
>
> I've tended to use Strings.  Params can be created with a validator
> (isValid) which can ensure users get an immediate error if they try to pass
> an unsupported String.  Not as nice as compile-time errors, but easier on
> the APIs.
>
>
>
> On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang 
> wrote:
>
> We usually write a Java test suite which exercises the public API (e.g.
> DCT
> 
> ).
>
>
>
> It may be possible to create a sealed trait with singleton concrete
> instances inside of a serializable companion object, then just introduce a
> Param[SealedTrait] to the model (e.g. StreamingDecay PR
> ).
> However, this would require Java users to use
> CompanionObject$.ConcreteInstanceName to access enum values which isn't the
> prettiest syntax.
>
>
>
> Another option would just be to use Strings, which although is not type
> safe does simplify implementation.
>
>
>
> On Mon, Sep 14, 2015 at 5:43 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Hi Feynman,
>
>
>
> Thank you for suggestion. How can I ensure that there will be no problems
> for Java users? (I only use Scala API)
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Monday, September 14, 2015 5:27 PM
> *To:* Ulanov, Alexander
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Enum parameter in ML
>
>
>
> Since PipelineStages are serializable, the params must also be
> serializable. We also have to keep the Java API in mind. Introducing a new
> enum Param type may work, but we will have to ensure that Java users can
> use it without dealing with ClassTags (I believe Scala will create new
> types for each possible value in the Enum) and that it can be serialized.
>
>
>
> On Mon, Sep 14, 2015 at 4:31 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Dear Spark developers,
>
>
>
> I am currently implementing the Estimator in ML that has a parameter that
> can take several different values that are mutually exclusive. The most
> appropriate type seems to be Scala Enum (
> http://www.scala-lang.org/api/current/index.html#scala.Enumeration).
> However, the current ML API has the following parameter types:
>
> BooleanParam, DoubleArrayParam, DoubleParam, FloatParam, IntArrayParam,
> IntParam, LongParam, StringArrayParam
>
>
>
> Should I introduce a new parameter type in ML API that is based on Scala
> Enum?
>
>
>
> Best regards, Alexander
>
>
>
>
>
>
>


Re: Enum parameter in ML

2015-09-16 Thread Joseph Bradley
I've tended to use Strings.  Params can be created with a validator
(isValid) which can ensure users get an immediate error if they try to pass
an unsupported String.  Not as nice as compile-time errors, but easier on
the APIs.
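
For reference, a minimal sketch of that pattern (assuming the 1.5-era spark.ml
Param API; the param name, values, and ParamValidators.inArray whitelist here
are just an example, not an existing Spark param):

import org.apache.spark.ml.param.{Param, ParamValidators, Params}

trait HasSolver extends Params {
  // Allowed values exposed so callers can discover them.
  final val supportedSolvers: Array[String] = Array("l-bfgs", "gd")

  // String-valued param whose validator rejects anything outside the whitelist.
  final val solver: Param[String] = new Param[String](this, "solver",
    s"solver algorithm (${supportedSolvers.mkString(", ")})",
    ParamValidators.inArray[String](supportedSolvers))

  final def getSolver: String = $(solver)
}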

On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang 
wrote:

> We usually write a Java test suite which exercises the public API (e.g.
> DCT
> 
> ).
>
> It may be possible to create a sealed trait with singleton concrete
> instances inside of a serializable companion object, then just introduce a
> Param[SealedTrait] to the model (e.g. StreamingDecay PR
> ).
> However, this would require Java users to use
> CompanionObject$.ConcreteInstanceName to access enum values which isn't the
> prettiest syntax.
>
> Another option would just be to use Strings, which although is not type
> safe does simplify implementation.
>
> On Mon, Sep 14, 2015 at 5:43 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Hi Feynman,
>>
>>
>>
>> Thank you for suggestion. How can I ensure that there will be no problems
>> for Java users? (I only use Scala API)
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Feynman Liang [mailto:fli...@databricks.com]
>> *Sent:* Monday, September 14, 2015 5:27 PM
>> *To:* Ulanov, Alexander
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: Enum parameter in ML
>>
>>
>>
>> Since PipelineStages are serializable, the params must also be
>> serializable. We also have to keep the Java API in mind. Introducing a new
>> enum Param type may work, but we will have to ensure that Java users can
>> use it without dealing with ClassTags (I believe Scala will create new
>> types for each possible value in the Enum) and that it can be serialized.
>>
>>
>>
>> On Mon, Sep 14, 2015 at 4:31 PM, Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Dear Spark developers,
>>
>>
>>
>> I am currently implementing the Estimator in ML that has a parameter that
>> can take several different values that are mutually exclusive. The most
>> appropriate type seems to be Scala Enum (
>> http://www.scala-lang.org/api/current/index.html#scala.Enumeration).
>> However, the current ML API has the following parameter types:
>>
>> BooleanParam, DoubleArrayParam, DoubleParam, FloatParam, IntArrayParam,
>> IntParam, LongParam, StringArrayParam
>>
>>
>>
>> Should I introduce a new parameter type in ML API that is based on Scala
>> Enum?
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>
>


RE: Enum parameter in ML

2015-09-16 Thread Ulanov, Alexander
Hi Joseph,

Strings sounds reasonable. However, there is no StringParam (only 
StringArrayParam). Should I create a new param type? Also, how can the user get 
all possible values of String parameter?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Wednesday, September 16, 2015 5:35 PM
To: Feynman Liang
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Enum parameter in ML

I've tended to use Strings.  Params can be created with a validator (isValid) 
which can ensure users get an immediate error if they try to pass an 
unsupported String.  Not as nice as compile-time errors, but easier on the APIs.

On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang 
> wrote:
We usually write a Java test suite which exercises the public API (e.g. 
DCT).

It may be possible to create a sealed trait with singleton concrete instances 
inside of a serializable companion object, then just introduce a 
Param[SealedTrait] to the model (e.g. StreamingDecay 
PR).
 However, this would require Java users to use 
CompanionObject$.ConcreteInstanceName to access enum values which isn't the 
prettiest syntax.
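
A rough sketch of what that sealed-trait alternative might look like (all
names invented):

// Singleton values inside a serializable companion object.
sealed trait Solver extends Serializable
object Solver extends Serializable {
  case object LBFGS extends Solver
  case object GD extends Solver
  val supported: Seq[Solver] = Seq(LBFGS, GD)
}

From Java the values have to be reached through the generated MODULE$
fields/accessors, which is the awkward syntax mentioned above.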

Another option would just be to use Strings, which although is not type safe 
does simplify implementation.

On Mon, Sep 14, 2015 at 5:43 PM, Ulanov, Alexander 
> wrote:
Hi Feynman,

Thank you for suggestion. How can I ensure that there will be no problems for 
Java users? (I only use Scala API)

Best regards, Alexander

From: Feynman Liang [mailto:fli...@databricks.com]
Sent: Monday, September 14, 2015 5:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Enum parameter in ML

Since PipelineStages are serializable, the params must also be serializable. We 
also have to keep the Java API in mind. Introducing a new enum Param type may 
work, but we will have to ensure that Java users can use it without dealing 
with ClassTags (I believe Scala will create new types for each possible value 
in the Enum) and that it can be serialized.

On Mon, Sep 14, 2015 at 4:31 PM, Ulanov, Alexander 
> wrote:
Dear Spark developers,

I am currently implementing the Estimator in ML that has a parameter that can 
take several different values that are mutually exclusive. The most appropriate 
type seems to be Scala Enum 
(http://www.scala-lang.org/api/current/index.html#scala.Enumeration). However, 
the current ML API has the following parameter types:
BooleanParam, DoubleArrayParam, DoubleParam, FloatParam, IntArrayParam, 
IntParam, LongParam, StringArrayParam

Should I introduce a new parameter type in ML API that is based on Scala Enum?

Best regards, Alexander





Re: Enum parameter in ML

2015-09-16 Thread Joseph Bradley
@Alexander  It's worked for us to use Param[String] directly.  (I think
it's b/c String is exactly java.lang.String, rather than a Scala version of
it, so it's still Java-friendly.)  In other classes, I've added a static
list (e.g., NaiveBayes.supportedModelTypes), though there isn't consistent
coverage on that yet.

@Stephen  It could be used, but I prefer String for spark.ml since it's
easier to maintain consistent APIs across languages.  That's what we've
used so far, at least.

On Wed, Sep 16, 2015 at 6:00 PM, Stephen Boesch  wrote:

> There was a long thread about enum's initiated by Xiangrui several months
> back in which the final consensus was to use java enum's.  Is that
> discussion (/decision) applicable here?
>
> 2015-09-16 17:43 GMT-07:00 Ulanov, Alexander :
>
>> Hi Joseph,
>>
>>
>>
>> Strings sounds reasonable. However, there is no StringParam (only
>> StringArrayParam). Should I create a new param type? Also, how can the user
>> get all possible values of String parameter?
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Joseph Bradley [mailto:jos...@databricks.com]
>> *Sent:* Wednesday, September 16, 2015 5:35 PM
>> *To:* Feynman Liang
>> *Cc:* Ulanov, Alexander; dev@spark.apache.org
>> *Subject:* Re: Enum parameter in ML
>>
>>
>>
>> I've tended to use Strings.  Params can be created with a validator
>> (isValid) which can ensure users get an immediate error if they try to pass
>> an unsupported String.  Not as nice as compile-time errors, but easier on
>> the APIs.
>>
>>
>>
>> On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang 
>> wrote:
>>
>> We usually write a Java test suite which exercises the public API (e.g.
>> DCT
>> 
>> ).
>>
>>
>>
>> It may be possible to create a sealed trait with singleton concrete
>> instances inside of a serializable companion object, then just introduce a
>> Param[SealedTrait] to the model (e.g. StreamingDecay PR
>> ).
>> However, this would require Java users to use
>> CompanionObject$.ConcreteInstanceName to access enum values which isn't the
>> prettiest syntax.
>>
>>
>>
>> Another option would just be to use Strings, which although is not type
>> safe does simplify implementation.
>>
>>
>>
>> On Mon, Sep 14, 2015 at 5:43 PM, Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Hi Feynman,
>>
>>
>>
>> Thank you for suggestion. How can I ensure that there will be no problems
>> for Java users? (I only use Scala API)
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Feynman Liang [mailto:fli...@databricks.com]
>> *Sent:* Monday, September 14, 2015 5:27 PM
>> *To:* Ulanov, Alexander
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: Enum parameter in ML
>>
>>
>>
>> Since PipelineStages are serializable, the params must also be
>> serializable. We also have to keep the Java API in mind. Introducing a new
>> enum Param type may work, but we will have to ensure that Java users can
>> use it without dealing with ClassTags (I believe Scala will create new
>> types for each possible value in the Enum) and that it can be serialized.
>>
>>
>>
>> On Mon, Sep 14, 2015 at 4:31 PM, Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Dear Spark developers,
>>
>>
>>
>> I am currently implementing the Estimator in ML that has a parameter that
>> can take several different values that are mutually exclusive. The most
>> appropriate type seems to be Scala Enum (
>> http://www.scala-lang.org/api/current/index.html#scala.Enumeration).
>> However, the current ML API has the following parameter types:
>>
>> BooleanParam, DoubleArrayParam, DoubleParam, FloatParam, IntArrayParam,
>> IntParam, LongParam, StringArrayParam
>>
>>
>>
>> Should I introduce a new parameter type in ML API that is based on Scala
>> Enum?
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>>
>>
>>
>>
>
>


Re: New Spark json endpoints

2015-09-16 Thread Kevin Chen
Just wanted to bring this email up again in case there were any thoughts.
Having all the information from the web UI accessible through a supported
json API is very important to us; are there any objections to us adding a v2
API to Spark?

Thanks!

From:  Kevin Chen 
Date:  Friday, September 11, 2015 at 11:30 AM
To:  "dev@spark.apache.org" 
Cc:  Matt Cheah , Mingyu Kim 
Subject:  New Spark json endpoints

Hello Spark Devs,

 I noticed that [SPARK-3454], which introduces new json endpoints at
/api/v1/[path] for information previously only shown on the web UI, does not
expose several useful properties about Spark jobs that are exposed on the
web UI and on the unofficial /json endpoint.

 Specific examples include the maximum number of allotted cores per
application, amount of memory allotted to each slave, and number of cores
used by each worker. These are provided at ‘app.cores, app.memoryperslave,
and worker.coresused’ in the /json endpoint, and also all appear on the web
UI page.

 Is there any specific reason that these fields are not exposed in the
public API? If not, would it be reasonable to add them to the json blobs,
possibly in a future /api/v2 API?
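
For context, a minimal sketch of the kind of access being asked for
(illustrative only; the /api/v1/applications path and port 4040 are assumed
defaults for a live driver UI, and the extra fields would presumably just
appear in these JSON payloads):

import scala.io.Source

// Read the JSON application list from a locally running driver UI.
val appsJson = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(appsJson)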

Thank you,
Kevin Chen







RE: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Cheng, Hao
We actually met a similar problem in a real case; see
https://issues.apache.org/jira/browse/SPARK-10474

After checking the source code, the external sort memory management strategy
seems to be the root cause of the issue.

Currently, we allocate the 4MB (page size) buffer initially at the beginning of
the sort, and while processing each input record we can run into the cycle of
spill => deallocate buffer => try to allocate a buffer of twice the size. I
know this strategy is quite flexible in some cases. However, in a data-skew
case, say 2 tasks with a large number of records running on a single executor,
the ever-growing buffer strategy will eventually exhaust the pre-set off-heap
memory threshold, and then an exception is thrown like the one we've seen.

Can we just take a simpler memory management strategy for the external sorter,
like:
Step 1) Allocate a fixed-size buffer for the current task (maybe
MAX_MEMORY_THRESHOLD / (2 * PARALLEL_TASKS_PER_EXECUTOR))
Step 2) for (record in the input) { if (hasMemoryForRecord(record))
insert(record) else { spill(buffer); insert(record) } }
Step 3) Deallocate(buffer)

Probably we'd better move the discussion to the JIRA.
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, September 17, 2015 12:28 AM
To: Pete Robbins
Cc: Dev
Subject: Re: Unable to acquire memory errors in HiveCompatibilitySuite

SparkEnv for the driver was created in SparkContext. The default parallelism 
field is set to the number of slots (max number of active tasks). Maybe we can 
just use the default parallelism to compute that in local mode.
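
A sketch of that idea (illustrative; whether defaultParallelism is the right
number to use is exactly the open question in this thread):

// In local mode, size pages from the number of task slots (defaultParallelism)
// rather than the physical core count.
def coresForPageSizing(sc: org.apache.spark.SparkContext): Int =
  if (sc.master.startsWith("local")) sc.defaultParallelism
  else Runtime.getRuntime.availableProcessors()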

On Wednesday, September 16, 2015, Pete Robbins 
> wrote:
so forcing the ShuffleMemoryManager to assume 32 cores and therefore calculate 
a pagesize of 1MB passes the tests.
How can we determine the correct value to use in getPageSize rather than 
Runtime.getRuntime.availableProcessors()?

On 16 September 2015 at 10:17, Pete Robbins 
> 
wrote:
I see what you are saying. Full stack trace:

java.io.IOException: Unable to acquire 4194304 bytes of memory
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:349)
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:478)
  at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 

Re: New Spark json endpoints

2015-09-16 Thread Reynold Xin
Do we need to increment the version number if it is just strict additions?


On Wed, Sep 16, 2015 at 7:10 PM, Kevin Chen  wrote:

> Just wanted to bring this email up again in case there were any thoughts.
> Having all the information from the web UI accessible through a supported
> json API is very important to us; are there any objections to us adding a
> v2 API to Spark?
>
> Thanks!
>
> From: Kevin Chen 
> Date: Friday, September 11, 2015 at 11:30 AM
> To: "dev@spark.apache.org" 
> Cc: Matt Cheah , Mingyu Kim 
> Subject: New Spark json endpoints
>
> Hello Spark Devs,
>
>  I noticed that [SPARK-3454], which introduces new json endpoints at
> /api/v1/[path] for information previously only shown on the web UI, does
> not expose several useful properties about Spark jobs that are exposed on
> the web UI and on the unofficial /json endpoint.
>
>  Specific examples include the maximum number of allotted cores per
> application, amount of memory allotted to each slave, and number of cores
> used by each worker. These are provided at ‘app.cores, app.memoryperslave,
> and worker.coresused’ in the /json endpoint, and also all appear on the web
> UI page.
>
>  Is there any specific reason that these fields are not exposed in the
> public API? If not, would it be reasonable to add them to the json blobs,
> possibly in a future /api/v2 API?
>
> Thank you,
> Kevin Chen
>
>


Re: SparkR streaming source code

2015-09-16 Thread Shivaram Venkataraman
I think Hao posted a link to the source code in the description of
https://issues.apache.org/jira/browse/SPARK-6803

On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin  wrote:
> You should reach out to the speakers directly.
>
>
> On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong  wrote:
>>
>> SparkR streaming is mentioned around page 17 of the PDF below; can anyone
>> share the source code? (I could not find it on GitHub.)
>>
>>
>>
>> https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
>>
>>
>> Thanks,
>>
>> Renyi.
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org