Re: partitionBy causing OOM

2017-09-26 Thread Amit Sela
> On Mon, Sep 25, 2017 at 7:30 PM, 孫澤恩 <gn00710...@gmail.com> wrote: >> >>> Hi, Amit, >>> >>> Maybe you can change this configuration spark.sql.shuffle.partitions. >>> The default is 200; changing this property changes the number of tasks >>
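
A minimal sketch of the suggestion above, assuming a Spark 2.x SparkSession; the application name and the value 1000 are illustrative, and whether to raise or lower the setting depends on the job:

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-example")
      .master("local[*]")
      // Raise the number of shuffle partitions so each shuffle task
      // holds less data in memory (the default is 200).
      .config("spark.sql.shuffle.partitions", "1000")
      .getOrCreate()

    // The setting can also be changed at runtime, before a shuffle-heavy query:
    spark.conf.set("spark.sql.shuffle.partitions", "1000")

    spark.stop()
  }
}
```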

partitionBy causing OOM

2017-09-25 Thread Amit Sela
I'm trying to run a simple pyspark application that reads from file (json), flattens it (explode) and writes back to file (json) partitioned by date using DataFrameWriter.partitionBy(*cols). I keep getting OOMEs like: java.lang.OutOfMemoryError: Java heap space at
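
A hedged Scala sketch of the pipeline described (the original post is PySpark); the input/output paths and the "events"/"date" column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object PartitionByDateJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-by-date").getOrCreate()

    // Read the input JSON, flatten the nested array column, and write back
    // partitioned by date, as described in the post.
    val input = spark.read.json("in/path")              // hypothetical path
    val flattened = input
      .withColumn("event", explode(input("events")))    // "events" is a hypothetical array column
      .drop("events")

    flattened.write
      .partitionBy("date")                              // one output directory per date value
      .json("out/path")                                 // hypothetical path

    spark.stop()
  }
}
```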

Re: Union of DStream and RDD

2017-02-11 Thread Amit Sela
? It's my > motivation and checkpointing my problem as well. > > 2017-02-08 22:02 GMT-08:00 Amit Sela <amitsel...@gmail.com>: > > Not with checkpointing. > > On Thu, Feb 9, 2017, 04:58 Egor Pahomov <pahomov.e...@gmail.com> wrote: > > Just guessing here, but would > http

Re: Union of DStream and RDD

2017-02-08 Thread Amit Sela
e DStream from your > RDD and then union with the other DStream. > > 2017-02-08 12:32 GMT-08:00 Amit Sela <amitsel...@gmail.com>: > > Hi all, > > I'm looking to union a DStream and RDD into a single stream. > One important note is that the RDD has to be added to
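
A sketch of the suggested approach: wrap the one-off RDD in a queueStream with oneAtATime = true and union it with the live stream. The socket source and the values are illustrative, and, as noted elsewhere in the thread, this does not survive checkpointing:

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnionRddWithDStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-rdd-dstream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val onceRdd: RDD[String] = ssc.sparkContext.parallelize(Seq("a", "b", "c"))

    // queueStream with oneAtATime = true emits the RDD in a single batch only,
    // which matches the "added just once" requirement.
    val onceStream = ssc.queueStream(mutable.Queue(onceRdd), oneAtATime = true)

    val liveStream = ssc.socketTextStream("localhost", 9999) // hypothetical live source

    liveStream.union(onceStream).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```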

Union of DStream and RDD

2017-02-08 Thread Amit Sela
Hi all, I'm looking to union a DStream and RDD into a single stream. One important note is that the RDD has to be added to the DStream just once. Ideas ? Thanks, Amit

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Amit Sela
er it before restarting the stream from checkpoints. > > On Tue, Feb 7, 2017 at 3:55 PM, Amit Sela <amitsel...@gmail.com> wrote: > > I know this approach, only thing is, it relies on the transformation being > an RDD transfomration as well and so could be applied via foreachRDD and

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Amit Sela
7, 2017 at 8:12 AM, Amit Sela <amitsel...@gmail.com> wrote: > > Hi all, > > I was wondering if anyone ever used a broadcast variable within > an updateStateByKey op. ? Using it is straight-forward but I was wondering > how it'll work after resuming from checkpoint (using the

Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Amit Sela
Hi all, I was wondering if anyone ever used a broadcast variable within an updateStateByKey op. ? Using it is straight-forward but I was wondering how it'll work after resuming from checkpoint (using the rdd.context() trick is not possible here) ? Thanks, Amit
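
For reference, the lazily re-created singleton pattern from the Spark Streaming guide (sketched below, contents hypothetical) is what works inside foreachRDD/transform, where a SparkContext is at hand; the open question in this thread is that updateStateByKey's update function offers no such handle:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily re-created singleton: after the driver restarts from a checkpoint the
// broadcast is rebuilt instead of relying on a checkpointed reference.
object BlacklistBroadcast {
  @volatile private var instance: Broadcast[Set[String]] = _

  def getInstance(sc: SparkContext): Broadcast[Set[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Set("a", "b", "c")) // hypothetical contents
        }
      }
    }
    instance
  }
}
```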

RDD getPartitions() size and HashPartitioner numPartitions

2016-12-02 Thread Amit Sela
This might be a silly question, but I wanted to make sure: when implementing my own RDD with a HashPartitioner as its partitioner, the number of partitions returned by the implementation of getPartitions() has to match the number of partitions set in the HashPartitioner, correct ?
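
A minimal sketch of that constraint, with a hypothetical custom RDD: the partition count reported by getPartitions() and the numPartitions of the HashPartitioner it exposes must agree, otherwise operations that trust the partitioner will misroute keys:

```scala
import org.apache.spark.{HashPartitioner, Partition, Partitioner, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical custom RDD whose getPartitions() size matches its HashPartitioner.
class MyKeyedRdd(sc: SparkContext, numParts: Int)
    extends RDD[(String, Int)](sc, Nil) {

  // The partitioner and getPartitions() must agree on the partition count.
  override val partitioner: Option[Partitioner] = Some(new HashPartitioner(numParts))

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map(i => new Partition { override def index: Int = i }).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
    Iterator.empty // hypothetical: would produce only keys that hash to split.index
}
```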

Fault-tolerant Accumulators in DStream-only transformations.

2016-11-29 Thread Amit Sela
Hi all, In order to recover Accumulators (functionally) from a Driver failure, it is recommended to use it within a foreachRDD/transform and use the RDD context with a Singleton wrapping the Accumulator as shown in the examples
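
A sketch of the recommended pattern the post refers to (Spark 2.0-style accumulator API, names hypothetical): a lazily registered singleton accessed through the RDD's SparkContext inside foreachRDD:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.util.LongAccumulator

// Lazy singleton wrapping the accumulator: after a driver restart from a
// checkpoint the accumulator is re-registered rather than deserialized.
object DroppedRecordsCounter {
  @volatile private var instance: LongAccumulator = _

  def getOrCreate(sc: SparkContext): LongAccumulator = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.longAccumulator("droppedRecords")
        }
      }
    }
    instance
  }
}

object StreamingJob {
  // Inside foreachRDD the RDD gives access to a live SparkContext.
  def wire(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      val acc = DroppedRecordsCounter.getOrCreate(rdd.sparkContext)
      rdd.foreach(record => if (record.isEmpty) acc.add(1))
    }
}
```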

Does MapWithState follow with a shuffle ?

2016-11-29 Thread Amit Sela
Hi all, I've been digging into MapWithState code (branch 1.6), and I came across the compute implementation in *InternalMapWithStateDStream*. Looking at

Fault-tolerant Accumulators in stateful operators.

2016-11-22 Thread Amit Sela
Hi all, To recover (functionally) Accumulators from Driver failure in a streaming application, we wrap them in a "getOrCreate" Singleton as shown here.

Many Spark metric names do not include the application name

2016-10-27 Thread Amit Sela
Hi guys, It seems that JvmSource / DAGSchedulerSource / BlockManagerSource / ExecutorAllocationManager and other metrics sources (except for the StreamingSource) publish their metrics directly under the "driver" fragment (or its executor counterpart) of the metric path without including the

Re: Subscribe

2016-09-26 Thread Amit Sela
Please Subscribe via the mailing list as described here: http://beam.incubator.apache.org/use/mailing-lists/ On Mon, Sep 26, 2016, 12:11 Lakshmi Rajagopalan wrote: > >

Dropping late data in Structured Streaming

2016-08-06 Thread Amit Sela
I've noticed that when using Structured Streaming with event-time windows (fixed/sliding), all windows are retained. This is clearly how "late" data is handled, but I was wondering if there is some pruning mechanism that I might have missed ? or is this planned in future releases ? Thanks, Amit
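
For context, later releases (Spark 2.1+) added event-time watermarking, which is the pruning mechanism being asked about. A hedged sketch using the built-in rate source (available from Spark 2.2); the durations are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkedWindows {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("watermark-example").master("local[*]").getOrCreate()

    // The rate source emits rows with a "timestamp" column we can treat as event time.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // withWatermark lets Spark drop state (and late data) for windows older than
    // the watermark, instead of retaining every window forever.
    val counts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(events("timestamp"), "5 minutes"))
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```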

Re: spark 2.0 readStream from a REST API

2016-08-01 Thread Amit Sela
I think you're missing: val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() Did it help? On Mon, Aug 1, 2016 at 2:44 PM Jacek Laskowski wrote: > On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali > wrote: > >

init() and cleanup() for Spark map functions

2016-07-21 Thread Amit Sela
I have a use case where I use Spark (streaming) as a way to distribute a set of computations, which requires (some) of the computations to call an external service. Naturally, I'd like to manage my connections (per executor/worker). I know this pattern for DStream:
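
A sketch of the per-partition init/cleanup pattern alluded to, with a hypothetical external-service client: the connection is opened once per partition on the executor and closed when the partition is done:

```scala
import org.apache.spark.streaming.dstream.DStream

object ConnectionPerPartition {
  // Hypothetical external-service client.
  class ServiceClient {
    def send(record: String): Unit = ()
    def close(): Unit = ()
  }

  def wire(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = new ServiceClient()   // "init": once per partition, on the executor
        try records.foreach(client.send)   // reuse the connection for all records
        finally client.close()             // "cleanup": even if the partition fails
      }
    }
}
```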

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-07-01 Thread Amit Sela
AM Koert Kuipers <ko...@tresata.com> wrote: > it's the difference between a semigroup and a monoid, and yes max does not > easily fit into a monoid. > > see also discussion here: > https://issues.apache.org/jira/browse/SPARK-15598 > > On Mon, Jun 27, 2016 at 3:19 AM, Amit

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-27 Thread Amit Sela
ttps://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L115 > > // maropu > > > On Sun, Jun 26, 2016 at 8:03 PM, Amit Sela <amitsel...@gmail.com> wrote: > >> This "if (value == nu

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
> codegen removes all the aggregates. > See: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199 > > // maropu > > On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela <amitsel...@gmail.com> wrote: > >

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
+ zero() = b` > Then your case `b + null = null` breaks this rule. > This is the same with v1.6.1. > See: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57 > > // maropu > > > On Sun, Jun 26, 2016 at 6:

Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
Sometimes, the BUF for the aggregator may depend on the actual input.. and while this passes the responsibility to handle null in merge/reduce to the developer, it sounds fine to me if he is the one who put null in zero() anyway. Now, it seems that the aggregation is skipped entirely when zero() =
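
One way to keep the `b + zero() = b` contract discussed in the thread without a null zero(), assuming the Spark 2.0 Aggregator API: use Option as the buffer so None is a true identity for max. The encoders and names are illustrative choices for the sketch:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Max over Longs with a non-null zero(): None is the identity, so the
// aggregation is never skipped and merge/reduce never see null.
object MaxAgg extends Aggregator[Long, Option[Long], Option[Long]] {
  def zero: Option[Long] = None

  def reduce(buf: Option[Long], value: Long): Option[Long] =
    Some(buf.map(math.max(_, value)).getOrElse(value))

  def merge(b1: Option[Long], b2: Option[Long]): Option[Long] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(math.max(x, y))
    case (x, y)             => x.orElse(y)
  }

  def finish(buf: Option[Long]): Option[Long] = buf

  // Kryo is just a simple encoder choice for this sketch.
  def bufferEncoder: Encoder[Option[Long]] = Encoders.kryo[Option[Long]]
  def outputEncoder: Encoder[Option[Long]] = Encoders.kryo[Option[Long]]
}
```

It would be used as, e.g., `ds.select(MaxAgg.toColumn)` on a `Dataset[Long]`.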

Are ser/de optimizations relevant with Dataset API and Encoders ?

2016-06-19 Thread Amit Sela
With RDD API, you could optimize shuffling data by making sure that bytes are shuffled instead of objects and using the appropriate ser/de mechanism before and after the shuffle, for example: Before parallelize, transform to bytes using a dedicated serializer, parallelize, and immediately after
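
A toy sketch of the RDD-era pattern described above, with an illustrative UTF-8 "coder": values cross the shuffle as bytes and are decoded immediately after it:

```scala
import java.nio.charset.StandardCharsets

import org.apache.spark.{SparkConf, SparkContext}

object BytesAcrossShuffle {
  // A stand-in for a dedicated serializer/coder.
  def encode(s: String): Array[Byte] = s.getBytes(StandardCharsets.UTF_8)
  def decode(b: Array[Byte]): String = new String(b, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bytes-shuffle").setMaster("local[*]"))

    val words = Seq("a", "bb", "cc", "d")

    val grouped = sc
      .parallelize(words.map(w => (w.length, encode(w)))) // values shuffled as bytes
      .groupByKey()                                        // the shuffle
      .mapValues(vs => vs.map(decode).toList)              // deserialize right after

    grouped.collect().foreach(println)
    sc.stop()
  }
}
```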

Re: LegacyAccumulatorWrapper basically requires the Accumulator value to implement equals() or it will fail on isZero()

2016-06-13 Thread Amit Sela
general if handing it over for a library to compare, add, clear, etc. > > On Mon, Jun 13, 2016 at 8:15 PM, Amit Sela <amitsel...@gmail.com> wrote: > > It seems that if you have an AccumulatorParam (or AccumulableParam) where > > "R" is not a primitive, it will nee

LegacyAccumulatorWrapper basically requires the Accumulator value to implement equals() or it will fail on isZero()

2016-06-13 Thread Amit Sela
It seems that if you have an AccumulatorParam (or AccumulableParam) where "R" is not a primitive, it will need to implement equals() if the implementation of zero() creates a new instance (which I guess it will in those cases). This is where isZero applies the comparison:
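
A sketch of the situation described, with a hypothetical accumulator value type: making R a case class gives it value-based equals(), so the isZero comparison against a freshly created zero() instance behaves as expected, whereas a plain class with default reference equality would not:

```scala
import org.apache.spark.AccumulatorParam

// Hypothetical non-primitive accumulator value; case class => structural equals().
case class Stats(count: Long, sum: Long)

object StatsAccumulatorParam extends AccumulatorParam[Stats] {
  def zero(initial: Stats): Stats = Stats(0L, 0L) // creates a new instance
  def addInPlace(a: Stats, b: Stats): Stats = Stats(a.count + b.count, a.sum + b.sum)
}

// Usage: sc.accumulator(Stats(0L, 0L))(StatsAccumulatorParam); the wrapper's
// isZero check then compares by value rather than by reference.
```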

Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Amit Sela
t; On Sun, May 22, 2016 at 2:50 PM, Amit Sela <amitsel...@gmail.com> wrote: > >> I've been using Encoders with Kryo to support encoding of generically >> typed Java classes, mostly with success, in the following manner: >> >> public static Encoder encoder() { >>
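
A Scala-flavoured sketch of the generic Kryo-backed encoder helper quoted above (the Java original is truncated); the Box class is hypothetical:

```scala
import scala.reflect.ClassTag

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// One Kryo-backed encoder for any (generically typed) class,
// instead of a bean/product encoder.
object KryoEncoders {
  def encoder[T: ClassTag]: Encoder[T] = Encoders.kryo[T]
}

object KryoEncoderExample {
  // Hypothetical generically typed value class.
  class Box[T](val value: T) extends Serializable

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kryo-encoder").master("local[*]").getOrCreate()

    implicit val boxEncoder: Encoder[Box[String]] = KryoEncoders.encoder[Box[String]]
    val ds: Dataset[Box[String]] = spark.createDataset(Seq(new Box("a"), new Box("bb")))

    println(ds.map(_.value.length)(Encoders.scalaInt).collect().toSeq)
    spark.stop()
  }
}
```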

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-22 Thread Amit Sela
I need to update this ;) To start with, you could just take a look at branch-2.0. On Sun, May 22, 2016, 01:23 Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr> wrote: > Thank you, Amit! I was looking for this kind of information. > > I did not fully read your paper, I see in it a TODO with

Re: Datasets are extremely slow in comparison to RDDs in standalone-mode WordCount example

2016-05-13 Thread Amit Sela
HO if you have complex SQL queries, it makes sense to use DS/DF but if > you don't, then probably using RDDs directly is still faster. > > > Renato M. > > 2016-05-11 20:17 GMT+02:00 Amit Sela <amitsel...@gmail.com>: > >> Somehow missed that ;) >> Anythi

Re: Datasets are extremely slow in comparison to RDDs in standalone-mode WordCount example

2016-05-11 Thread Amit Sela
Somehow missed that ;) Anything about Datasets slowness ? On Wed, May 11, 2016, 21:02 Ted Yu <yuzhih...@gmail.com> wrote: > Which release are you using ? > > You can use the following to disable UI: > --conf spark.ui.enabled=false > > On Wed, May 11, 2016 at 10:5

Datasets are extremely slow in comparison to RDDs in standalone-mode WordCount example

2016-05-11 Thread Amit Sela
I ran a simple WordCount example with a very small List as input lines, in standalone mode (local[*]), and Datasets are very slow. We're talking ~700 msec for RDDs while Datasets take ~3.5 sec. Is this just start-up overhead ? Please note that I'm not timing the context creation... And
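
For reference, a minimal Dataset WordCount of the kind described; the input lines are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object DatasetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-wordcount").master("local[*]").getOrCreate()
    import spark.implicits._

    // A tiny in-memory input, as in the post.
    val lines = Seq("a b a", "b c").toDS()

    lines
      .flatMap(_.split(" "))
      .groupByKey(identity)
      .count()
      .show()

    spark.stop()
  }
}
```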

Re: Datasets combineByKey

2016-04-10 Thread Amit Sela
I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking for, makes sense ? On Sun, Apr 10, 2016 at 4:08 PM Amit Sela <amitsel...@gmail.com> wrote: > I'm mapping RDD API to Datasets API and I was wondering if I was missing > something or is this functionalit
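
A sketch of that idea, assuming the Spark 2.0 API: a typed Aggregator's (zero, reduce, merge) plays the role of combineByKey's (createCombiner, mergeValue, mergeCombiners), applied per key via groupByKey. Names and encoders are illustrative:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Per-key sum: zero/reduce/merge mirror combineByKey's three functions.
object SumAgg extends Aggregator[(String, Int), Int, Int] {
  def zero: Int = 0
  def reduce(buf: Int, kv: (String, Int)): Int = buf + kv._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(buf: Int): Int = buf
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

object CombinePerKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("agg-per-key").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    // groupByKey + the typed Aggregator column gives combineByKey-like behaviour.
    ds.groupByKey(_._1).agg(SumAgg.toColumn).show()

    spark.stop()
  }
}
```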

Re: Datasets combineByKey

2016-04-10 Thread Amit Sela
> Thanks > > On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela <amitsel...@gmail.com> wrote: > >> Is there (planned ?) a combineByKey support for Dataset ? >> Is / Will there be a support for combiner lifting ? >> >> Thanks, >> Amit >> > >

Datasets combineByKey

2016-04-09 Thread Amit Sela
Is there (planned ?) a combineByKey support for Dataset ? Is / Will there be a support for combiner lifting ? Thanks, Amit