Re: Questions for platform to choose

2019-08-22 Thread Liam Clarke-Hutchinson
Hi Eliza,

As I mentioned on the Kafka mailing list when you asked this question there,
there are pros and cons to all of the technologies you've mentioned, and you
really need to sit down and try each solution to see what suits your needs
best.

Kind regards,

Liam Clarke

On Wed, Aug 21, 2019 at 9:46 PM Magnus Nilsson  wrote:

> Well, you are posting on the Spark mailing list. Still, for streaming I'd
> recommend Flink over Spark any day of the week. Flink was written as a
> streaming platform from the beginning, and it quickly aligned its API with
> the theoretical framework of Google's Dataflow whitepaper. It's excellent
> for streaming; Spark, not so much so far. That might improve, but the
> initial use case for Spark wasn't streaming, and they may or may not
> overcome that. I'd still go with Flink for streaming.
>
> If you need cross platform support you can take a look at Beam. Beam has
> Dataflow, Spark and Flink runners among others.
>
> Regards,
>
> Magnus
>
> On Wed, Aug 21, 2019 at 8:43 AM Eliza  wrote:
>
>> Hello,
>>
>> We have Spark, Flink, Storm, and Kafka all installed.
>> For real-time streaming computation, which of these is the best?
>> Like other big players, the logs in our stack are huge.
>>
>> Thanks.
>>


RDD size in memory - Array[String] vs. case classes

2014-10-10 Thread Liam Clarke-Hutchinson
Hi all,

I'm playing with Spark currently as a possible solution at work, and I've
been recently working out a rough correlation between our input data size
and RAM needed to cache an RDD that will be used multiple times in a job.

As part of this I've been trialling different methods of representing the
data, and I came across a result that surprised me, so I just wanted to
check what I was seeing.

My data set consists of CSV records with approximately 17 fields. When I
load my sample data set locally and cache it after splitting on the comma as
an RDD[Array[String]], the Spark UI shows that 8% of the RDD can be cached
in the available RAM.

When I cache it as an RDD of a case class instead, 11% of the RDD is
cacheable, so the case classes are actually taking up less serialized space
than the arrays.

Is it because the case class represents numbers as actual numbers, whereas
the string array keeps them as strings?
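For what it's worth, here is a minimal self-contained sketch of the two
representations. The Event fields are hypothetical stand-ins for the 17 CSV
columns (the original schema isn't shown), and no Spark is involved; it just
illustrates that in the case class the numeric fields become unboxed
primitives, while in the Array[String] form every field stays a full JVM
String object (header plus character data):

```scala
// Hypothetical 3-field row standing in for the ~17-field CSV described above.
case class Event(userId: String, count: Int, latencyMs: Double)

object RowSizeSketch {
  // Parse one CSV line into the typed representation.
  def parse(line: String): Event = {
    val f = line.split(",")
    Event(f(0), f(1).toInt, f(2).toDouble)
  }

  def main(args: Array[String]): Unit = {
    val line = "user42,17,3.5"

    val asArray: Array[String] = line.split(",") // every field remains a String
    val asCaseClass: Event = parse(line)         // numeric fields are primitives

    // "17" as a String is a whole object; count is a 4-byte unboxed Int field.
    assert(asArray(1) == "17")
    assert(asCaseClass.count == 17)
    println(asCaseClass)
  }
}
```

If that intuition is right, it would be consistent with the case-class RDD
fitting better in the cache: primitive Int/Double fields avoid the per-String
object overhead that the all-String array pays for each numeric column.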

Cheers,

Liam Clarke