Hi,
I would like to help with optimizing Spark memory usage. I have some experience
with off-heap and managed memory; for example, I modified Hazelcast to run with
'-Xmx128M' [1], and XAP from GigaSpaces uses my memory store.
I have already studied the Spark code, read blogs, watched videos, etc., but I
have questions.
1. I'm using about 1 million users against a few thousand products; I
basically have around a million ratings.
2. Spark 1.6 on Amazon EMR
On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath wrote:
> Could you provide more details about:
> 1. Data set size (# ratings, # users, and # products)
> 2. Spark cluster setup and version
Could you provide more details about:
1. Data set size (# ratings, # users, and # products)
2. Spark cluster setup and version
Thanks
On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote:
> Hello All,
>
> I've been running Spark's ALS on a dataset of users and rated items. I
> first encode
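For context at the scale described above (roughly a million ratings, about 1M
users, and a few thousand products on Spark 1.6), a minimal sketch of running
MLlib's ALS over such a dataset might look like the following. The input path,
rank, iteration count, and regularization are illustrative assumptions, not
values taken from this thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch"))

    // Hypothetical input: one "userId,productId,rating" triple per line.
    // The encoding step the original poster mentions is not shown here.
    val ratings = sc.textFile("s3://my-bucket/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Rank and iteration count are guesses; at ~1M ratings the data is
    // small, so memory pressure tends to come from the shuffles rather
    // than the factor matrices themselves.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Example: top 5 product recommendations for a single user id.
    model.recommendProducts(42, 5).foreach(println)
    sc.stop()
  }
}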
Hi everyone,
I have a question about the shuffle mechanisms in Spark and the fault-tolerance
I should expect. Suppose I have a simple job with two stages – something like
sc.textFile().mapToPair().reduceByKey().saveAsTextFile().
The questions I have are:
1. Suppose I’m not using the external shuffle service
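For reference, a Scala equivalent of the two-stage job sketched above
(mapToPair is the Java API's way of producing a key-value RDD; the paths and
the one-word-per-line assumption are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object TwoStageJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-stage-job"))

    sc.textFile("hdfs:///input/words.txt")     // stage 1 begins here
      .map(word => (word, 1))                  // narrow dependency, same stage
      .reduceByKey(_ + _)                      // shuffle boundary -> stage 2
      .saveAsTextFile("hdfs:///output/counts") // action triggers both stages

    sc.stop()
  }
}

The stage boundary sits at reduceByKey: stage 1 writes shuffle files for its
map output and stage 2 fetches them, which is exactly where the
fault-tolerance question above applies.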
Spark 1.6.1 is a maintenance release containing stability fixes. This
release is based on the branch-1.6 maintenance branch of Spark. We
*strongly recommend* that all 1.6.0 users upgrade to this release.
Notable fixes include:
- Workaround for OOM when writing large partitioned tables (SPARK-12546)
The central problem with doing anything like this is that you break
one of the basic guarantees of Kafka, which is in-order processing on
a per-topic-partition basis.
As far as PRs go, because of the new consumer interface for Kafka 0.9
and 0.10, there's a lot of potential change already underway.
Hi TD,
Thanks a lot for offering to look at our PR (if we file one) at the
conference in NYC.
As we discussed briefly, given the issues of unbalanced and
under-distributed Kafka partitions we hit when developing Spark
Streaming applications in Mobius (C# for Spark), we're trying the
option of repartitioning within the application; a rough sketch of
that option follows.
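A minimal sketch of that option, using the Spark 1.6 direct Kafka stream in
Scala rather than Mobius (the broker address, topic name, and partition count
are all hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-repartition-sketch"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // Spread records across more tasks than there are Kafka partitions.
    // Note this is a shuffle, so it gives up the per-topic-partition
    // ordering guarantee discussed earlier in the thread.
    val rebalanced = stream.repartition(64)

    rebalanced.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}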
Hi,
Can this be considered a lag in processing of events? Should we report
this as a delay?
On Thu, Mar 10, 2016 at 10:51 AM, Mario Ds Briggs wrote:
> Look at
> org.apache.spark.streaming.scheduler.JobGenerator
>
> it has a RecurringTimer (timer) that will simply post 'GenerateJobs'
> events to an EventLoop
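For readers without the source handy, the pattern being described (a
fixed-rate timer whose only job is to enqueue generate-jobs events for a
separate loop to drain) can be sketched as below. RecurringTimer and EventLoop
are private Spark classes, so this is an illustrative analogue built on
java.util.concurrent, not Spark's actual code:

import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

object JobGeneratorSketch {
  sealed trait Event
  final case class GenerateJobs(timeMs: Long) extends Event

  def main(args: Array[String]): Unit = {
    val events = new LinkedBlockingQueue[Event]()

    // Analogue of RecurringTimer: every batch interval, just post an event.
    val timer = Executors.newSingleThreadScheduledExecutor()
    timer.scheduleAtFixedRate(new Runnable {
      def run(): Unit = events.put(GenerateJobs(System.currentTimeMillis()))
    }, 0, 1000, TimeUnit.MILLISECONDS)

    // Analogue of EventLoop: a separate consumer drains and handles events.
    // If handling one batch takes longer than the interval, events pile up
    // in the queue; that backlog is what surfaces as scheduling delay.
    while (true) {
      events.take() match {
        case GenerateJobs(t) => println(s"generate jobs for batch at $t")
      }
    }
  }
}

On this model, the timer keeps firing on schedule, so a backlog shows up as
delay in processing rather than as lost events.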
They should be identical. Can you paste the detailed explain output?
On Thursday, March 10, 2016, FangFang Chen wrote:
> Hi,
> Based on my testing, the memory cost is very different for:
> 1. sql("select * from ...").groupBy(...).agg(...)
> 2. sql("select ... from ... group by ...")
>
> For table partitions
Hi,
Based on my testing, the memory cost is very different for:
1. sql("select * from ...").groupBy(...).agg(...)
2. sql("select ... from ... group by ...")
For table partitions larger than 500 GB, #2 runs fine, while #1 hits
OutOfMemoryError. I am using the same Spark configuration for both.
Could somebody tell me why?
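A minimal way to check whether the two formulations really produce the same
plan (the table and column names here are hypothetical) is to build both on
Spark 1.6 and print the extended explain output, as requested above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object ComparePlans {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compare-plans"))
    val sqlContext = new SQLContext(sc)

    // 1. Select everything, then aggregate through the DataFrame API.
    val viaApi = sqlContext.sql("select * from sales")
      .groupBy("region")
      .agg(sum("amount"))

    // 2. Push the projection and aggregation into the SQL text itself.
    val viaSql = sqlContext.sql(
      "select region, sum(amount) from sales group by region")

    // explain(true) prints the logical, optimized, and physical plans.
    // If the extra columns from "select *" in #1 survive past the
    // optimizer, that would account for the higher memory cost.
    viaApi.explain(true)
    viaSql.explain(true)
  }
}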