Contributing to managed memory, Tungsten

2016-03-10 Thread Jan Kotek
Hi, I would like to help with optimizing Spark memory usage. I have some experience with off-heap and managed memory. For example, I modified Hazelcast to run with '-Xmx128M' [1], and XAP from GigaSpaces uses my memory store. I have already studied the Spark code, read blogs, watched videos, etc. But I have qu
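For readers who want to experiment with Spark's managed off-heap (Tungsten) memory, here is a minimal Scala sketch of the Spark 1.6 configuration, assuming the stock spark.memory.offHeap.* settings; the application name and 256 MB size are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: enabling Tungsten's off-heap execution memory in Spark 1.6.
    // The size is given in bytes; 268435456 = 256 MB is purely illustrative.
    val conf = new SparkConf()
      .setAppName("offheap-demo")
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "268435456")
    val sc = new SparkContext(conf)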

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Deepak Gopalakrishnan
1. I'm using about 1 million users against a few thousand products. I basically have around a million ratings. 2. Spark 1.6 on Amazon EMR. On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath wrote: > Could you provide more details about: > 1. Data set size (# ratings, # users and # products) > 2. Spark
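For context, a minimal sketch of training ALS on such a ratings RDD with the Spark 1.6 MLlib API; the rank, iteration count, and regularization values are illustrative placeholders, not the poster's actual settings:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // Sketch of the setup described above: ~1M users, a few thousand
    // products, ~1M ratings. ALS iterates over the data, so cache it first.
    def trainModel(ratings: RDD[Rating]) = {
      ratings.cache()
      ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)
    }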

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about: 1. Data set size (# ratings, # users and # products) 2. Spark cluster setup and version. Thanks. On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote: > Hello All, > > I've been running Spark's ALS on a dataset of users and rated items. I > first encode

Understanding fault tolerance in shuffle operations

2016-03-10 Thread Matt Cheah
Hi everyone, I have a question about the shuffle mechanisms in Spark and the fault tolerance I should expect. Suppose I have a simple job with two stages – something like rdd.textFile().mapToPair().reduceByKey().saveAsTextFile(). The questions I have are: 1. Suppose I'm not using the exter
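For reference, the two-stage job in question rendered as a Scala sketch (the original snippet uses the Java API's mapToPair); the paths and key extraction are placeholders:

    import org.apache.spark.SparkContext

    // Stage 1 covers the read and the map side of the shuffle; reduceByKey
    // introduces the shuffle boundary; stage 2 covers the reduce side and
    // the final write.
    def run(sc: SparkContext): Unit = {
      sc.textFile("hdfs:///input")
        .map(line => (line.split("\t")(0), 1))
        .reduceByKey(_ + _)
        .saveAsTextFile("hdfs:///output")
    }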

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* that all 1.6.0 users upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546 <

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Cody Koeninger
The central problem with doing anything like this is that you break one of the basic guarantees of Kafka, which is in-order processing on a per-topic-partition basis. As far as PRs go, because of the new consumer interface for Kafka 0.9 and 0.10, there's a lot of potential change already underway.
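For reference, a sketch of the 1:1 mapping at issue, using the Kafka 0.8 direct-stream API current at the time; the broker address and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    // With the direct stream, each Kafka topic-partition maps to exactly
    // one RDD partition, which is what preserves per-partition ordering.
    def directStream(ssc: StreamingContext) = {
      val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))
      // Calling stream.repartition(n) here would rebalance load but break
      // the per-topic-partition ordering guarantee discussed above.
      stream
    }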

DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Renyi Xiong
Hi TD, Thanks a lot for offering to look at our PR (if we file one) at the conference in NYC. As we briefly discussed the issues of unbalanced and under-distributed Kafka partitions when developing a Spark Streaming application in Mobius (C# for Spark), we're trying the option of repartitioning within

Re: submissionTime vs batchTime, DirectKafka

2016-03-10 Thread Sachin Aggarwal
Hi, can this be considered a lag in the processing of events? Should we report this as a delay? On Thu, Mar 10, 2016 at 10:51 AM, Mario Ds Briggs wrote: > Look at > org.apache.spark.streaming.scheduler.JobGenerator > > it has a RecurringTimer (timer) that will simply post 'JobGenerate' > events to a E
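One way to observe the gap between batchTime and submissionTime is via a streaming listener; Spark Streaming already reports it as schedulingDelay on BatchInfo. A minimal sketch (the println is a placeholder for real reporting):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // schedulingDelay is the time from when a batch was due (batchTime)
    // to when its jobs were submitted, i.e. the lag asked about above.
    class DelayListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"batch ${info.batchTime}: scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms")
      }
    }
    // Register with: ssc.addStreamingListener(new DelayListener)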

Re: dataframe.groupby.agg vs sql("select from groupby)")

2016-03-10 Thread Reynold Xin
They should be identical. Can you paste the detailed explain output? On Thursday, March 10, 2016, FangFang Chen wrote: > hi, > Based on my testing, the memory cost is very different for > 1. sql("select * from ...").groupby.agg > 2. sql("select ... From ... Groupby ..."). > > For table.partition
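A sketch of the comparison being requested: both forms should compile to the same physical plan, which explain(true) prints. The table and column names are placeholders:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    // Print the analyzed, optimized, and physical plans for both variants;
    // if the plans match, the memory behavior should be identical too.
    def comparePlans(sqlContext: SQLContext): Unit = {
      sqlContext.sql("SELECT * FROM t").groupBy("k").agg(sum("v")).explain(true)
      sqlContext.sql("SELECT k, SUM(v) FROM t GROUP BY k").explain(true)
    }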

dataframe.groupby.agg vs sql("select from groupby)")

2016-03-10 Thread FangFang Chen
Hi, based on my testing, the memory cost is very different for: 1. sql("select * from ...").groupby.agg 2. sql("select ... from ... groupby ..."). For table partitions sized at more than 500 GB, #2 runs fine, while #1 hits OutOfMemory. I am using the same Spark configurations. Could somebody te