To Akhil's point, see the Tuning Data Structures section of that guide. Avoid
the standard collection HashMap; the guide suggests arrays of primitives or a
primitive-specialized library like fastutil instead.
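A rough sketch of what that section is getting at (assuming the fastutil
library is on the classpath; the map and key names here are just placeholders):

  import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

  // Instead of HashMap<String, Integer>, which boxes every value and
  // allocates an Entry object per key, a primitive-specialized map keeps
  // the per-entry overhead down:
  Object2IntOpenHashMap<String> counts = new Object2IntOpenHashMap<>();
  counts.addTo("some-key", 1);        // increment without boxing an Integer
  int n = counts.getInt("some-key");  // primitive return, no unboxing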

With fewer machines, try running 4 or 5 cores per executor and only
3-4 executors (1 per node), as described here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/.
That ought to reduce the shuffle performance hit (can someone else confirm?).
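Roughly like this when building the context (the numbers are illustrative, not
tuned for your workload, and spark.executor.instances assumes YARN):

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  SparkConf conf = new SparkConf()
      .setAppName("analytics-app")            // placeholder name
      .set("spark.executor.instances", "3")   // ~1 executor per node
      .set("spark.executor.cores", "5")       // not all 24 cores per box
      .set("spark.executor.memory", "80g");   // leave headroom for OS/overhead
  JavaSparkContext sc = new JavaSparkContext(conf);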

Re #7: see spark.sql.shuffle.partitions (default: 200).
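On the plain JavaPairRDD side you can also pin the post-shuffle partition
count yourself instead of inheriting a default; something like this, where
`pairs`, ValuePOJO, and the count of 48 are placeholders:

  import org.apache.spark.api.java.JavaPairRDD;

  // Pass an explicit partition count to the wide operation:
  JavaPairRDD<CustomPOJO, Iterable<ValuePOJO>> grouped = pairs.groupByKey(48);

  // Or set a cluster-wide default for shuffles that don't specify one:
  // spark.default.parallelism (and spark.sql.shuffle.partitions if you
  // later move to Spark SQL) when building the SparkConf.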

On Sun, Mar 29, 2015 at 7:57 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> Go through this once, if you haven't read it already.
> https://spark.apache.org/docs/latest/tuning.html
>
> Thanks
> Best Regards
>
> On Sat, Mar 28, 2015 at 7:33 PM, nsareen <nsar...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I'm facing performance issues with my Spark implementation, and while
>> briefly investigating the Web UI logs I noticed that my RDD size is 55 GB,
>> the Shuffle Write is 10 GB, and the Input Size is 200 GB. The application
>> is a web application that does predictive analytics, so we keep most of
>> our data in memory. This observation covers only 30 minutes of usage by a
>> single user. We anticipate at least 10-15 users of the application sending
>> requests in parallel, which makes me a bit nervous.
>>
>> One constraint we have is that we do not have many nodes in the cluster;
>> we may end up with 3-4 machines at best. They can, however, be scaled up
>> vertically (24 cores / 512 GB RAM each), which would allow us to build a
>> virtual 10-15 node cluster.
>>
>> Even then, the input size & shuffle write are too high for my liking. Any
>> suggestions in this regard will be greatly appreciated, as there aren't
>> many resources on the net for handling performance issues such as these.
>>
>> Some pointers on my application's data structures & design
>>
>> 1) The RDD is a JavaPairRDD whose key is a CustomPOJO containing 3-4
>> HashMaps and whose value contains 1 HashMap.
>> 2) Data is loaded via JDBCRDD during application startup, which also tends
>> to take a lot of time, since we massage the data once it is fetched from
>> the DB and then save it as a JavaPairRDD.
>> 3) Most of the data is structured, but we are still using JavaPairRDD; we
>> have not explored the option of Spark SQL yet.
>> 4) We have only one SparkContext which caters to all the requests coming
>> into the application from various users.
>> 5) During a single session, a user can send 3-4 parallel stages consisting
>> of Map / GroupBy / Join / Reduce, etc.
>> 6) We have to change the RDD structure using different types of groupBy
>> operations, since the user can drill down / drill up through the data
>> (aggregation at a higher / lower level). This is where we make use of
>> groupBy, but there is a cost associated with it.
>> 7) We have observed that the initial RDDs we create have 40-odd
>> partitions, but after some stage executions such as groupBy the partition
>> count increases to 200 or so. This was odd, and we haven't figured out why
>> it happens.
>>
>> In summary, we want to use Spark to give us the capability to process our
>> in-memory data structures very fast, as well as to scale to a larger
>> volume when required in the future.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
