Re: Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread andrew.rowson
Apologies, Outlook for Mac is ridiculous. Copying and pasting the original below: I’m running into a strange issue when trying to use a custom Log4j layout for Spark (1.6.1) on YARN (CDH). The layout is: https://github.com/michaeltandy/log4j-json If I use a log4j.properties file (supplied

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
Thanks for this tip. I ran it in yarn-client mode with driver-memory = 4G and took a dump once the heap got close to 4G. num #instances #bytes class name -- 1: 446169 3661137256 [J 2: 2032795 222636720

Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
I have a Spark 1.4.1 job on YARN where the first stage has ~149,000 tasks (it’s reading a few TB of data). The job itself is fairly simple: it just gets a list of distinct values: val days = spark .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
I should have mentioned: yes, I am using Kryo and have registered KeyClass and ValueClass. It’s not clear to me what is actually taking up space on the driver heap. I can’t see how it can be data, given the code that I have. On 27/08/2015 12:09, Ewan Leith ewan.le...@realitymine.com

Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread andrew.rowson
I've found a strange issue when trying to sort a lot of data in HDFS using Spark 1.2.0 (CDH 5.3.0). My data is in SequenceFiles and the key is a class that derives from BytesWritable (the value is also a BytesWritable). I'm using a custom KryoSerializer to serialize the underlying byte array