Re: Scala closure exceeds ByteArrayOutputStream limit (~2gb)

2017-08-22 Thread Mungeol Heo
Hello, Joel. Have you solved the problem caused by Java's 32-bit limit on array sizes? Thanks. On Wed, Jan 27, 2016 at 2:36 AM, Joel Keller wrote: > Hello, > > I am running RandomForest from mllib on a data-set which has very-high > dimensional data (~50k dimensions). > >

Re: JSON lib works differently in spark-shell and IDE like intellij

2017-04-05 Thread Mungeol Heo
, Apr 5, 2017 at 6:52 PM, Mungeol Heo <mungeol@gmail.com> wrote: > Hello, > > I am using "minidev" which is a JSON lib to remove duplicated keys in > JSON object. > > > minidev > >

JSON lib works differently in spark-shell and IDE like intellij

2017-04-05 Thread Mungeol Heo
Hello, I am using "minidev", a JSON lib, to remove duplicated keys in a JSON object. Dependency: net.minidev : json-smart : 2.3. Test code: import net.minidev.json.parser.JSONParser val badJson =
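
A minimal sketch of the duplicate-key removal being described; the badJson literal below is illustrative, since the original value is truncated in the post. json-smart's JSONObject is backed by a java.util.HashMap, so parsing and re-serializing keeps only one occurrence of each duplicated key.

    import net.minidev.json.JSONObject
    import net.minidev.json.parser.JSONParser

    // Illustrative input with a duplicated "id" key.
    val badJson = """{"id": 1, "id": 2, "name": "test"}"""

    // JSONObject extends java.util.HashMap, so duplicated keys collapse to a
    // single entry while parsing; re-serializing yields JSON without duplicates.
    val parser  = new JSONParser(JSONParser.MODE_PERMISSIVE)
    val cleaned = parser.parse(badJson).asInstanceOf[JSONObject].toJSONString
    println(cleaned)

If the result differs between spark-shell and the IDE, checking which json-smart version actually ends up on each classpath is a reasonable first step.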

Re: Need help for RDD/DF transformation.

2017-03-30 Thread Mungeol Heo
> > [1,2,3] > [1,4,5] > > ? > > On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo <mungeol@gmail.com> wrote: >> >> Hello Yong, >> >> First of all, thank you for your attention. >> Note that the values of elements, which have values in RDD/DF1, in the

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
com> wrote: > What is the desired result for > > > RDD/DF 1 > > 1, a > 3, c > 5, b > > RDD/DF 2 > > [1, 2, 3] > [4, 5] > > > Yong > > > From: Mungeol Heo <mungeol@gmail.com> > Sent: Wednes

Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello, Suppose I have two RDDs or data frames like the ones shown below. RDD/DF 1 1, a 3, a 5, b RDD/DF 2 [1, 2, 3] [4, 5] I need to create a new RDD/DF like the one below from RDD/DF 1 and 2. 1, a 2, a 3, a 4, b 5, b Is there an efficient way to do this? Any help will be great. Thank you.
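
One way to get there with DataFrames, sketched below under assumed column names (id, label, ids): give each array in RDD/DF 2 a group key, explode it to one row per id, pick up the label from RDD/DF 1, and join the per-group label back onto every exploded id.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("propagate-labels").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a"), (3, "a"), (5, "b")).toDF("id", "label")   // RDD/DF 1
    val df2 = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("ids")                // RDD/DF 2

    // One row per (group, id) pair.
    val exploded = df2
      .withColumn("group", monotonically_increasing_id())
      .withColumn("id", explode($"ids"))

    // The label of each group, taken from whichever of its ids appears in df1.
    val groupLabels = exploded.join(df1, Seq("id")).select($"group", $"label").distinct()

    // Attach that label to every id in the group.
    val result = exploded.join(groupLabels, Seq("group")).select($"id", $"label").orderBy($"id")
    result.show()   // 1,a  2,a  3,a  4,b  5,b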

How to clean the accumulator and broadcast from the driver manually?

2016-10-21 Thread Mungeol Heo
Hello, As I mentioned in the title, I want to know whether it is possible to clean the accumulator/broadcast from the driver manually, since the driver's memory keeps increasing. Someone says that the unpersist method removes them both from memory as well as disk on each executor node. But it stays on the
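
For broadcasts specifically there are two separate calls, sketched below with an illustrative value (sc is the SparkContext available in spark-shell): unpersist() only drops the copies cached on the executors, while destroy() also releases the driver-side state. Accumulators have no explicit removal call; they are cleaned up by the ContextCleaner once the driver no longer holds a reference to them.

    // A minimal sketch; the broadcast value is illustrative.
    val bc = sc.broadcast(Array(1, 2, 3))

    bc.unpersist()   // drops the cached copies on the executors only;
                     // the driver keeps the value and can re-broadcast it

    bc.destroy()     // releases all state on the executors and the driver;
                     // any use of bc after this point throws an exception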

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mungeol Heo
ata on disk (e.g. as part of a checkpoint > or explicit storage), then there can be substantial I/O activity. > > > > > > > > From: Xi Shen <davidshe...@gmail.com> > Date: Monday, October 17, 2016 at 2:54 AM > To: Divya Gehlot <divya.htco...@gmail.com>, Munge

Is spark a right tool for updating a dataframe repeatedly

2016-10-16 Thread Mungeol Heo
Hello, everyone. As I mentioned in the title, I wonder whether Spark is the right tool for updating a data frame repeatedly until there is no more data to update. For example: while (there was an update) { update data frame A }. If it is the right tool, then what is the best practice for this
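
Spark can express that loop on the driver side; what matters is giving it a concrete stop condition. A minimal sketch of the loop itself, with a placeholder update step and placeholder data:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("iterative-update").getOrCreate()
    import spark.implicits._

    def applyUpdate(df: DataFrame): DataFrame = df   // placeholder for the real update rule

    var a: DataFrame = Seq((1, "x"), (2, "y")).toDF("id", "value")   // placeholder data
    var changed = true
    while (changed) {
      val next = applyUpdate(a)
      changed = next.except(a).count() > 0   // stop once a pass changes nothing
      a = next
    }

If the number of iterations is large, the lineage of A grows every pass; materializing it periodically (checkpoint, or write and re-read) keeps the plan from ballooning.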

[1.6.0] Skipped stages keep increasing and causes OOM finally

2016-10-13 Thread Mungeol Heo
Hello, My task is updating a dataframe in a while loop until there is no more data to update. The Spark SQL I used is like the code below: val hc = sqlContext hc.sql("use person") var temp_pair = hc.sql(""" select ROW_NUMBER() OVER (ORDER
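
Each pass through such a loop extends the lineage of temp_pair, so the stage graph (and the list of skipped stages shown in the UI) grows until the driver runs out of memory. A workaround that works on 1.6, sketched below with placeholder queries and paths (the real ones are truncated in the post), is to materialize the intermediate result each iteration and re-read it, so the next pass starts from a short lineage; on Spark 2.1+ Dataset.checkpoint() achieves the same in one call.

    val hc = sqlContext
    hc.sql("use person")

    var temp_pair = hc.sql("select * from some_table")          // placeholder seed query
    temp_pair.registerTempTable("temp_pair")

    var iteration = 0
    var moreToUpdate = true
    while (moreToUpdate) {
      val updated = hc.sql("select * from temp_pair")            // placeholder update step
      moreToUpdate = updated.except(temp_pair).count() > 0       // stop when a pass changes nothing

      val path = s"/tmp/temp_pair_checkpoint_$iteration"         // placeholder location
      updated.write.mode("overwrite").parquet(path)              // materialize this pass
      temp_pair = hc.read.parquet(path)                          // re-read: lineage starts fresh
      temp_pair.registerTempTable("temp_pair")                   // refresh the temp table for the next pass
      iteration += 1
    }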

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try setting yarn.scheduler.capacity.resource-calculator to the dominant resource calculator, then check again. On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Using the dominant resource calculator instead of the default resource calculator will > get the expected vcores as you wanted. Basically by default
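
Concretely, that property lives in capacity-scheduler.xml on the ResourceManager; the value below is the dominant resource calculator being referred to (the default, DefaultResourceCalculator, only accounts for memory, which is why containers show a single vcore). After changing it, refresh the queues or restart the ResourceManager so the scheduler picks it up.

    <!-- capacity-scheduler.xml -->
    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>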

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try to turn "yarn.scheduler.capacity.resource-calculator" on On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Use dominant resource calculator instead of default resource calculator will > get the expected vcores as you wanted. Basically by default yarn does not >

How to improve the performance for writing a data frame to a JDBC database?

2016-07-08 Thread Mungeol Heo
Hello, I am trying to write a data frame to a JDBC database, like SQL Server, using Spark 1.6.0. The problem is that "write.jdbc(url, table, connectionProperties)" is too slow. Is there any way to improve the performance/speed, e.g. options like partitionColumn, lowerBound, upperBound, numPartitions
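
partitionColumn / lowerBound / upperBound / numPartitions only apply when reading over JDBC; for writes, each partition of the DataFrame is written over its own connection, so repartitioning before the write is the main lever on 1.6. A minimal sketch with placeholder connection details and data (later Spark releases additionally accept a "batchsize" connection property to control rows per batch insert):

    import java.util.Properties

    // Placeholder connection details and data.
    val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb"
    val connectionProperties = new Properties()
    connectionProperties.setProperty("user", "username")
    connectionProperties.setProperty("password", "password")

    val df = sqlContext.range(0, 1000000).toDF("id")   // placeholder DataFrame

    // Each partition opens its own JDBC connection, so the partition count
    // controls how many concurrent batch inserts run against the database.
    df.repartition(8)
      .write
      .mode("append")
      .jdbc(url, "dbo.my_table", connectionProperties)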

Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
s is that this is the case > you're seeing. A population of N=1 still has a standard deviation of > course (which is 0). > > On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo <mungeol@gmail.com> wrote: >> I know stddev_samp and stddev_pop give different values, because they

Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo

stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
Hello, As I mentioned in the title, the stddev_samp function gives a NaN while stddev_pop gives a numeric value on the same data. The stddev_samp function will give a numeric value if I cast it to decimal, e.g. cast(stddev_samp(column_name) as decimal(16,3)). Is it a bug? Thanks - mungeol
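
This matches the case Sean points to in his reply: with a single value in a group, the sample formula divides by n - 1 = 0 and yields NaN, while the population formula divides by n = 1 and yields 0. A minimal sketch reproducing it (Spark 2-style API with illustrative names; on 1.6, sqlContext.sql and registerTempTable play the same role):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("stddev-nan").getOrCreate()
    import spark.implicits._

    // A group with a single value: stddev_samp divides by n - 1 = 0.
    Seq(("g1", 10.0)).toDF("grp", "value").createOrReplaceTempView("t")

    spark.sql("""
      select grp,
             stddev_samp(value) as samp,                            -- NaN
             stddev_pop(value)  as pop,                             -- 0.0
             cast(stddev_samp(value) as decimal(16,3)) as samp_dec  -- the cast from the post
      from t
      group by grp
    """).show()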