Re: Scala closure exceeds ByteArrayOutputStream limit (~2gb)

2017-08-22 Thread Mungeol Heo
Hello, Joel.

Have you solved the problem caused by Java's 32-bit limit on array sizes?

Thanks.

On Wed, Jan 27, 2016 at 2:36 AM, Joel Keller  wrote:
> Hello,
>
> I am running RandomForest from mllib on a data-set which has very-high
> dimensional data (~50k dimensions).
>
> I get the following stack trace:
>
> 16/01/22 21:52:48 ERROR ApplicationMaster: User class threw exception:
> java.lang.OutOfMemoryError
> java.lang.OutOfMemoryError
> at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
> at
> org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:624)
> at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
> at
> org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
> at
> org.apache.spark.mllib.tree.RandomForest.trainClassifier(RandomForest.scala)
> at com.miovision.cv.spark.CifarTrainer.main(CifarTrainer.java:108)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
>
>
>
> I have determined that the problem is that when the ClosureCleaner checks
> that a closure is serializable (ensureSerializable), it serializes the
> closure to an underlying Java byte array, which is limited to about 2 GB
> (because array indices are signed 32-bit ints).
>
> I believe that the closure has grown very large due to the high number of
> features (dimensions), and the statistics that must be collected for them.
>
>
> Does anyone know if there is a way to make mllib's RandomForest
> implementation limit the size here so that serialized closures do not
> exceed 2 GB, or alternatively, is there a way to allow Spark to work
> with such a large closure?
>
>
> I am running this training on a very large cluster of very large machines,
> so RAM is not the problem here. The problem is Java's 32-bit limit on array
> sizes.
>
>
> Thanks,
>
> Joel
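
As a general diagnostic (not specific to the RandomForest internals), the
serialized size of a closure or any other object can be measured with plain
Java serialization, the same mechanism the JavaSerializer in the stack trace
uses (an ObjectOutputStream over a ByteArrayOutputStream). A minimal sketch,
with the object to measure left as a placeholder:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize an object the way Java serialization does and report its size,
// to see how close it gets to the ~2 GB byte-array ceiling (Int.MaxValue).
def serializedSizeOf(obj: AnyRef): Long = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.size().toLong
}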




Re: JSON lib works differently in spark-shell and IDE like intellij

2017-04-05 Thread Mungeol Heo
It works with spark-submit if the configuration shown below is placed under
the maven-shade-plugin.

<relocations>
  <relocation>
    <pattern>net.minidev</pattern>
    <shadedPattern>shaded.net.minidev</shadedPattern>
  </relocation>
</relocations>

Still, I need a way to make it work with spark-shell for testing purposes.
Any idea would be great.
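
For reference, one quick way to check which copy of the library spark-shell
actually loaded (a common cause of behavior differences between an IDE and
spark-shell is a different json-smart version already on Spark's classpath;
that is an assumption here, not something verified in this thread) is:

// Prints the jar the class was loaded from; run it in both environments.
// (getCodeSource can be null for bootstrap classes, so guard if needed.)
println(classOf[net.minidev.json.parser.JSONParser]
  .getProtectionDomain.getCodeSource.getLocation)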

Thank you.

On Wed, Apr 5, 2017 at 6:52 PM, Mungeol Heo  wrote:
> Hello,
>
> I am using "minidev", a JSON library, to remove duplicated keys in a
> JSON object.
>
> Maven dependency:
>
> <dependency>
>   <groupId>net.minidev</groupId>
>   <artifactId>json-smart</artifactId>
>   <version>2.3</version>
> </dependency>
>
> Test Code:
>
> import net.minidev.json.parser.JSONParser
> val badJson = "{\"keyA\":\"valueA\",\"keyB\":\"valueB\",\"keyA\":\"valueA\"}"
> val json = new JSONParser(JSONParser.MODE_PERMISSIVE).parse(badJson.toLowerCase())
> println(json)
>
> The source code above works in an IDE like IntelliJ,
> but it gives an error in spark-shell:
>
> Error:
>
> net.minidev.json.parser.ParseException: Unexpected duplicate key:keya
> at position 33.
>
> BTW, both the IDE and spark-shell are using the same version of Scala,
> which is 2.11.8, and, of course, the same version of "minidev".
>
> Any help will be great.
> Thank you.




JSON lib works differently in spark-shell and IDE like intellij

2017-04-05 Thread Mungeol Heo
Hello,

I am using "minidev", a JSON library, to remove duplicated keys in a
JSON object.


Maven dependency:

<dependency>
  <groupId>net.minidev</groupId>
  <artifactId>json-smart</artifactId>
  <version>2.3</version>
</dependency>


Test Code:

import net.minidev.json.parser.JSONParser
val badJson = "{\"keyA\":\"valueA\",\"keyB\":\"valueB\",\"keyA\":\"valueA\"}"
val json = new JSONParser(JSONParser.MODE_PERMISSIVE).parse(badJson.toLowerCase())
println(json)



The source code above works in an IDE like IntelliJ,
but it gives an error in spark-shell:


Error:

net.minidev.json.parser.ParseException: Unexpected duplicate key:keya
at position 33.



BTW, both the IDE and spark-shell are using the same version of Scala,
which is 2.11.8, and, of course, the same version of "minidev".

Any help will be great.
Thank you.




Re: Need help for RDD/DF transformation.

2017-03-30 Thread Mungeol Heo
Hello ayan,

The same key will not exist in different lists.
That means, if "1" exists in one list, it will not be present in
another list.

Thank you.
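
Given that constraint, one possible way to do the transformation described in
the quoted thread below is to explode the lists, join on the key, and spread
the known value over each list. A sketch using the Spark 2.x DataFrame API,
where the DataFrame and column names (df1, df2, id, value, ids) are
illustrative assumptions:

import org.apache.spark.sql.functions._

// df1: (id, value) pairs, e.g. (1, "a"), (3, "a"), (5, "b")
// df2: one array column "ids" per row, e.g. [1, 2, 3] and [4, 5]
val exploded = df2
  .withColumn("groupId", monotonically_increasing_id()) // tag each list
  .withColumn("id", explode(col("ids")))                // one row per element

// attach the known values, then pick one value per list
val groupValue = exploded
  .join(df1, Seq("id"), "left_outer")
  .groupBy("groupId")
  .agg(first(col("value"), ignoreNulls = true).as("value"))

// every element of a list gets its list's value
val result = exploded
  .join(groupValue, Seq("groupId"))
  .select(col("id"), col("value"))

The first(..., ignoreNulls = true) step relies on the stated constraint that
all known elements of a list carry the same value.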

On Thu, Mar 30, 2017 at 3:56 PM, ayan guha  wrote:
> Is it possible for one key in 2 groups in rdd2?
>
> [1,2,3]
> [1,4,5]
>
> ?
>
> On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo  wrote:
>>
>> Hello Yong,
>>
>> First of all, thank you for your attention.
>> Note that the elements in the same list which have values in RDD/DF 1
>> will always have the same value.
>> Therefore, "1" and "3", which come from RDD/DF 1, will always have the
>> same value, which is "a".
>>
>> The goal here is to assign the same value to the elements of a list
>> which do not exist in RDD/DF 1, so that all the elements in the same
>> list have the same value.
>>
>> Or, the final RDD/DF also can be like this,
>>
>> [1, 2, 3], a
>> [4, 5], b
>>
>> Thank you again.
>>
>> - Mungeol
>>
>>
>> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang  wrote:
>> > What is the desired result for
>> >
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, c
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> >
>> > Yong
>> >
>> > 
>> > From: Mungeol Heo 
>> > Sent: Wednesday, March 29, 2017 5:37 AM
>> > To: user@spark.apache.org
>> > Subject: Need help for RDD/DF transformation.
>> >
>> > Hello,
>> >
>> > Suppose I have two RDDs or data frames like those shown below.
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, a
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> > I need to create a new RDD/DF like below from RDD/DF 1 and 2.
>> >
>> > 1, a
>> > 2, a
>> > 3, a
>> > 4, b
>> > 5, b
>> >
>> > Is there an efficient way to do this?
>> > Any help will be great.
>> >
>> > Thank you.
>> >
>> >
>>
>>
> --
> Best Regards,
> Ayan Guha




Re: Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello Yong,

First of all, thank you for your attention.
Note that the elements in the same list which have values in RDD/DF 1
will always have the same value.
Therefore, "1" and "3", which come from RDD/DF 1, will always have the
same value, which is "a".

The goal here is to assign the same value to the elements of a list
which do not exist in RDD/DF 1, so that all the elements in the same
list have the same value.

Or, the final RDD/DF also can be like this,

[1, 2, 3], a
[4, 5], b

Thank you again.

- Mungeol


On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang  wrote:
> What is the desired result for
>
>
> RDD/DF 1
>
> 1, a
> 3, c
> 5, b
>
> RDD/DF 2
>
> [1, 2, 3]
> [4, 5]
>
>
> Yong
>
> 
> From: Mungeol Heo 
> Sent: Wednesday, March 29, 2017 5:37 AM
> To: user@spark.apache.org
> Subject: Need help for RDD/DF transformation.
>
> Hello,
>
> Suppose I have two RDDs or data frames like those shown below.
>
> RDD/DF 1
>
> 1, a
> 3, a
> 5, b
>
> RDD/DF 2
>
> [1, 2, 3]
> [4, 5]
>
> I need to create a new RDD/DF like below from RDD/DF 1 and 2.
>
> 1, a
> 2, a
> 3, a
> 4, b
> 5, b
>
> Is there an efficient way to do this?
> Any help will be great.
>
> Thank you.
>
>




Need help for RDD/DF transformation.

2017-03-29 Thread Mungeol Heo
Hello,

Suppose I have two RDDs or data frames like those shown below.

RDD/DF 1

1, a
3, a
5, b

RDD/DF 2

[1, 2, 3]
[4, 5]

I need to create a new RDD/DF like below from RDD/DF 1 and 2.

1, a
2, a
3, a
4, b
5, b

Is there an efficient way to do this?
Any help will be great.

Thank you.




How to clean the accumulator and broadcast from the driver manually?

2016-10-21 Thread Mungeol Heo
Hello,

As I mentioned in the title, I want to know whether it is possible to clean
accumulators/broadcasts from the driver manually, since the driver's
memory keeps increasing.

Someone says that the unpersist method removes them from both memory and
disk on each executor node, but the data stays on the driver node so it
can be re-broadcast.

If that is true, how can I solve the "driver's memory keeps increasing" issue?

Any help will be GREAT!
Thank you.
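
For broadcast variables specifically, the Broadcast handle exposes two levels
of cleanup; a minimal sketch, where bigLookupTable is just a placeholder:

// unpersist() drops the copies cached on the executors but keeps the data on
// the driver, so the variable can be re-broadcast; destroy() also removes the
// driver-side data and metadata, after which the broadcast can no longer be used.
val bc = sc.broadcast(bigLookupTable)

// ... use bc.value inside tasks ...

bc.unpersist(blocking = true)
bc.destroy()

Accumulators have no equivalent explicit API; as far as I understand they are
cleaned up by the ContextCleaner once the driver no longer holds a reference
to them, so dropping references is the practical lever there.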




Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mungeol Heo
First of all, thank you for your comments.
Actually, what I mean by "update" is generating a new data frame with modified data.
A more detailed version of the while loop looks something like this:

var continue = 1
var dfA = "a data frame"
dfA.persist

while (continue > 0) {
  val temp = "modified dfA"
  temp.persist
  temp.count
  dfA.unpersist

  dfA = "modified temp"
  dfA.persist
  dfA.count
  temp.unpersist

  if ("dfA is not modified") {
    continue = 0
  }
}

The problem is that it eventually causes an OOM.
Also, the number of skipped stages increases every time, although I am
not sure whether that is what causes the OOM.
Maybe I need to check the source code of one of the Spark ML algorithms.
Again, thank you all.
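
One thing that may be worth trying for this pattern (a hedged suggestion, not
verified against this exact job): the growing number of skipped stages
suggests the logical plan/lineage of dfA keeps growing with every iteration,
and the driver has to track all of it. On Spark 1.6, a commonly used
workaround is to cut the lineage every few iterations by checkpointing the
underlying RDD and rebuilding the DataFrame from it, roughly like this, where
the checkpoint directory and the iteration counter/cadence are placeholders:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

// inside the while loop, e.g. every 10th pass:
if (iteration % 10 == 0) {
  val rdd = dfA.rdd
  rdd.checkpoint()                                   // truncate the RDD lineage
  dfA = sqlContext.createDataFrame(rdd, dfA.schema)  // fresh DataFrame, short plan
  dfA.persist
  dfA.count                                          // materializes the checkpoint
}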


On Mon, Oct 17, 2016 at 10:54 PM, Thakrar, Jayesh
 wrote:
> Yes, iterating over a dataframe and making changes is not uncommon.
>
> Of course RDDs, dataframes and datasets are immutable, but there is some
> optimization in the optimizer that can potentially help to dampen the
> effect/impact of creating a new rdd, df or ds.
>
> Also, the use-case you cited is similar to what is done in regression,
> clustering and other algorithms.
>
> I.e. you iterate making a change to a dataframe/dataset until the desired
> condition.
>
> E.g. see this -
> https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
> and the setting of the iteration ceiling
>
>
>
> // instantiate the base classifier
> val classifier = new LogisticRegression()
>   .setMaxIter(params.maxIter)
>   .setTol(params.tol)
>   .setFitIntercept(params.fitIntercept)
>
>
>
> Now the impact of that depends on a variety of things.
>
> E.g. if the data is completely contained in memory and there is no spill
> over to disk, it might not be a big issue (of course there will still be
> memory, CPU and network overhead/latency).
>
> If you are looking at storing the data on disk (e.g. as part of a checkpoint
> or explicit storage), then there can be substantial I/O activity.
>
>
>
>
>
>
>
> From: Xi Shen 
> Date: Monday, October 17, 2016 at 2:54 AM
> To: Divya Gehlot , Mungeol Heo
> 
> Cc: "user @spark" 
> Subject: Re: Is spark a right tool for updating a dataframe repeatedly
>
>
>
> I think most of the "big data" tools, like Spark and Hive, are not designed
> to edit data. They are only designed to query data. I wonder in what
> scenario you need to update a large volume of data repeatedly.
>
>
>
>
>
> On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
> wrote:
>
> If my understanding of your query is correct:
>
> In Spark, DataFrames are immutable; you can't update a DataFrame.
>
> You have to create a new DataFrame to "update" the current one.
>
>
>
>
>
> Thanks,
>
> Divya
>
>
>
>
>
> On 17 October 2016 at 09:50, Mungeol Heo  wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool
> for updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while (there was an update) {
>   update data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
>
>
>
> --
>
>
> Thanks,
> David S.




Is spark a right tool for updating a dataframe repeatedly

2016-10-16 Thread Mungeol Heo
Hello, everyone.

As I mentioned in the title, I wonder whether Spark is the right tool
for updating a data frame repeatedly until there is no more data to
update.

For example.

while (there was an update) {
  update data frame A
}

If it is the right tool, then what is the best practice for this kind of work?
Thank you.




[1.6.0] Skipped stages keep increasing and causes OOM finally

2016-10-13 Thread Mungeol Heo
Hello,

My task is updating a dataframe in a while loop until there is no more data
to update.
The Spark SQL I used is shown below:



val hc = sqlContext
hc.sql("use person")

var temp_pair = hc.sql("""
  select ROW_NUMBER() OVER (ORDER BY PID) AS pair
       , pid
       , actionchanneluserid
  from fdt_pid_channel_info
  where dt = '2016-09-27'
  and actionchanneltype = 2
""").repartition(200)
temp_pair.persist.registerTempTable("temp_pair")

var result = 1.0

while (result > 0) {
  val temp1 = hc.sql("""
    select B.PAIR as minpair, A.*
    FROM TEMP_PAIR A
    INNER JOIN (
      SELECT pid, MIN(PAIR) AS PAIR
      FROM TEMP_PAIR
      GROUP BY pid) B
    ON A.pid = B.pid
    WHERE A.PAIR > B.PAIR
  """)
  temp1.persist.registerTempTable("temp1")

  result = temp1.count

  if (temp1.count > 0) {
    val temp = temp_pair
      .except(hc.sql("select pair, pid, actionchanneluserid from temp1"))
      .unionAll(hc.sql("select minpair, pid, actionchanneluserid from temp1"))
      .coalesce(200)
    temp.persist
    temp.count
    temp_pair.unpersist
    temp_pair = temp
    temp_pair.registerTempTable("temp_pair")
  }

  temp1.unpersist

  val temp2 = hc.sql("""
    select B.PAIR as minpair, A.*
    FROM TEMP_PAIR A
    INNER JOIN (
      SELECT actionchanneluserid, MIN(PAIR) AS PAIR
      FROM TEMP_PAIR
      GROUP BY actionchanneluserid) B
    ON A.actionchanneluserid = B.actionchanneluserid
    WHERE A.PAIR > B.PAIR
  """)
  temp2.persist.registerTempTable("temp2")

  result = result + temp2.count

  if (temp2.count > 0) {
    val temp = temp_pair
      .except(hc.sql("select pair, pid, actionchanneluserid from temp2"))
      .unionAll(hc.sql("select minpair, pid, actionchanneluserid from temp2"))
      .coalesce(200)
    temp.persist
    temp.count
    temp_pair.unpersist
    temp_pair = temp
    temp_pair.registerTempTable("temp_pair")
  }

  temp2.unpersist
}

=

This job causes the number of skipped stages to keep increasing, and finally
"java.lang.OutOfMemoryError: Java heap space" is thrown.

Is there any way to avoid this kind of situation?
Any help will be great!
Thank you
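
For what it's worth, the loop above is essentially computing connected
components over the (pid, actionchanneluserid) pairs, and GraphX already
provides that, which avoids the hand-rolled iteration entirely. A hedged
sketch for Spark 1.6, assuming both id columns are numeric (bigint) and their
value ranges do not collide (otherwise offset or re-index one of them first):

import org.apache.spark.graphx.{Edge, Graph}

// one edge per (pid, actionchanneluserid) pair from the source table
val edges = hc.sql(
    "select pid, actionchanneluserid from fdt_pid_channel_info " +
    "where dt = '2016-09-27' and actionchanneltype = 2")
  .rdd
  .map(r => Edge(r.getLong(0), r.getLong(1), ()))  // adjust getters to the real types

val graph = Graph.fromEdges(edges, defaultValue = ())

// (vertexId, componentId): every pid/userid in the same group gets the same
// componentId, which plays the role of the "minpair" the SQL loop converges to.
val components = graph.connectedComponents().vertices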


Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try setting "yarn.scheduler.capacity.resource-calculator" to the DominantResourceCalculator, then check again.

On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao  wrote:
> Using the dominant resource calculator instead of the default resource
> calculator will get you the expected vcores. Basically, by default YARN does
> not honor CPU cores as a resource, so you will always see vcores = 1 no
> matter what number of cores you set in Spark.
>
> On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna
>  wrote:
>>
>> Hi All,
>>
>> I am trying to run a Spark job using YARN, and I specify the
>> --executor-cores value as 20.
>> But when I check the "nodes of the cluster" page at
>> http://hostname:8088/cluster/nodes, I see 4 containers getting created
>> on each node in the cluster.
>>
>> However, only 1 vcore gets assigned to each container, even when I
>> specify --executor-cores 20 while submitting the job with spark-submit.
>>
>> yarn-site.xml
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-mb</name>
>>   <value>6</value>
>> </property>
>> <property>
>>   <name>yarn.scheduler.minimum-allocation-vcores</name>
>>   <value>1</value>
>> </property>
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>>   <value>40</value>
>> </property>
>> <property>
>>   <name>yarn.nodemanager.resource.memory-mb</name>
>>   <value>7</value>
>> </property>
>> <property>
>>   <name>yarn.nodemanager.resource.cpu-vcores</name>
>>   <value>20</value>
>> </property>
>>
>>
>> Did anyone face the same issue??
>>
>> Regards,
>> Satyajit.
>
>




Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try setting "yarn.scheduler.capacity.resource-calculator" to the DominantResourceCalculator.

On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao  wrote:
> Using the dominant resource calculator instead of the default resource
> calculator will get you the expected vcores. Basically, by default YARN does
> not honor CPU cores as a resource, so you will always see vcores = 1 no
> matter what number of cores you set in Spark.
>
> On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna
>  wrote:
>>
>> Hi All,
>>
>> I am trying to run a Spark job using YARN, and I specify the
>> --executor-cores value as 20.
>> But when I check the "nodes of the cluster" page at
>> http://hostname:8088/cluster/nodes, I see 4 containers getting created
>> on each node in the cluster.
>>
>> However, only 1 vcore gets assigned to each container, even when I
>> specify --executor-cores 20 while submitting the job with spark-submit.
>>
>> yarn-site.xml
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-mb</name>
>>   <value>6</value>
>> </property>
>> <property>
>>   <name>yarn.scheduler.minimum-allocation-vcores</name>
>>   <value>1</value>
>> </property>
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>>   <value>40</value>
>> </property>
>> <property>
>>   <name>yarn.nodemanager.resource.memory-mb</name>
>>   <value>7</value>
>> </property>
>> <property>
>>   <name>yarn.nodemanager.resource.cpu-vcores</name>
>>   <value>20</value>
>> </property>
>>
>>
>> Did anyone face the same issue??
>>
>> Regards,
>> Satyajit.
>
>




How to improve the performance for writing a data frame to a JDBC database?

2016-07-08 Thread Mungeol Heo
Hello,

I am trying to write a data frame to a JDBC database, such as SQL Server,
using Spark 1.6.0.
The problem is that "write.jdbc(url, table, connectionProperties)" is too slow.
Is there any way to improve the performance/speed?

E.g. something like the partitionColumn, lowerBound, upperBound, and
numPartitions options that exist in read.jdbc and read.format("jdbc").
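
Two knobs that commonly help, sketched below with placeholder values: the
number of partitions at write time controls how many JDBC connections insert
in parallel, and newer Spark versions (2.x) also expose a "batchsize" option
on the JDBC writer.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark_user")   // placeholder credentials
props.setProperty("password", "secret")

df.repartition(16)   // one JDBC connection per partition; tune for the target DB
  .write
  .mode("append")
  .jdbc(url, "target_table", props)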

Any help will be great!!!
Thank you

- mungeol




Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
The N is much bigger than 1 in my case.

Here is an example that describes my issue.
"select column1, stddev_samp(column2) from table1 group by column1" gives NaN
"select column1, cast(stddev_samp(column2) as decimal(16,3)) from
table1 group by column1" gives numeric values. e.g. 234.234
"select column1, stddev_pop(column2) from table1 group by column1"
gives numeric values. e.g. 123.123123123

The column1, column2, and table1 are same.
My guess is that the stddev_samp function returns a double whose computation
does not exactly match standard floating-point semantics in some cases, and
that is why Spark gives NaN.
It seems stddev_samp does not handle this case as well as stddev_pop does.
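
One way to narrow this down is to compute the sample and population formulas
by hand next to the built-ins, so the N-1 vs. N denominators are explicit. A
sketch using the placeholder table/column names from this thread:

// If the numerator under sqrt() comes out slightly negative because of
// floating-point rounding, sqrt() returns NaN, which is one plausible way
// the NaN could appear here.
val q = """
  select column1,
         sqrt((sum(pow(column2, 2)) - count(1) * pow(avg(column2), 2)) / (count(1) - 1)) as samp_by_hand,
         sqrt((sum(pow(column2, 2)) - count(1) * pow(avg(column2), 2)) /  count(1))      as pop_by_hand,
         stddev_samp(column2) as samp,
         stddev_pop(column2)  as pop
  from table1
  group by column1
"""
sqlContext.sql(q).show(false)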

On Thu, Jul 7, 2016 at 5:57 PM, Sean Owen  wrote:
> Sample standard deviation can't be defined in the case of N=1, because
> it has N-1 in the denominator. My guess is that this is the case
> you're seeing. A population of N=1 still has a standard deviation of
> course (which is 0).
>
> On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo  wrote:
>> I know stddev_samp and stddev_pop gives different values, because they
>> have different definition. What I want to know is why stddev_samp
>> gives "NaN", and not a numeric value.




Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
I know stddev_samp and stddev_pop give different values because they
have different definitions. What I want to know is why stddev_samp
gives NaN and not a numeric value.

On Thu, Jul 7, 2016 at 5:39 PM, Sean Owen  wrote:
> I don't think that's relevant here. The question is why would samp
> give a different result to pop, not the result of "stddev". Neither
> one is a 'correct' definition of standard deviation in the abstract;
> one or the other is correct depending on what standard deviation you
> are trying to measure.
>
> On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
>  wrote:
>> The correct STDDEV function is STDDEV_SAMP, not STDDEV_POP.
>>
>> You can actually work that one out yourself
>>
>>
>> BTW, Hive also gives a wrong value. This is what I reported back in April
>> about Hive giving an incorrect value.
>>
>> Both Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP. I also
>> did tests with Spark 1.6. Spark correctly points STDDEV to STDDEV_SAMP.
>>
>> The following query was used
>>
>> SELECT
>>   SQRT((SUM(POWER(AMOUNT_SOLD,2)) - (COUNT(1)*POWER(AVG(AMOUNT_SOLD),2))) / (COUNT(1)-1)) AS MYSTDDEV,
>>   STDDEV(amount_sold)      AS STDDEV,
>>   STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>   STDDEV_POP(amount_sold)  AS STDDEV_POP
>> FROM sales;
>>
>> The following is from running the above query on Hive where STDDEV -->
>> STDDEV_POP which is incorrect
>>
>>
>> +--------------------+---------------------+--------------------+---------------------+
>> |      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
>> +--------------------+---------------------+--------------------+---------------------+
>> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  | 260.72704617040444  |
>> +--------------------+---------------------+--------------------+---------------------+
>>
>> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
>> correct
>>
>> +--------------------+---------------------+--------------------+---------------------+
>> |      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
>> +--------------------+---------------------+--------------------+---------------------+
>> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  | 260.72704617042166  |
>> +--------------------+---------------------+--------------------+---------------------+
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>>
>> On 7 July 2016 at 09:29, Sean Owen  wrote:
>>>
>>> No, because these are different values defined differently. If you
>>> have 1 data point, the sample stdev is undefined while population
>>> stdev is defined. Refer to their definition.
>>>
>>> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo  wrote:
>>> > Hello,
>>> >
>>> > As I mentioned in the title, the stddev_samp function gives NaN while
>>> > stddev_pop gives a numeric value on the same data.
>>> > The stddev_samp function will give a numeric value if I cast it to
>>> > decimal.
>>> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
>>> > Is it a bug?
>>> >
>>> > Thanks
>>> >
>>> > - mungeol
>>> >
>>> >
>>>
>>>
>>




stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
Hello,

As I mentioned in the title, the stddev_samp function gives NaN while
stddev_pop gives a numeric value on the same data.
The stddev_samp function will give a numeric value if I cast it to decimal.
E.g. cast(stddev_samp(column_name) as decimal(16,3))
Is it a bug?

Thanks

- mungeol
