trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Andy Davidson
I am using Spark 1.5.1. I am running into some memory problems with a Java unit test. Yes, I could fix it by setting -Xmx (it's set to 1024M), however I want to better understand what is going on so I can write better code in the future. The test runs on a Mac, master="local[2]". I have a java unit

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Chris Fregly
here's a good article that sums it up, in my opinion: https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ basically, building apps with RDDs is like building apps with primitive JVM bytecode. haha. @richard: remember that even if you're currently writing

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Darren Govoni
I'll throw a thought in here. DataFrames are nice if your data is uniform and clean with a consistent schema. However, in many big data problems this is seldom the case. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Chris Fregly

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Annabel Melongo
Additionally, if you already have some valid SQL statements to process said data, then instead of reinventing the wheel with RDD functions, you can speed up implementation by using DataFrames along with those existing SQL statements. On Monday, December 28, 2015 5:37 PM, Darren Govoni
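
A minimal sketch of that pattern, assuming Spark 1.5, a DataFrame df already loaded, and an illustrative table name and query:

    // register the DataFrame so the existing SQL can run against it unchanged
    df.registerTempTable("events")   // table name is illustrative
    val result = sqlContext.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
    result.show()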

Re: Can't submit job to stand alone cluster

2015-12-28 Thread Ted Yu
Have you verified that the following file does exist ? /home/hadoop/git/scalaspark/./target/scala-2.10/cluster- incidents_2.10-1.0.jar Thanks On Mon, Dec 28, 2015 at 3:16 PM, Daniel Valdivia wrote: > Hi, > > I'm trying to submit a job to a small spark cluster running

Re: Can't submit job to stand alone cluster

2015-12-28 Thread vivek.meghanathan
+ if it exists, whether it has read permission for the user who tries to run the job. Regards Vivek On Tue, Dec 29, 2015 at 6:56 am, Ted Yu > wrote: Have you verified that the following file does exist ?

Re: partitioning json data in spark

2015-12-28 Thread Նարեկ Գալստեան
Well, I could try to do that, but the *partitionBy* method is anyway only supported for the Parquet format, even in Spark 1.5.1. Narek Narek Galstyan Նարեկ Գալստյան On 27 December 2015 at 21:50, Ted Yu wrote: > Is upgrading to 1.5.x a possibility for you ? > > Cheers > > On Sun,
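
For reference, the call under discussion, as a minimal sketch assuming Spark 1.4+, a DataFrame df, and an illustrative partition column "date" (whether non-Parquet sources honor partitionBy is exactly what this thread debates):

    // write the output partitioned by a column, using the Parquet source
    df.write
      .partitionBy("date")
      .parquet("hdfs:///path/to/output")   // hypothetical path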

returns empty result set when using TimestampType and NullType as StructType +DataFrame +Scala + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
SQL context available as sqlContext.
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.hive.orc._
scala> val hiveContext = new

how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread zhangjp
Hi all, I want to use SparkR or Spark MLlib to load a CSV file on HDFS and then calculate covariance. How do I do it? Thanks.

Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
Hi, I have an input data set which is a CSV file with date columns. My output will also be a CSV file, and I will be using this output CSV file for Hive table creation. I have a few queries: 1. I tried using a custom schema with Timestamp but it is returning an empty result set when querying the

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Yanbo Liang
Load csv file: df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true") Calculate covariance: cov <- cov(df, "col1", "col2") Cheers Yanbo 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>: > hi all, > I want to use sparkR or spark MLlib load csv
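
For comparison, the Scala DataFrame route for the covariance step, as a sketch assuming df has already been loaded (e.g. via the spark-csv package) and has numeric columns named col1 and col2 (illustrative names):

    // pairwise covariance of two columns; df.stat is available since Spark 1.4
    val c: Double = df.stat.cov("col1", "col2")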

Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Jeff Zhang
Setting spark.logConf to true in spark-defaults.conf will log the properties on the driver side. But it would only log the properties you set, not the properties with default values. On Mon, Dec 28, 2015 at 8:18 PM, alvarobrandon wrote: > Hello: > > I was wondering if its
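
The file-based route Jeff describes is a single line in conf/spark-defaults.conf (spark.logConf true); an equivalent sketch from code, with the app name purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // with spark.logConf=true, the explicitly set properties are printed to the
    // driver log when the SparkContext starts up
    val conf = new SparkConf()
      .setAppName("log-props-demo")
      .set("spark.logConf", "true")
    val sc = new SparkContext(conf)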

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-28 Thread Priya Ch
Chris, we are using Spark version 1.3.0. We have not set the spark.streaming.concurrentJobs parameter; it takes the default value. Vijay, from the stack trace it is evident that

Problem About Worker System.out

2015-12-28 Thread David John
I have used Spark 1.4 for 6 months. Thanks to all the members of this community for your great work. I have a question about a logging issue that I hope can be solved. The program is running under this configuration: YARN cluster, YARN-client mode. In Scala, writing code

Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Gaurav Kumar
Hi Ted, I am using Spark 1.5.2. Without repartition in the picture, it works exactly as it's supposed to. With repartition, I am guessing that when we call takeOrdered on train, it goes ahead and computes the RDD, which has repartitioning on it, and prints out the numbers. With the next call to

Is there anyway to log properties from a Spark application

2015-12-28 Thread alvarobrandon
Hello: I was wondering if it's possible to log properties from Spark applications like spark.yarn.am.memory, spark.driver.cores, spark.reducer.maxSizeInFlight without having to access the SparkConf object programmatically. I'm trying to find some kind of log file that has traces of the execution

Using Spark for high concurrent load tasks

2015-12-28 Thread Aliaksei Tsyvunchyk
Hello Spark community, We have a project where we want to use Spark as computation engine to perform calculations and return result via REST services. Working with Spark we have learned how to do things to make it work faster and finally optimize our code to produce results in acceptable time

Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Ted Yu
bq. the train and test have overlap in the numbers being outputted Can the call to repartition explain the above ? Which release of Spark are you using ? Thanks On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar wrote: > Hi, > > I noticed an inconsistent behavior when

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Eugene Morozov
Kendal, have you tried to reduce the number of partitions? -- Be well! Jean Morozov On Mon, Dec 28, 2015 at 9:02 AM, kendal wrote: > My driver is running OOM with my 4T data set... I don't collect any data to > driver. All what the program done is map - reduce - saveAsTextFile.

Re: Problem About Worker System.out

2015-12-28 Thread Saisai Shao
Stdout will not be sent back to the driver, no matter whether you use Scala or Java. You must be doing something wrong that makes you think it is expected behavior. On Mon, Dec 28, 2015 at 5:33 PM, David John wrote: > I have used Spark *1.4* for 6 months. Thanks all the

FW: Problem About Worker System.out

2015-12-28 Thread David John
Thanks. Can we use an slf4j/log4j logger to transfer our messages from a worker to the driver? I saw some discussions saying that we can use this code to transfer their messages: object Holder extends Serializable { @transient lazy val log = Logger.getLogger(getClass.getName) }
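
A sketch of how the Holder pattern above is typically used inside a transformation, assuming an existing RDD named rdd (illustrative); note that the messages land in each executor's log4j output rather than being shipped back to the driver:

    import org.apache.log4j.Logger

    object Holder extends Serializable {
      @transient lazy val log = Logger.getLogger(getClass.getName)
    }

    rdd.map { record =>
      Holder.log.info(s"processing $record")   // goes to the executor's log, not the driver's
      record
    }.count()   // any action triggers the logging on the executors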

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Chris Fregly
which version of spark is this? is there any chance that a single key - or set of keys - has a large number of values relative to the other keys (aka skew)? if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although I had some issues still with 1.5.1 in a similar

Spark DataFrame callUdf does not compile?

2015-12-28 Thread unk1102
Hi, I am trying to invoke a Hive UDF using dataframe.select(callUdf("percentile_approx", col("C1"), lit(0.25))) but it does not compile; however, the same call works in the Spark Scala console. I don't understand

Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Jeff Zhang
If you run it in yarn-client mode, it will be in the client-side log. If it is yarn-cluster mode, it will be logged in the AM container (the first container). On Mon, Dec 28, 2015 at 8:30 PM, Alvaro Brandon wrote: > Thanks for the swift response. > > I'm launching my

Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Alvaro Brandon
Thanks for the swift response. I'm launching my applications through YARN. Where will these properties be logged? I guess they won't be part of the YARN logs. 2015-12-28 13:22 GMT+01:00 Jeff Zhang : > set spark.logConf as true in spark-default.conf will log the property in > driver

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Umesh Kacha
Hi, thanks, but you understood the question incorrectly. First of all, I am passing the UDF name as a String, and if you look at the callUDF arguments it does not take a string as the first argument; if I use callUDF it throws an exception saying the percentile_approx function is not found. And another thing I mentioned is

Re: Stuck with DataFrame df.select("select * from table");

2015-12-28 Thread Annabel Melongo
Jean, try this: df.select("""select * from tmptable where x1 = '3.0'""").show(); Note: you have to use 3 double quotes as marked. On Friday, December 25, 2015 11:30 AM, Eugene Morozov wrote: Thanks for the comments, although the issue is not in limit()

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Umesh Kacha
Thanks, but I have tried everything. To confirm, I am writing the code below; if you can compile the following in Java with Spark 1.5.2 then great, otherwise nothing here is helpful, as I have been stumbling over this for the last few days. public class PercentileHiveApproxTestMain { public static void

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Andy Davidson
Hi Yanbo, I use spark-csv to load my data set. I work with both Java and Python. I would recommend you print the first couple of rows and also print the schema to make sure your data is loaded as you expect. You might find the following code example helpful. You may need to programmatically set
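
A sketch of the kind of sanity check Andy describes, in Scala, assuming the spark-csv package and a hypothetical HDFS path:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///path/to/data.csv")   // hypothetical path

    df.printSchema()   // verify the column names and types came out as expected
    df.show(5)         // eyeball the first few rows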

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Hamel Kothari
Would you mind sharing more of your code? I can't really see the code that well from the attached screenshot but it appears that "Lit" is capitalized. Not sure what this method actually refers to but the definition in functions.scala is lowercased. Even if that's not it, some more code would be

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Hamel Kothari
Also, if I'm reading correctly, it looks like you're calling "callUdf" when what you probably want is "callUDF" (notice the subtle capitalization difference). Docs:

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Hamel Kothari
If you scroll further down in the documentation, you will see that callUDF does have a version which takes (String, Column...) as arguments: callUDF
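
A minimal sketch of that overload, assuming Spark 1.5+, a HiveContext session where percentile_approx is available, and a DataFrame df with a numeric column "C1" (both hypothetical):

    import org.apache.spark.sql.functions.{callUDF, col, lit}

    // callUDF(name: String, cols: Column*) resolves the function by name at analysis
    // time, so Hive UDFs such as percentile_approx can be invoked this way
    val result = df.select(callUDF("percentile_approx", col("C1"), lit(0.25)))
    result.show()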

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Richard Eggert
One advantage of RDDs over DataFrames is that RDDs allow you to use your own data types, whereas DataFrames are backed by RDDs of Row objects, which are pretty flexible but don't give you much in the way of compile-time type checking. If you have an RDD of case class elements or JSON, then

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Michael Armbrust
Unfortunately, in 1.5 we didn't force operators to spill when they ran out of memory, so there is not a lot you can do. It would be awesome if you could test with 1.6 and see if things are any better. On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Chris Fregly
the 200 number looks strangely similar to the following default number of post-shuffle partitions, which is often left untuned: spark.sql.shuffle.partitions. Here's the property defined in the Spark source:
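
A sketch of how that property is typically tuned, with the value 8 purely illustrative:

    // change the post-shuffle partition count (default is 200)
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")
    // or, equivalently, in SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=8")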

what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Andy Davidson
Hi Michael, I'll try 1.6 and report back. The Javadoc does not say much about coalesce() or repartition(). When I use repartition() just before I save my output, everything runs as expected. I thought coalesce() was an optimized version of repartition() and should be used whenever we know we are

Can't submit job to stand alone cluster

2015-12-28 Thread Daniel Valdivia
Hi, I'm trying to submit a job to a small spark cluster running in stand alone mode, however it seems like the jar file I'm submitting to the cluster is "not found" by the worker nodes. I might have understood wrong, but I thought the Driver node would send this jar file to the worker nodes,

Re: Passing parameters to spark SQL

2015-12-28 Thread Aaron Jackson
Yeah, that's what I thought. In this specific case, I'm porting over some scripts from an existing RDBMS platform. I had been porting them (slowly) to in-code notation with python or scala. However, to expedite my efforts (and presumably theirs since I'm not doing this forever), I went down the

Re: partitioning json data in spark

2015-12-28 Thread Michael Armbrust
I don't think that's true (though if the docs are wrong, we should fix that). In Spark 1.5 we converted JSON to go through the same code path as Parquet. On Mon, Dec 28, 2015 at 12:20 AM, Նարեկ Գալստեան wrote: > Well, I could try to do that, > but *partitionBy *method is

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to

Re: ERROR server.TThreadPoolServer: Error occurred during processing of message

2015-12-28 Thread Dasun Hegoda
Anyone? On Sun, Dec 27, 2015 at 11:30 AM, Dasun Hegoda wrote: > I was able to figure out where the problem is exactly. It's spark. because > when I start the hiveserver2 manually and run query it work fine. but when > I try to access the hive through spark's thrift

[Spark 1.4.1] StructField for date column in CSV file while creating custom schema

2015-12-28 Thread Divya Gehlot
Hi, I am a newbie to Spark, my apologies for such a naive question. I am using Spark 1.4.1 and writing code in Scala. I have input data as a CSV file which I am parsing using the spark-csv package. I am creating a custom schema to process the CSV file. Now my query is which datatype, or can say
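
A sketch of a custom schema for a CSV with a date column, assuming the spark-csv package and illustrative column names ("id", "event_date"); whether the raw strings are actually parsed into these types depends on the spark-csv version and the date format, which is what the related Timestamp thread above runs into:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("event_date", DateType, nullable = true)   // use TimestampType if time-of-day matters
    ))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(schema)
      .load("hdfs:///path/to/input.csv")   // hypothetical path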

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Felix Cheung
Make sure you add the spark-csv package, as in this example, so that the source parameter in R's read.df will work: https://spark.apache.org/docs/latest/sparkr.html#from-data-sources _ From: Andy Davidson Sent: Monday, December 28,

map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread Divya Gehlot
Hi, I have an HDP 2.3.2 cluster installed in Amazon EC2. I want to update the IP address of spark.driver.appUIAddress, which is currently mapped to the private IP of EC2. Searched in the Spark config in Ambari, could find the spark.driver.appUIAddress property. Because of this private IP mapping, the Spark web UI

Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-28 Thread jiml
Hi, a couple-three things. First, is this a Gradle project? SBT? Regardless of the answer, convince yourself that you are getting this error from the command line before doing anything else. Eclipse is awesome, but it's also really glitchy; I have seen too many times recently where something funky

Re: Spark submit does automatically upload the jar to cluster?

2015-12-28 Thread jiml
That's funny I didn't delete that answer! I think I have two accounts crossing, here was the answer: I don't know if this is going to help, but I agree that some of the docs would lead one to believe that the Spark driver or master is going to spread your jars around for you. But there's other

Re: SPARK_CLASSPATH out, spark.executor.extraClassPath in?

2015-12-28 Thread jiml
I looked into this a lot more and posted an answer to a similar question on SO, but it's EC2 specific. Still might be some useful info in there and any comments/corrections/improvements would be greatly appreciated!

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Hyukjin Kwon
Hi Divya, Are you using or have you tried Spark CSV datasource https://github.com/databricks/spark-csv ? Thanks! 2015-12-28 18:42 GMT+09:00 Divya Gehlot : > Hi, > I have input data set which is CSV file where I have date columns. > My output will also be CSV file and

Re: map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread SparkUser
Wouldn't Amazon Elastic IP do this for you? http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html On 12/28/2015 10:58 PM, Divya Gehlot wrote: Hi, I have HDP2.3.2 cluster installed in Amazon EC2. I want to update the IP adress of spark.driver.appUIAddress,which is

Re: what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Hyukjin Kwon
Hi Andy, This link explains the difference well: https://bzhangusc.wordpress.com/2015/08/11/repartition-vs-coalesce/ Simply, the difference is whether it shuffles the partitions or not. Actually, coalesce() with shuffling performs exactly like repartition(). On 29 Dec 2015 08:10, "Andy
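
A sketch of that difference in code, assuming an existing RDD named rdd (illustrative) and a target of 10 partitions:

    val narrowed = rdd.coalesce(10)                   // narrow dependency, no shuffle; can only reduce the partition count
    val shuffled = rdd.coalesce(10, shuffle = true)   // forces a shuffle
    val balanced = rdd.repartition(10)                // alias for coalesce(10, shuffle = true); can increase or decrease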

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread zhangjp
Now I have a huge number of columns, about 5k-20k, so if I want to calculate the covariance matrix, which is the best or most common method? ------ From: "Felix Cheung"; Sent: 2015-12-29 (Tue) 12:45; To: "Andy

RE: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Sun, Rui
Spark does not support computing the cov matrix now, but there is a PR for it. Maybe you can try it: https://issues.apache.org/jira/browse/SPARK-11057 From: zhangjp [mailto:592426...@qq.com] Sent: Tuesday, December 29, 2015 3:21 PM To: Felix Cheung; Andy Davidson; Yanbo Liang Cc: user Subject: Re:
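
On the Scala MLlib side, RowMatrix does expose a covariance computation; a minimal sketch, assuming a DataFrame df whose columns are all numeric (illustrative) and that the resulting k x k local matrix fits in driver memory, which gets tight toward 20k columns:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // one dense vector per row; the conversion below assumes every column is numeric
    val rows = df.rdd.map(r => Vectors.dense(r.toSeq.map(_.toString.toDouble).toArray))

    val mat = new RowMatrix(rows)
    val covariance = mat.computeCovariance()   // returns a local k x k Matrix on the driver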