Re: Behind the scene of RDD to DataFrame

2016-02-20 Thread Hemant Bhanawat
toDF internally calls sqlContext.createDataFrame, which transforms the RDD to an RDD[InternalRow]. This RDD[InternalRow] is then mapped to a DataFrame. Type conversions (from Scala types to Catalyst types) are involved, but no shuffling. Hemant Bhanawat
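For reference, a minimal sketch of the two equivalent entry points, assuming a spark-shell style session (Spark 1.x) with sc and sqlContext already in scope and a made-up Person case class:

case class Person(name: String, age: Int)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Person("a", 30), Person("b", 25)))

// toDF() delegates to sqlContext.createDataFrame; the conversion to Catalyst's
// internal row format happens partition by partition, with no shuffle.
val df1 = rdd.toDF()
val df2 = sqlContext.createDataFrame(rdd)  // explicit form, same result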

Behind the scene of RDD to DataFrame

2016-02-20 Thread Weiwei Zhang
Hi there, Could someone explain to me what goes on behind the scenes of rdd.toDF()? More importantly, will this step involve a lot of shuffles and cause a surge in the size of intermediate files? Thank you. Best Regards, Vivian

Re: Element appear in both 2 splits of RDD after using randomSplit

2016-02-20 Thread Ted Yu
Have you looked at SPARK-12662 (Fix DataFrame.randomSplit to avoid creating overlapping splits)? Cheers On Sat, Feb 20, 2016 at 7:01 PM, tuan3w wrote: > I'm training a model using MLLib. When I try to split data into training > and > test data, I found a weird problem. I
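For the RDD variant of randomSplit, a commonly suggested mitigation (not confirmed in this thread, so treat it as a hedged sketch) is to persist the parent RDD before splitting, so both sampling passes see the data in the same order; a re-evaluated distinct() can reorder elements and place one in both splits:

val logData = rdd.map(x => (x._1, x._2)).distinct().persist()  // stable input for both sampling passes
val Array(training, test) = logData.randomSplit(Array(0.8, 0.2), seed = 11L)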

Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread Glen
Any example code? In pyspark: sqlContext.sql("use mytable") my_df.saveAsTable("tmp_spark_debug", mode="overwrite") 1. The code above does not seem to register the table in Hive. I have to create the table from HDFS in Hive instead, and it reports a format error: rcfile and parquet. 2. Rerun the saveAsTable

Element appear in both 2 splits of RDD after using randomSplit

2016-02-20 Thread tuan3w
I'm training a model using MLlib. When I try to split the data into training and test sets, I found a weird problem and I can't figure out what is happening here. Here is my code from the experiment: val logData = rdd.map(x => (x._1, x._2)).distinct() val ratings: RDD[Rating] = logData.map(x =>

Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread gen tang
Hi, You can use sqlContext.sql("use <database>") before you call dataframe.saveAsTable. Hope it could be helpful. Cheers Gen On Sun, Feb 21, 2016 at 9:55 AM, Glen wrote: > For dataframe in spark, so the table can be visited by hive. > > -- > Jacky Wang >
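A minimal sketch of that suggestion in Scala (Spark 1.x; sqlContext is assumed to be a HiveContext and "mydb" is a hypothetical, pre-existing Hive database):

sqlContext.sql("use mydb")                                   // switch the current database
myDF.write.mode("overwrite").saveAsTable("tmp_spark_debug")  // table lands in mydb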

how to set database in DataFrame.saveAsTable?

2016-02-20 Thread Glen
For a DataFrame in Spark, so that the table can be accessed from Hive. -- Jacky Wang

Re: Fair scheduler pool details

2016-02-20 Thread Mark Hamstra
It's 2 -- and it's pretty hard to point to a line of code, a method, or even a class since the scheduling of Tasks involves a pretty complex interaction of several Spark components -- mostly the DAGScheduler, TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as well as the

RE: Spark JDBC connection - data writing success or failure cases

2016-02-20 Thread Mich Talebzadeh
OK, as I understand it you mean pushing data from Spark to an Oracle database via JDBC? Correct. There are a number of ways to do so. The most common way is using Sqoop to get the data from an HDFS file or Hive table into the Oracle database. With Spark you can also use that method by storing the data in Hive
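As a hedged alternative to the Sqoop route, Spark's own DataFrameWriter can push a DataFrame straight to Oracle over JDBC (Spark 1.4+); the URL, credentials and table name below are placeholders:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "scratch_user")               // placeholder credentials
props.setProperty("password", "xxxx")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

df.write
  .mode("append")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "SCRATCH.SALES", props)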

Fair scheduler pool details

2016-02-20 Thread Eugene Morozov
Hi, I'm trying to understand how this thing works underneath. Let's say I have two types of jobs - highly important ones that use a small number of cores and have to run pretty fast, and less important but greedy ones that use as many cores as are available. So, the idea is to use two corresponding pools.
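A rough sketch of wiring this up (the pool names, the file path and the comments are assumptions for illustration; pool properties such as weight and minShare live in the allocation XML file):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("pools-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs pick up their pool from a thread-local property, so each job type can be
// submitted from its own thread with its own pool.
sc.setLocalProperty("spark.scheduler.pool", "fast")    // small, latency-sensitive jobs
// ... submit the important job ...
sc.setLocalProperty("spark.scheduler.pool", "greedy")  // best-effort, core-hungry jobs
// ... submit the greedy job ...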

Re: Spark Job Hanging on Join

2016-02-20 Thread Dave Moyers
Try this setting in your Spark defaults: spark.sql.autoBroadcastJoinThreshold=-1 I had a similar problem with joins hanging and that resolved it for me. You might be able to pass that value from the driver as a --conf option, but I have not tried that and I'm not sure whether it will work. Sent
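If editing spark-defaults.conf is not an option, the same setting can be applied per application; a small sketch (Spark 1.x SQLContext API):

// spark-submit ... --conf spark.sql.autoBroadcastJoinThreshold=-1
// or programmatically, before running the join:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")  // disables automatic broadcast joins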

RE: Checking for null values when mapping

2016-02-20 Thread Mich Talebzadeh
OK, it turned out a bit less complicated than I thought :). I would be interested to know whether creating a temporary table from the DF and pushing the data into Hive is the best option here? 1. Prepare and clean up the data with filter & map 2. Convert the RDD to a DF 3. Create a temporary table from the DF 4.
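A rough sketch of those steps (spark-shell style; column, table and database names are made up, the truncated step 4 is assumed here to be the actual insert into Hive, and sqlContext is assumed to be a HiveContext):

import sqlContext.implicits._

val cleaned = rdd
  .filter(_.nonEmpty)                                                           // 1. filter & map clean-up
  .map(_.split(",") match { case Array(inv, date, net) => (inv, date, net) })

val df = cleaned.toDF("invoice", "payment_date", "net")                         // 2. RDD -> DF
df.registerTempTable("tmp_invoices")                                            // 3. temporary table from DF
sqlContext.sql("INSERT INTO TABLE mydb.invoices SELECT * FROM tmp_invoices")    // 4. push into Hive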

Re: JDBC based access to RDD

2016-02-20 Thread Shyam Sarkar
JdbcRDD.scala code is under the source code directory spark/core/src/main/scala/org/apache/spark/rdd Thanks. On Sat, Feb 20, 2016 at 11:51 AM, Shyam Sarkar wrote: > I was going through the scala code implementing RDD in Spark 1.6 source > code and I found
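For anyone landing here for usage rather than the source, a hedged sketch of constructing a JdbcRDD directly (the query needs two '?' placeholders for the partition bounds; the URL and table are placeholders):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val people = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost/test"),
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  lowerBound = 1, upperBound = 100000, numPartitions = 10,
  mapRow = rs => (rs.getInt(1), rs.getString(2)))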

Re: Spark Streaming: Is it possible to schedule multiple active batches?

2016-02-20 Thread Neelesh
spark.streaming.concurrentJobs may help. Experimental according to TD from an older thread here http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming On Sat, Feb 20, 2016 at 11:24 AM, Jorge Rodriguez wrote: > > Is it possible to
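A small example of setting it (experimental, as noted above, so treat with care):

val conf = new org.apache.spark.SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.concurrentJobs", "3")  // allow up to 3 batch jobs to run concurrently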

Spark Streaming: Is it possible to schedule multiple active batches?

2016-02-20 Thread Jorge Rodriguez
Is it possible to have the scheduler schedule the next batch even if the previous batch has not completed yet? I'd like to schedule up to 3 batches at the same time if this is possible.

Re: Access to broadcasted variable

2016-02-20 Thread Ilya Ganelin
It gets serialized once per physical container, instead of being serialized once per task of every stage that uses it. On Sat, Feb 20, 2016 at 8:15 AM jeff saremi wrote: > Is the broadcasted variable distributed to every executor or every worker? > Now i'm more confused >
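A tiny usage sketch for context (the lookup map is made up): the value is shipped once to each executor JVM, and all tasks on that executor read the same local copy.

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val values = sc.parallelize(Seq("a", "b", "c"))
  .map(k => lookup.value.getOrElse(k, 0))  // reads the executor-local copy, not re-shipped per task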

Re: Spark executor Memory profiling

2016-02-20 Thread Nirav Patel
Thanks Nilesh. I don't think there's heavy communication between the driver and executors. However, I'll try the settings you suggested. I cannot replace groupBy with reduceByKey, as it's not an associative operation. It is very frustrating, to be honest. It was a piece of cake with MapReduce compared to

RE: Access to broadcasted variable

2016-02-20 Thread jeff saremi
Is the broadcasted variable distributed to every executor or every worker? Now I'm more confused. I thought it was supposed to save memory by distributing it to every worker, with the executors sharing that copy. Date: Fri, 19 Feb 2016 16:48:59 -0800 Subject: Re: Access to broadcasted variable

Re: Read files dynamically having different schema under one parent directory + scala + Spakr 1.5,2

2016-02-20 Thread Chandeep Singh
Here is how you can list all HDFS directories for a given path: val hadoopConf = new org.apache.hadoop.conf.Configuration() val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://<namenode>:8020"), hadoopConf) val c = hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/user/csingh/"))
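Building on that snippet, the sub-directory paths can then be pulled out of the returned Array[FileStatus] and fed to separate reads, e.g. (a hedged follow-up, not from the original mail):

val dirs = c.filter(_.isDirectory).map(_.getPath.toString)
dirs.foreach(println)  // each path can now be loaded with its own schema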

spark.driver.maxResultSize doesn't work in conf-file

2016-02-20 Thread AlexModestov
I have the line spark.driver.maxResultSize=0 in spark-defaults.conf, but I get an error: "org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 18 tasks (1070.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)" But if I write --conf

Re: Read files dynamically having different schema under one parent directory + scala + Spakr 1.5,2

2016-02-20 Thread Divya Gehlot
Hi, @Umesh: Your understanding is partially correct as per my requirement. The idea I am trying to implement, and the steps I am trying to follow (not sure how feasible it is, I am a newbie to Spark and Scala): 1. List all the files under the parent directory hdfs:///Testdirectory/ as a list. For

Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Ah. Ok. > On Feb 20, 2016, at 2:31 PM, Mich Talebzadeh wrote: > > Yes I did that as well but no joy. My shell does it for windows files > automatically > > Thanks, > > Dr Mich Talebzadeh > > LinkedIn >

Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Also, have you looked into dos2unix (http://dos2unix.sourceforge.net/)? It has helped me in the past to deal with special characters while using Windows-based CSVs in Linux. (Might not be the solution here.. Just an FYI :)) > On Feb 20, 2016, at 2:17 PM,

RE: Checking for null values when mapping

2016-02-20 Thread Mich Talebzadeh
Yes I did that as well but no joy. My shell does it for windows files automatically Thanks, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Understood. In that case Ted’s suggestion to check the length should solve the problem. > On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh wrote: > > Hi, > > That is a good question. > > When data is exported from CSV to Linux, any character that cannot be > transformed

RE: Checking for null values when mapping

2016-02-20 Thread Mich Talebzadeh
Hi, That is a good question. When data is exported from CSV to Linux, any character that cannot be transformed is replaced by ?. That question mark is not actually the expected “?” :) So the only way I can get rid of it is by dropping the first character using substring(1). I

Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for that as well? Then you wouldn’t run into index-out-of-bounds issues. val a = "?1,187.50" val b = "" println(a.substring(1).replace(",", "")) —> 1187.50 println(a.replace("?", "").replace(",", ""))

Re: Streaming with broadcast joins

2016-02-20 Thread Srikanth
Sebastian, *Update:* This is not possible, and it will probably remain this way for the foreseeable future. https://issues.apache.org/jira/browse/SPARK-3863 Srikanth On Fri, Feb 19, 2016 at 10:20 AM, Sebastian Piu wrote: > I don't have the code with me now, and I ended

Re: Checking for null values when mapping

2016-02-20 Thread Ted Yu
For #2, you can filter out rows whose first column has length 0. Cheers > On Feb 20, 2016, at 6:59 AM, Mich Talebzadeh wrote: > > Thanks > > > So what I did was > > scala> val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", >
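A hedged illustration of that filter against the DataFrame quoted below, using the length function available since Spark 1.5 (the column name is taken from the CSV header in the thread):

import org.apache.spark.sql.functions.length
val nonEmpty = df.filter(length(df("Invoice Number")) > 0)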

RE: Checking for null values when mapping

2016-02-20 Thread Mich Talebzadeh
Thanks So what I did was scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string,

Is this likely to cause any problems?

2016-02-20 Thread Teng Qiu
@Daniel, there are at least 3 things that EMR cannot solve yet: - HA support - AWS provides an auto-scaling feature, but scaling EMR up/down needs manual operations - security concerns in a public VPC. EMR is basically designed for short-term use cases with some pre-defined bootstrap actions

Re: How to get the code for class in spark

2016-02-20 Thread Michał Zieliński
Probably you mean reflection: https://stackoverflow.com/questions/2224251/reflection-on-a-scala-case-class On 19 February 2016 at 15:14, Ashok Kumar wrote: > Hi, > > class body thanks > > > On Friday, 19 February 2016, 11:23, Ted Yu wrote: > >
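A hedged sketch of that approach with scala-reflect (Person is a made-up example class):

import scala.reflect.runtime.universe._

case class Person(name: String, age: Int)

// List the case-class accessor fields and their types.
val fields = typeOf[Person].members.collect {
  case m: MethodSymbol if m.isCaseAccessor => s"${m.name}: ${m.returnType}"
}
fields.foreach(println)  // prints e.g. "age: Int" and "name: String"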

Re: Checking for null values when mapping

2016-02-20 Thread Michał Zieliński
You can use filter and isNotNull on Column before the map. On 20 February 2016 at 08:24, Mich Talebzadeh wrote: > > > I have a DF like below reading a csv file > > > > > > val df = >
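For example (a sketch; the column name is assumed from the CSV header earlier in the thread):

val clean = df.filter(df("Invoice Number").isNotNull)
val mapped = clean.map(x => (x.getString(0), x.getString(1)))  // the map now only sees non-null rows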

Re: spark stages in parallel

2016-02-20 Thread Hemant Bhanawat
Not possible as of today. See https://issues.apache.org/jira/browse/SPARK-2387 Hemant Bhanawat https://www.linkedin.com/in/hemant-bhanawat-92a3811 www.snappydata.io On Thu, Feb 18, 2016 at 1:19 PM, Shushant Arora wrote: > can two stages of single job run in parallel

Checking for null values when mapping

2016-02-20 Thread Mich Talebzadeh
I have a DF like below reading a csv file val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",",