Unable to access hive table (created through hive context) in hive console

2015-12-07 Thread Divya Gehlot
Hi, I am a newbie to Spark and using HDP 2.2, which comes with Spark 1.3.1. I tried the following code example > import org.apache.spark.sql.SQLContext > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > > val personFile = "/user/hdfs/TestSpark/Person.csv" >

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-03 Thread Divya Gehlot
Hello, I have the same queries in mind. What are the advantages of using EC2 compared to normal servers for Spark and other big data product development? Hope to get inputs from the community. Thanks, Divya On Dec 4, 2015 6:05 AM, "Andy Davidson"

how to skip headers when reading multiple files

2015-12-02 Thread Divya Gehlot
Hi, I am a newbie to Spark and Scala. One of my requirements is to read and process multiple text files with headers using the DataFrame API. How can I skip headers when processing data with the DataFrame API? Thanks in advance. Regards, Divya
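A minimal sketch of one way to do this from the spark-shell, assuming the spark-csv package is on the classpath and a hypothetical input directory; the header option drops the first line of each file and uses it for column names:

    // Spark 1.4+ DataFrameReader syntax; on 1.3 the equivalent is sqlContext.load(...) with an options map
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")                  // skip and reuse the header line of every file
      .load("/user/hdfs/TestSpark/*.csv")        // glob over multiple CSV files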

persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Divya Gehlot
Hi, I am a newbie to Spark. Could somebody guide me on how I can persist my Spark RDD results in Hive using the saveAsTable API? Would appreciate it if you could provide an example for a Hive external table. Thanks in advance.
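A minimal sketch, assuming a HiveContext in the spark-shell; saveAsTable creates a managed Hive table (for an external table, the usual pattern is to write the files to an HDFS location and issue a CREATE EXTERNAL TABLE over that path instead). Table and column names below are hypothetical:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // toy DataFrame; in practice this would be the processed RDD converted to a DataFrame
    val df = sc.parallelize(Seq(("alice", 35), ("bob", 24))).toDF("name", "age")
    df.write.mode(SaveMode.Overwrite).saveAsTable("person_hive")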

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Divya Gehlot
sqlContext.read.json(rdd) > df.saveAsTable(“your_table_name") > > > > > On Dec 7, 2015, at 5:28 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > > > > Hi, > > I am new bee to Spark. > > Could somebody guide me how can I persist my spark RD

getting error while persisting in hive

2015-12-09 Thread Divya Gehlot
Hi, I am using Spark 1.4.1. I am getting an error when persisting Spark dataframe output to Hive > scala> > df.select("name","age").write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("PersonHiveTable"); > :39: error: org.apache.spark.sql.DataFrameWriter does not take >
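The compile error comes from the parentheses: in Spark 1.4/1.5, DataFrame.write is a parameterless method returning a DataFrameWriter. A minimal corrected sketch (saving to a path here; whether saveAsTable works with the CSV data source depends on the spark-csv version, so the output path is hypothetical):

    import org.apache.spark.sql.SaveMode

    df.select("name", "age")
      .write                                     // no parentheses
      .format("com.databricks.spark.csv")
      .mode(SaveMode.Append)
      .save("/tmp/person_csv_out")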

set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Divya Gehlot
Hi, As per requirement I need to use Spark 1.4.1, but HDP doesn't come with Spark 1.4.1. As instructed in this Hortonworks page I am able to set up Spark 1.4 in HDP, but when I run the spark shell It

Difference between Local Hive Metastore server and A Hive-based Metastore server

2015-12-17 Thread Divya Gehlot
Hi, I am a newbie to Spark and using 1.4.1. Got confused between a local metastore server and a Hive-based metastore server. Can somebody share the use cases for when to use which one, and the pros and cons? I am using HDP 2.3.2 in which hive-site.xml is already in the Spark configuration directory that means

Pros and cons -Saving spark data in hive

2015-12-15 Thread Divya Gehlot
Hi, I am a newbie to Spark and I am exploring the options, pros, and cons of which approach will work best with Spark and hive context. My dataset inputs are CSV files; I am using Spark to process my data and saving it in Hive using hivecontext. 1) Process the CSV file using the spark-csv package and create

org.apache.spark.SparkException: Task failed while writing rows.+ Spark output data to hive table

2015-12-10 Thread Divya Gehlot
Hi, I am using HDP2.3.2 with Spark 1.4.1 and trying to insert data in hive table using hive context. Below is the sample code 1. spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m 2. //Sample code 3. import org.apache.spark.sql.SQLContext 4. import

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Divya Gehlot
> From reading the post, it appears that you resolved this issue. Great job! > > I would recommend putting the solution here as well so that it helps > another developer down the line. > > Thanks > > > On Monday, December 28, 2015 8:56 PM, Divya Gehlot < > divya.htco.

Error while starting Zeppelin Service in HDP2.3.2

2015-12-30 Thread Divya Gehlot
Hi, I am getting following error while starting the Zeppelin service from ambari server . /var/lib/ambari-agent/data/errors-2408.txt Traceback (most recent call last): File "/var/lib/ambari-agent/cache/stacks/HDP/2.3/services/ZEPPELIN/package/scripts/master.py", line 295, in

error creating custom schema

2015-12-23 Thread Divya Gehlot
Hi, I am trying to create custom schema but its throwing below error scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
15/12/28 00:25:40 INFO YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool 15/12/28 00:25:40 INFO DAGScheduler: Job 2 finished: saveAsTextFile at package.scala:157, took 9.293578 s P.S. : Attaching the output fi

DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
Hi, I am trying to join two dataframes and am able to display the results in the console after the join. I am saving the joined data in CSV format using the spark-csv api. It's just saving the column names, not the data at all. Below is the sample code for reference: spark-shell

DataFrame Vs RDDs ... Which one to use When ?

2015-12-27 Thread Divya Gehlot
Hi, I am a newbie to Spark and a bit confused about RDDs and DataFrames in Spark. Can somebody explain to me, with use cases, which one to use when? Would really appreciate the clarification. Thanks, Divya

returns empty result set when using TimestampType and NullType as StructType +DataFrame +Scala + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
SQL context available as sqlContext. > > scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new

Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
Hi, I have an input data set which is a CSV file with date columns. My output will also be a CSV file, and I will be using this output CSV file for hive table creation. I have a few queries: 1. I tried using a custom schema with Timestamp but it is returning an empty result set when querying the

returns empty result set when using TimestampType and NullType as StructType +DataFrame +Scala + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
> > SQL context available as sqlContext. > > scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
) are getting created just for small data set Attaching the dataset files too. On 28 December 2015 at 13:29, Divya Gehlot <divya.htco...@gmail.com> wrote: > yes > Sharing the execution flow > > 15/12/28 00:19:15 INFO SessionState: No Tez session required at this > point. hiv

[Spark 1.4.1] StructField for date column in CSV file while creating custom schema

2015-12-28 Thread Divya Gehlot
Hi, I am a newbie to Spark. My apologies for such a naive question. I am using Spark 1.4.1 and writing code in Scala. I have input data as a CSV file which I am parsing using the spark-csv package. I am creating a custom schema to process the CSV file. Now my query is which datatype or can say
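One workable pattern (a sketch, not the only option) is to declare the date column as StringType in the custom schema and cast it after loading; column names and the path below are hypothetical:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Seq(
      StructField("id", IntegerType, true),
      StructField("event_date", StringType, true)   // load dates as plain strings first
    ))

    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("/path/to/input.csv")

    // cast the string column to a proper DateType column once the data is loaded
    val df = raw.withColumn("event_date", raw("event_date").cast(DateType))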

map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread Divya Gehlot
Hi, I have HDP2.3.2 cluster installed in Amazon EC2. I want to update the IP adress of spark.driver.appUIAddress,which is currently mapped to private IP of EC2. Searched in spark config in ambari,could find spark.driver.appUIAddress property. Because of this private IP mapping,the spark webUI

configure spark for hive context

2015-12-21 Thread Divya Gehlot
Hi, I am trying to configure Spark for hive context (please don't mistake this for Hive on Spark). I placed hive-site.xml in spark/CONF_DIR. Now when I run spark-shell I am getting the below error. Versions which I am using: *Hadoop 2.6.2 Spark 1.5.2 Hive 1.2.1 * Welcome to >

error while defining custom schema in Spark 1.5.0

2015-12-22 Thread Divya Gehlot
Hi, I am a newbie to Apache Spark, using the CDH 5.5 Quick Start VM with Spark 1.5.0. I am working on a custom schema and getting an error import org.apache.spark.sql.hive.HiveContext >> >> scala> import org.apache.spark.sql.hive.orc._ >> import org.apache.spark.sql.hive.orc._ >> >> scala> import

queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
> > Hi, As a beginner, I have the below queries on Spork (Pig on Spark). I have cloned git clone https://github.com/apache/pig -b spark . 1. On which version of Pig and Spark is Spork being built? 2. I followed the steps mentioned in https://issues.apache.org/jira/browse/PIG-4059 and tried to

Re: queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
com> wrote: > >>> Details at logfile: /home/pig/pig_1448425672112.log > > You need to check the log file for details > > > > > On Wed, Nov 25, 2015 at 1:57 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >> Hi, >> >> >> As a b

pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-04 Thread Divya Gehlot
Hi, I have two input datasets. The first input dataset is like below: year,make,model,comment,blank > "2012","Tesla","S","No comment", > 1997,Ford,E350,"Go get one now they are going fast", > 2015,Chevy,Volt Second input dataset: TagId,condition > 1997_cars,year = 1997 and
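A minimal sketch of one approach, assuming the conditions dataset is small enough to collect to the driver and using hypothetical DataFrame names (carsDF, conditionsDF); each condition string is applied with filter, which accepts a SQL expression:

    val tagged = conditionsDF.collect().map { row =>
      val tagId = row.getString(0)                 // TagId column
      val cond  = row.getString(1)                 // e.g. "year = 1997 and model = 'E350'"
      (tagId, carsDF.filter(cond))                 // one filtered DataFrame per condition
    }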

add new column in the schema + Dataframe

2016-02-04 Thread Divya Gehlot
Hi, I am a beginner in Spark and using Spark 1.5.2 on YARN (HDP 2.3.4). I have a use case where I have to read two input files, and based on certain conditions in the second input file, have to add a new column to the first input file and save it. I am using spark-csv to read my input files. Would

Passing a dataframe to where clause + Spark SQL

2016-02-10 Thread Divya Gehlot
Hi, //Loading all the DB Properties val options1 = Map("url" -> "jdbc:oracle:thin:@xx.xxx.xxx.xx:1521:dbname","user"->"username","password"->"password","dbtable" -> "TESTCONDITIONS") val testCond = sqlContext.load("jdbc",options1 ) val condval = testCond.select("Cond") testCond.show() val

Spark : Unable to connect to Oracle

2016-02-10 Thread Divya Gehlot
Hi, I am a newbie to Spark and using Spark 1.5.2. I am trying to connect to an Oracle DB using the Spark API and getting errors. Steps I followed: Step 1- I placed the ojdbc6.jar in /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar Step 2- Registered the jar file

how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Divya Gehlot
Hi, I would like to know how to calculate how much --executor-memory we should allocate, and how many num-executors and total-executor-cores we should give while submitting Spark jobs. Is there any formula for it? Thanks, Divya
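There is no official formula, but a commonly quoted rule of thumb (a rough sketch only, for a hypothetical cluster of 6 worker nodes with 16 cores and 64 GB RAM each) works out like this:

    # leave ~1 core and ~1 GB per node for the OS and Hadoop daemons -> 15 cores, 63 GB usable
    # ~5 cores per executor keeps HDFS I/O throughput reasonable     -> 15 / 5 = 3 executors per node
    # memory per executor = 63 GB / 3 ~= 21 GB, minus ~7% YARN overhead ~= 19 GB
    # reserve one executor for the YARN ApplicationMaster            -> 6 * 3 - 1 = 17 executors
    spark-submit --num-executors 17 --executor-cores 5 --executor-memory 19G ...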

Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
cala/java/python, > it would be best to use the Dataframe API for creating dynamic SQL > queries. See > http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details. > > On Feb 2, 2016, at 6:49 PM, Divya Gehlot <divya.htco...@gmail.com> wrote: > > Hi, > D

Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
Hi, Does Spark support dynamic SQL? Would really appreciate the help if anyone could share some references/examples. Thanks, Divya

[Query] : How to read null values in Spark 1.5.2

2016-02-24 Thread Divya Gehlot
Hi, I have a data set (the source is a database) which has null values. When I define the custom schema with any type except string type, I get a number format exception on the null values. Has anybody come across this kind of scenario? Would really appreciate it if you can share your resolution or

[Vote] : Spark-csv 1.3 + Spark 1.5.2 - Error parsing null values except String data type

2016-02-23 Thread Divya Gehlot
Hi, Please vote if you have ever faced this issue. I am getting error when parsing null values with Spark-csv DataFile : name age alice 35 bob null peter 24 Code : spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 --master yarn-client -i /TestDivya/Spark/Testnull.scala Testnull.scala

[Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Hi, I have a dataset which looks like below: name age alice 35 bob null peter 24 I need to replace the null values of the columns with 0, so I referred to the Spark API DataFrameNaFunctions.scala
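A minimal sketch, assuming the age column is numeric (if spark-csv loaded it as a string, cast it first); na.fill replaces nulls in the named column with the supplied value:

    import org.apache.spark.sql.functions.col

    val withIntAge = df.withColumn("age", col("age").cast("int"))   // only needed if age arrived as a string
    val cleaned    = withIntAge.na.fill(Map("age" -> 0))            // null ages become 0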

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Jan Štěrba <i...@jansterba.com> wrote: > just use coalesce function > > df.selectExpr("name", "coalesce(age, 0) as age") > > -- > Jan Sterba > https://twitter.com/honzasterba | http://flickr.com/honzasterba | > http://500px.com/honzasterba > >

Re: which master option to view current running job in Spark UI

2016-02-24 Thread Divya Gehlot
rce Manager UI to get to >> the ApplicationMaster url, irrespective of client or cluster mode. >> >> Regards >> Sab >> On 15-Feb-2016 10:10 am, "Divya Gehlot" <divya.htco...@gmail.com> wrote: >> >>> Hi, >>> I have Hortonworks 2.

[Help]: Steps to access hive table + Spark 1.5.2 + HbaseIntegration + Hive 1.2 + Hbase 1.1

2016-02-29 Thread Divya Gehlot
Hi, Can anybody help me by sharing the steps/examples for how to connect to a hive table (which is being created using HbaseIntegration) through hivecontext in Spark? I googled but couldn't find a single example/document. Would really

[Example] : Save dataframes with different schema + Spark 1.5.2 and Dataframe + Spark-CSV package

2016-02-22 Thread Divya Gehlot
Hi, My use case: I have two datasets like below: year make model comment blank Carname 2012 Tesla S No comment 1997 Ford E350 Go get one now they are going fast MyFord 2015 Chevy Volt 2016 Mercedes Dataset2 carowner year make model John 2012 Tesla S David Peter 1997 Ford E350 Paul 2015 Chevy Volt

Group by Dynamically

2016-01-24 Thread Divya Gehlot
Hi, I have two files File1 Group by Condition Field1 Y Field 2 N Field3 Y File2 is data file having field1,field2,field3 etc.. field1 field2 field3 field4 field5 data1 data2 data3 data4 data 5 data11 data22 data33 data44 data 55 Now my requirement is to group

IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Divya Gehlot
Hi, I am starting spark-shell with following options : spark-shell --properties-file /TestDivya/Spark/Oracle.properties --jars /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages com.databricks:spark-csv_2.10:1.1.0 --master

Difference between spark-shell and spark-submit. Which one to use when?

2016-02-14 Thread Divya Gehlot
Hi, I would like to know difference between spark-shell and spark-submit in terms of real time scenarios. I am using Hadoop cluster with Spark on EC2. Thanks, Divya

which master option to view current running job in Spark UI

2016-02-14 Thread Divya Gehlot
Hi, I have a Hortonworks 2.3.4 cluster on EC2 and have Spark jobs as Scala files. I am a bit confused between the *master *options. I want to execute this Spark job in YARN. Currently running as spark-shell --properties-file /TestDivya/Spark/Oracle.properties --jars

Need help: Does anybody have an HDP cluster on EC2?

2016-02-15 Thread Divya Gehlot
Hi, I have a Hadoop cluster set up in EC2. I am unable to view application logs in the Web UI as it's using the internal IP, like below: http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042 How can I change this to the external one or

Re: Need help: Does anybody have an HDP cluster on EC2?

2016-02-15 Thread Divya Gehlot
icMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html > > Regards > Sab > > On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >> Hi, >> I have hadoop cluster set up in EC2. >> I am unable to view application logs in Web UI a

which is better RDD or Dataframe?

2016-02-15 Thread Divya Gehlot
Hi, I would like to know which gives better performance, RDDs or dataframes? For one scenario: 1. Read the file as an RDD, register it as a temp table and fire SQL queries 2. Read the file through the Dataframe API, or convert the RDD to a dataframe and use dataframe APIs to process the data. For the

SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Divya Gehlot
Hi, SparkonHBase is integrated with which version of Spark and HBase ? Thanks, Divya

Spark JDBC connection - data writing success or failure cases

2016-02-18 Thread Divya Gehlot
Hi, I have a Spark job which connects to an RDBMS (in my case it's Oracle). How can we check that the complete data write was successful? Can I use commit in case of success or rollback in case of failure? Thanks, Divya

Re: Spark History Server NOT showing Jobs with Hortonworks

2016-02-18 Thread Divya Gehlot
Hi Sutanu , When you run your spark shell you would see below lines in your console 16/02/18 21:43:53 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041 16/02/18 21:43:53 INFO Utils: Successfully started service 'SparkUI' on port 4041. 16/02/18 21:43:54 INFO SparkUI: Started

Read files dynamically having different schema under one parent directory + scala + Spark 1.5.2

2016-02-19 Thread Divya Gehlot
Hi, I have a use case where I have one parent directory. The file structure looks like hdfs:///TestDirectory/spark1/part files (created by some spark job) hdfs:///TestDirectory/spark2/part files (created by some spark job). spark1 and spark2 have different schemas, like the spark1 part files schema

Re: Read files dynamically having different schema under one parent directory + scala + Spark 1.5.2

2016-02-20 Thread Divya Gehlot
ch schema respective to that > sub-directory. > > 2) If you don't know the sub-directory names: > You need to store schema somewhere inside that sub-directory and read > it in iteration. > > On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <divya.htco...@gmail.com> > wr

[Example] : read custom schema from file

2016-02-21 Thread Divya Gehlot
Hi, Can anybody help me by providing an example of how we can read the schema of a data set from a file? Thanks, Divya

Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Divya Gehlot
Hi, I am trying to dynamically create Dataframe by reading subdirectories under parent directory My code looks like > import org.apache.spark._ > import org.apache.spark.sql._ > val hadoopConf = new org.apache.hadoop.conf.Configuration() > val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new >
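The type mismatch in this pattern usually comes from handing a Hadoop Path object to load(...), which expects a String; a minimal sketch (parent directory is hypothetical) with the .toString conversion in place:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val subDirs = fs.listStatus(new Path("hdfs:///TestDirectory"))
      .filter(_.isDirectory)
      .map(_.getPath)

    val frames = subDirs.map { p =>
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(p.toString)                          // convert the Path to a String here
    }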

[ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Divya Gehlot
Hi, I am getting an error when I am trying to connect to a hive table (which is being created through HbaseIntegration) in Spark. Steps I followed: *Hive Table creation code *: CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH

Spark with Yarn Client

2016-03-11 Thread Divya Gehlot
Hi, I am trying to understand the behaviour/configuration of Spark with YARN client on a Hadoop cluster. Can somebody help me or point me to documents/blogs/books which give a deeper understanding of the above two? Thanks, Divya

append rows to dataframe

2016-03-13 Thread Divya Gehlot
Hi, Please bear with me for asking such a naive question. I have a list of conditions (dynamic sqls) sitting in an HBase table. I need to iterate through those dynamic sqls and add the data to dataframes. As we know dataframes are immutable, so when I try to iterate in a for loop as shown below I get only the last
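Since DataFrames are immutable, the usual workaround is to keep re-assigning a var with unionAll inside the loop (all pieces must share the same schema); a minimal sketch with hypothetical names (targetSchema, dynamicSqls):

    import org.apache.spark.sql.Row

    // start from an empty DataFrame with the target schema
    var result = sqlContext.createDataFrame(sc.emptyRDD[Row], targetSchema)

    for (dynamicSql <- dynamicSqls) {              // dynamicSqls: condition strings read from HBase
      result = result.unionAll(sqlContext.sql(dynamicSql))
    }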

[Error] : dynamically union All + adding new column

2016-03-19 Thread Divya Gehlot
Hi, I am dynamically doing union all and adding new column too val dfresult = > dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9") > val schemaL = dfresult.schema > var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL) > for ((key,values) <- lcrMap)

[Spark-1.5.2]Column renaming with withColumnRenamed has no effect

2016-03-19 Thread Divya Gehlot
Hi, I am adding a new column and renaming it at the same time, but the renaming doesn't have any effect. dffiltered = >

convert row to map of key as int and values as arrays

2016-03-15 Thread Divya Gehlot
Hi, As I can't add columns from another Dataframe, I am planning to convert my row columns to a map of keys and arrays. As I am new to Scala and Spark I am trying like below // create an empty map import scala.collection.mutable.{ArrayBuffer => mArrayBuffer} var map = Map[Int,mArrayBuffer[Any]]() def

[How To :]Custom Logging of Spark Scala scripts

2016-03-14 Thread Divya Gehlot
Hi, Can somebody point out how I can configure custom logs for my Spark (Scala) scripts, so that I can see at which level my script failed and why? Thanks, Divya

[ASK]: Dataframe number of columns limit in Spark 1.5.2

2016-04-12 Thread Divya Gehlot
Hi, I would like to know whether the Spark Dataframe API has a limit on the number of columns that can be created? Thanks, Divya

[HELP:]Save Spark Dataframe in Phoenix Table

2016-04-07 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop cluster having the below configurations: Spark 1.5.2 HBASE 1.1.x Phoenix 4.4 I am able to connect to Phoenix through a JDBC connection and able to read the Phoenix tables. But while writing the data back to a Phoenix table I am getting the below error:

Re: declare constant as date

2016-03-21 Thread Divya Gehlot
Oh my my I am so silly I can declare it as string and cast it to date My apologies for Spamming the mailing list. Thanks, Divya On 21 March 2016 at 14:51, Divya Gehlot <divya.htco...@gmail.com> wrote: > Hi, > In Spark 1.5.2 > Do we have any utiility which converts a constant

Get the number of days dynamically in with Column

2016-03-20 Thread Divya Gehlot
I have a time stamping table which has data like: No of Days | ID: 1 | 1D, 2 | 2D, and so on till 30 days. I have another Dataframe with a start date and end date. I need to get the difference between these two dates, get the ID from the time stamping table, and do withColumn.

declare constant as date

2016-03-21 Thread Divya Gehlot
Hi, In Spark 1.5.2, do we have any utility which converts a constant value as shown below, or can we declare a date variable like val start_date :Date = "2015-03-02" val start_date = "2015-03-02" toDate, like how we convert with toInt, toString? I searched for it but couldn't find it. Thanks, Divya
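A minimal sketch of the string-then-cast approach mentioned in the follow-up reply:

    import java.sql.Date
    import org.apache.spark.sql.functions.lit

    val start_date: Date = Date.valueOf("2015-03-02")        // JVM-side constant
    val startDateCol     = lit("2015-03-02").cast("date")    // the same constant as a DataFrame Column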

[Spark-1.5.2] Dynamic creation of caseWhen expression

2016-03-23 Thread Divya Gehlot
Hi, I have a map collection. I am trying to build a when condition based on the key values, like df.withColumn("ID", when( condition with map keys, values of map )). How can I do that dynamically? Currently I am iterating over keysIterator and getting the values val keys = myMap.keysIterator.toArray
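One way to build the expression dynamically is to fold the map into a chain of when(...).otherwise(...) calls; a minimal sketch with a hypothetical map and column name:

    import org.apache.spark.sql.functions._

    val myMap = Map(1 -> "1D", 2 -> "2D", 3 -> "3M")
    val caseExpr = myMap.foldLeft(lit(null).cast("string")) {
      case (acc, (k, v)) => when(col("NoOfDays") === k, v).otherwise(acc)
    }
    val result = df.withColumn("ID", caseExpr)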

find the matching and get the value

2016-03-22 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. My requirement is as below: df.withColumn("NoOfDays",lit(datediff(df("Start_date"),df("end_date")))) Now I have to add one more column where my datediff(Start_date,end_date) should match with the map keys. The map looks like MyMap(1->1D,2->2D,3->3M,4->4W) I want to do
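An alternative to the chained when(...) sketch above is a small UDF that looks the computed day difference up in the map; a minimal sketch (assumes NoOfDays is never null):

    import org.apache.spark.sql.functions._

    val myMap    = Map(1 -> "1D", 2 -> "2D", 3 -> "3M", 4 -> "4W")
    val lookupId = udf((days: Int) => myMap.getOrElse(days, "NA"))

    val result = df
      .withColumn("NoOfDays", datediff(df("Start_date"), df("end_date")))
      .withColumn("ID", lookupId(col("NoOfDays")))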

Spark 1.5.2 -Better way to create custom schema

2016-03-04 Thread Divya Gehlot
Hi, I have a data set in HDFS. Is there any better way to define the custom schema for a data set having 100+ fields of different data types? Thanks, Divya
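One sketch of a less manual approach: keep the 100+ field definitions in a small text file (one "name,type" pair per line; the file format and path here are hypothetical) and build the StructType programmatically:

    import org.apache.spark.sql.types._

    // schema file lines look like:  year,int  |  make,string  |  price,double
    val fields = scala.io.Source.fromFile("/path/schema.txt").getLines().map { line =>
      val Array(name, typ) = line.split(",").map(_.trim)
      val dataType = typ.toLowerCase match {
        case "int"    => IntegerType
        case "double" => DoubleType
        case _        => StringType
      }
      StructField(name, dataType, nullable = true)
    }.toSeq

    val customSchema = StructType(fields)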

[Issue:]Getting null values for Numeric types while accessing hive tables (Registered on Hbase,created through Phoenix)

2016-03-03 Thread Divya Gehlot
Hi, I am registering hive table on Hbase CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE") TBLPROPERTIES ("hbase.table.name" = "TEST",

Spark 1.5.2 - Read custom schema from file

2016-03-03 Thread Divya Gehlot
Hi, I have defined a custom schema as shown below : val customSchema = StructType( > StructField("year", IntegerType, true), > StructField("make", StringType, true), > StructField("model", StringType, true), > StructField("comment", StringType, true), StructField("blank",

Steps to Run Spark Scala job from Oozie on EC2 Hadoop clsuter

2016-03-07 Thread Divya Gehlot
Hi, Could somebody help me by providing the steps / redirecting me to a blog/documentation on how to run a Spark job written in Scala through Oozie? Would really appreciate the help. Thanks, Divya

[Spark 1.5.2]: Iterate through Dataframe columns and put it in map

2016-03-02 Thread Divya Gehlot
Hi, I need to iterate through columns in a dataframe based on a certain condition and put them in a map. Dataset: Column1 Column2 Car Model1 Bike Model2 Car Model2 Bike Model2 I want to iterate through the above dataframe and put it in a map where car is the key and model1 and model 2
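A minimal sketch of one way to get that map: group by the first column on the RDD side and collect the second column's values per key back to the driver:

    val grouped: Map[String, Seq[String]] =
      df.rdd
        .map(r => (r.getString(0), r.getString(1)))   // (Column1, Column2)
        .groupByKey()
        .mapValues(_.toSeq)
        .collect()
        .toMap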

[Error]Run Spark job as hdfs user from oozie workflow

2016-03-09 Thread Divya Gehlot
Hi, I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2. I am submitting my Spark Scala script through a shell script using an Oozie workflow. I am submitting the job as the hdfs user, but it is running as user = "yarn", so all the output gets stored under the user/yarn directory only. When

Re: [SQL] Two columns in output vs one when joining DataFrames?

2016-03-28 Thread Divya Gehlot
Hi Jacek , The difference is being mentioned in Spark doc itself Note that if you perform a self-join using this function without aliasing the input * [[DataFrame]]s, you will NOT be able to reference any columns after the join, since * there is no way to disambiguate which side of the join you

[Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Phoenix both

2016-04-01 Thread Divya Gehlot
Forgot to mention I am using all DataFrame API instead of sqls to the operations -- Forwarded message -- From: Divya Gehlot <divya.htco...@gmail.com> Date: 1 April 2016 at 18:35 Subject: [Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Pheonix both To: "user @s

[Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Phoenix both

2016-04-01 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop 3 NODE cluster on EC2 with *Hadoop *version 2.7.x *Spark *version - 1.5.2 *Phoenix *version - 4.4 *Hbase *version 1.1.x *Cluster Statistics * Data Node 1 OS: redhat7 (x86_64) Cores (CPU): 2 (2) Disk: 20.69GB/99.99GB (20.69% used) Memory: 7.39GB Data

Change TimeZone Setting in Spark 1.5.2

2016-03-29 Thread Divya Gehlot
Hi, The Spark set up is on Hadoop cluster. How can I set up the Spark timezone to sync with Server Timezone ? Any idea? Thanks, Divya

Memory needs when using expensive operations like groupBy

2016-04-13 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10, and none of my Spark jobs fail with exit code 143 except the one job where I am using unionAll and groupBy operations on multiple columns. Please advise me on the options to optimize it. The one option which I am using now is --conf

Re: Spark DataFrame sum of multiple columns

2016-04-22 Thread Divya Gehlot
Easy way of doing it newdf = df.withColumn('total', sum(df[col] for col in df.columns)) On 22 April 2016 at 11:51, Naveen Kumar Pokala wrote: > Hi, > > > > Do we have any way to perform Row level operations in spark dataframes. > > > > > > For example, > > > > I have
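A Scala equivalent sketch of the same row-wise total (assumes all columns are numeric):

    import org.apache.spark.sql.functions.col

    val totalCol = df.columns.map(col).reduce(_ + _)   // Column1 + Column2 + ...
    val newdf    = df.withColumn("total", totalCol)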

Re: [Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-29 Thread Divya Gehlot
more evenly. > > 2016-04-25 9:34 GMT+07:00 Divya Gehlot <divya.htco...@gmail.com>: > >> Hi, >> >> After joining two dataframes, saving dataframe using Spark CSV. >> But all the result data is being written to only one part file whereas >> there are 200 p

Re: Can't join same dataframe twice?

2016-04-27 Thread Divya Gehlot
;a", "b") >> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") >> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), >> df2("b").as("2-b")) >> val df4 = df3.join(df2,

getting ClassCastException when calling UDF

2016-04-27 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 and defined below udf import org.apache.spark.sql.functions.udf > val myUdf = (wgts : Int , amnt :Float) => { > (wgts*amnt)/100.asInstanceOf[Float] > } > val df2 = df1.withColumn("WEIGHTED_AMOUNT",callUDF(udfcalWghts, FloatType,col("RATE"),col("AMOUNT"))) In my
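A ClassCastException in this situation usually means the columns' actual types (e.g. Double or Decimal) do not match the UDF's declared parameter types; a minimal sketch that registers the function with udf(...) and casts the inputs explicitly:

    import org.apache.spark.sql.functions._

    val calWeighted = udf((wgts: Int, amnt: Float) => (wgts * amnt) / 100f)

    val df2 = df1.withColumn(
      "WEIGHTED_AMOUNT",
      calWeighted(col("RATE").cast("int"), col("AMOUNT").cast("float")))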

Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the headers by removing the first row; you can use first() or head() to do that. Thanks, Divya On 27 April 2016 at 13:24, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header > and processing of csv files. But it
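A minimal sketch of the first()/head() approach on a plain RDD (path is hypothetical); filtering on inequality with the header also drops the header line of every file when several files share the same header:

    val rdd    = sc.textFile("/path/data.csv")
    val header = rdd.first()                    // the header line
    val data   = rdd.filter(_ != header)        // every non-header line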

Re: Can't join same dataframe twice?

2016-04-26 Thread Divya Gehlot
d this is clear. > > Thought? > > // maropu > > > > > > > > On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com> > wrote: > >> Also, check the column names of df1 ( after joining df2 and df3 ). >> >> Prasad. >>

[Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-24 Thread Divya Gehlot
Hi, After joining two dataframes, I am saving the dataframe using Spark CSV. But all the result data is being written to only one part file, whereas there are 200 part files being created; the remaining 199 part files are empty. What is the cause of the uneven partitioning? How can I evenly distribute the data?
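If the join key is skewed, most rows end up in one shuffle partition; repartitioning before the write (on a fixed partition count or a better-distributed column) spreads the output. A minimal sketch with hypothetical names:

    val balanced = joinedDF.repartition(50)      // partition count is illustrative only
    balanced.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/joined_output")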

Can't join same dataframe twice?

2016-04-25 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. I have a use case where I need to join the same dataframe twice on two different columns. I am getting a "missing columns" error. For instance, val df1 = df2.join(df3,"Column1") The below throws the missing columns error: val df4 = df1.join(df3,"Column2") Is this a bug or valid
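After the first join the result carries columns from both inputs, so the second join needs disambiguation; aliasing the frames is one way to do it (a minimal sketch mirroring the variable names above, with hypothetical column names):

    import org.apache.spark.sql.functions.col

    val a = df2.as("a")
    val b = df3.as("b")

    val df1 = a.join(b, col("a.Column1") === col("b.Column1"))
    val df4 = df1.join(df3.as("c"), col("a.Column2") === col("c.Column2"))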

[Ask :]Best Practices - Application logging in Spark 1.5.2 + Scala 2.10

2016-04-21 Thread Divya Gehlot
Hi, I am using Spark with a Hadoop 2.7 cluster. I need to print all my print statements and/or any errors to a file, for instance some info if it passed some level, or some error if something is missing in my Spark Scala script. Can somebody help me or redirect me to a tutorial, blog, or book? What's the best way to

Re: Error joining dataframes

2016-05-18 Thread Divya Gehlot
Can you try var df_join = df1.join(df2,df1( "Id") ===df2("Id"), "fullouter").drop(df1("Id")) On May 18, 2016 2:16 PM, "ram kumar" wrote: I tried scala> var df_join = df1.join(df2, "Id", "fullouter") :27: error: type mismatch; found : String("Id") required:

[Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Apache Phoenix 4.4. As Spark 1.5.2 doesn't support subqueries in where conditions (https://issues.apache.org/jira/browse/SPARK-4226), is there any alternative way to check foreign key constraints? Would really appreciate the help. Thanks, Divya
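A common workaround is to replace the subquery with a left outer join plus a null filter, which finds child rows whose foreign key has no matching parent; a minimal sketch with hypothetical frames and key columns:

    val orphans = childDF
      .join(parentDF, childDF("fk_id") === parentDF("id"), "left_outer")
      .filter(parentDF("id").isNull)             // violations of the foreign key constraint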

[Spark 1.5.2] Spark dataframes vs sql query -performance parameter ?

2016-05-03 Thread Divya Gehlot
Hi, I am interested to know on which parameters we can say Spark data frames are better than SQL queries. Would be grateful if somebody can explain it to me with use cases. Thanks, Divya

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Divya Gehlot
Hi, I am getting a similar error: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile when I tried to build the Phoenix project using Maven. Maven version: 3.3 Java version: 1.7_67 Phoenix: downloaded latest master from GitHub If anybody finds the resolution

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread Divya Gehlot
Divya On 4 May 2016 at 21:31, sunday2000 <2314476...@qq.com> wrote: > Check your javac version, and update it. > > > -- Original message ------ > *From:* "Divya Gehlot";<divya.htco...@gmail.com>; > *Sent:* Wednesday, 4 May 2016, 11:25 AM >

package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
Hi, Is there any package or project in Spark/Scala which supports data quality checks? For instance, checking null values or foreign key constraints. Would really appreciate it if somebody has already done it and is happy to share, or knows of any open source package. Thanks, Divya

Fwd: package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ I am looking for something similar to above solution . -- Forwarded message -- From: "Divya Gehlot" <divya.htco...@gmail.com> Date: May 5, 2016 6:51 PM Subject:

Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
, Ted Yu <yuzhih...@gmail.com> wrote: > I am afraid there is no such API. > > When persisting, you can specify StorageLevel : > > def persist(newLevel: StorageLevel): this.type = { > > Can you tell us your use case ? > > Thanks > > On Thu, May 5, 20

[Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
Hi, How can I get and set the storage level for Dataframes like RDDs, as mentioned in the following book link https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html Thanks, Divya
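DataFrame.persist accepts a StorageLevel just like an RDD, and (as the reply above notes) Spark 1.5 has no public getter to read the level back; a minimal sketch:

    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    // ... run the queries that benefit from caching ...
    df.unpersist()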

Found Data Quality check package for Spark

2016-05-06 Thread Divya Gehlot
Hi, I just stumbled upon a data quality check package for Spark: https://github.com/FRosner/drunken-data-quality Has anybody used it? Would really appreciate the feedback. Thanks, Divya

[Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Divya Gehlot
Hi, I tried configuring logs to be written to file for the Spark driver and executors. I have two separate log4j properties files for the Spark driver and executor respectively. It's writing logs for the Spark driver, but for the executor logs I am getting the below error: java.io.FileNotFoundException:
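The FileNotFoundException typically means the executor JVMs were pointed at a path that only exists on the driver machine; shipping the properties file with --files and referencing it by bare file name is the usual fix (all paths below are hypothetical):

    spark-shell --master yarn-client \
      --files /local/path/log4j-executor.properties \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/local/path/log4j-driver.properties"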

Fwd: [Help]:Strange Issue :Debug Spark Dataframe code

2016-04-17 Thread Divya Gehlot
Reposting again as I am unable to find the root cause of where things are going wrong. Experts, please help. -- Forwarded message -- From: Divya Gehlot <divya.htco...@gmail.com> Date: 15 April 2016 at 19:13 Subject: [Help]:Strange Issue :Debug Spark Dataframe code To: "
