Re: Triggering sql on AWS S3 via Apache Spark

2018-10-23 Thread Divya Gehlot
Hi Omer, Here are a couple of solutions which you can implement for your use case: *Option 1:* you can mount the S3 bucket as a local file system. Here are the details: https://cloud.netapp.com/blog/amazon-s3-as-a-file-system *Option 2:* You can use Amazon Glue for your use case; here are the

Re: Fastest way to drop useless columns

2018-05-31 Thread Divya Gehlot
you can try the dropDuplicates function https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala On 31 May 2018 at 16:34, wrote: > Hi there ! > > I have a potentially large dataset ( regarding number of rows and cols ) > > And I want to find the fastest way

Re: [Spark Java] Add new column in DataSet based on existed column

2018-03-28 Thread Divya Gehlot
Hi, Here is an example snippet in Scala:
// Convert to a Date type
val timestamp2datetype: (Column) => Column = (x) => { to_date(x) }
df = df.withColumn("date", timestamp2datetype(col("end_date")))
Hope this helps! Thanks, Divya On 28 March 2018 at 15:16, Junfeng Chen

[Error :] RDD TO Dataframe Spark Streaming

2018-01-31 Thread Divya Gehlot
Hi, I am getting the below error when creating a DataFrame from a Twitter streaming RDD:
val sparkSession: SparkSession = SparkSession.builder
  .appName("twittertest2")
  .master("local[*]")
  .enableHiveSupport()

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
DataSource APIs to build streaming > sources are not public yet, and are in flux. > > 2. Use Kafka/Kinesis as an intermediate system: Write something simple > that uses Twitter APIs directly to read tweets and write them into > Kafka/Kinesis. And then just read from Kafka/Kinesis
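A minimal sketch of the second suggestion (reading the tweets back out of Kafka with Structured Streaming), assuming a SparkSession named spark, the spark-sql-kafka-0-10 package on the classpath, and a hypothetical broker and topic name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("tweets-from-kafka").getOrCreate()
val tweets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
  .option("subscribe", "tweets")                          // placeholder topic
  .load()
// the Kafka value column is bytes; cast it to a string before processing
val text = tweets.selectExpr("CAST(value AS STRING) AS tweet_text")
val query = text.writeStream.format("console").start()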

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
Hi, I see. Does that mean Spark Structured Streaming doesn't work with Twitter streams? I could see people used Kafka or other streaming tools and then used Spark to process the data in Structured Streaming. The below doesn't work directly with a Twitter stream until I set up Kafka? > import

Spark Structured Streaming for Twitter Streaming data

2018-01-30 Thread Divya Gehlot
Hi, I am exploring Spark Structured Streaming. When I turned to the internet to understand it, I found it is more integrated with Kafka or other streaming tools like Kinesis. I couldn't find where we can use the Spark Streaming API for Twitter streaming data. Would be grateful if anybody has used

[Error] Python version mismatch in CDH cluster when running pyspark job

2017-06-16 Thread Divya Gehlot
Hi, I have a CDH cluster and am running a PySpark script in client mode. There are different Python versions installed on the client and worker nodes, and I was getting a Python version mismatch error. To resolve this issue I followed the below Cloudera document

Spark job stopping abruptly

2017-03-07 Thread Divya Gehlot
Hi, I have a Spark standalone cluster on AWS EC2 and recently my Spark streaming jobs have been stopping abruptly. When I check the logs I find this: 17/03/07 06:09:39 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered. 17/03/07 06:09:39 ERROR

query on Spark Log directory

2017-01-05 Thread Divya Gehlot
Hi, I am using an EMR machine and I can see the Spark log directory has grown to 4G. File name: spark-history-server.out. Need advice on how I can reduce the size of the above-mentioned file. Is there a config property which can help me? Thanks, Divya

Re: Location for the additional jar files in Spark

2016-12-27 Thread Divya Gehlot
Hi Mich , Have you set SPARK_CLASSPATH in Spark-env.sh ? Thanks, Divya On 27 December 2016 at 17:33, Mich Talebzadeh wrote: > When one runs in Local mode (one JVM) on an edge host (the host user > accesses the cluster), it is possible to put additional jar file

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Divya Gehlot
Hi Mich, Can you try placing these jars in the Spark classpath? It should work. Thanks, Divya On 22 December 2016 at 05:40, Mich Talebzadeh wrote: > This works with Spark 2 with the Oracle jar file added to > > $SPARK_HOME/conf/spark-defaults.conf

Re: How to get recent value in spark dataframe

2016-12-20 Thread Divya Gehlot
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html Hope this helps Thanks, Divya On 15 December 2016 at 12:49, Milin korath wrote: > Hi > > I have a spark data frame with following structure > > id flag price date > a 0
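A minimal sketch of the window-function approach for picking the most recent row per id, assuming the columns from the quoted question (id, date) and, on Spark 1.6, a HiveContext-backed DataFrame:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, col}

// rank rows within each id by date, newest first, then keep only the top row
val w = Window.partitionBy("id").orderBy(col("date").desc)
val latest = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")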

Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-17 Thread Divya Gehlot
I am not a PySpark person, but from the errors I could figure out that your Spark application is having memory issues. Are you collecting the results to the driver at any point in time, or have you configured too little memory for the nodes? And if you are using DataFrames then there is an issue raised in

Re: spark reshape hive table and save to parquet

2016-12-14 Thread Divya Gehlot
you can use udfs to do it http://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe Hope it will help. Thanks, Divya On 9 December 2016 at 00:53, Anton Kravchenko wrote: > Hello, > > I wonder if there is a way (preferably

Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-14 Thread Divya Gehlot
It depends on the use case... Spark always depends on resource availability. As long as you have the resources to accommodate them, you can run as many Spark/Spark Streaming applications as you like. Thanks, Divya On 15 December 2016 at 08:42, shyla deshpande wrote: > How many Spark

Re: Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Divya Gehlot
Hi, You can use the Column functions provided by Spark API https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html Hope this helps . Thanks, Divya On 17 November 2016 at 12:08, 颜发才(Yan Facai) wrote: > Hi, > I have a sample, like: >

Re: how to extract arraytype data to file

2016-10-18 Thread Divya Gehlot
http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element Hope this helps Thanks, Divya On 19 October 2016 at 11:35, lk_spark wrote: > hi,all: > I want to read a json file and search it by sql . > the data struct

Re: tutorial for access elements of dataframe columns and column values of a specific rows?

2016-10-18 Thread Divya Gehlot
Can you please elaborate your use case? On 18 October 2016 at 15:48, muhammet pakyürek wrote: > *From:* muhammet pakyürek > *Sent:* Monday, October 17, 2016 11:51 AM > *To:* user@spark.apache.org > *Subject:*

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Divya Gehlot
If my understanding of your query is correct: in Spark, DataFrames are immutable, so you can't update a DataFrame in place. You have to derive a new DataFrame to "update" the current one. Thanks, Divya On 17 October 2016 at 09:50, Mungeol Heo wrote: > Hello, everyone. > > As
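A minimal sketch of that "derive a new DataFrame" pattern; the column names (flag, price) are hypothetical:

import org.apache.spark.sql.functions.{when, col}

// instead of mutating df, build a new DataFrame with the changed column
val updated = df.withColumn("price",
  when(col("flag") === 1, col("price") * 1.1).otherwise(col("price")))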

Re: converting hBaseRDD to DataFrame

2016-10-11 Thread Divya Gehlot
Hi Mich, you can also create a dataframe from an RDD in the below manner:
val df = sqlContext.createDataFrame(rdd, schema)
val df = sqlContext.createDataFrame(rdd)
The below article may also help you: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/ On 11 October 2016 at
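A self-contained sketch of the first form, assuming a spark-shell style sc/sqlContext and toy row data:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// explicit schema for an RDD[Row]
val schema = StructType(Seq(
  StructField("rowkey", StringType, nullable = true),
  StructField("value",  StringType, nullable = true)))
val rdd = sc.parallelize(Seq(Row("r1", "v1"), Row("r2", "v2")))
val df  = sqlContext.createDataFrame(rdd, schema)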

read multiple files

2016-09-27 Thread Divya Gehlot
Hi, The input data files for my Spark job are generated every five minutes, and the file names follow an epoch time convention as below: InputFolder/batch-147495960 InputFolder/batch-147495990 InputFolder/batch-147496020 InputFolder/batch-147496050 InputFolder/batch-147496080
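Not from the thread, but a minimal sketch of reading all such folders at once with a path glob, assuming sc/sqlContext and the spark-csv source:

// Hadoop globs work for both the RDD and DataFrame readers
val allBatches = sc.textFile("InputFolder/batch-*")
val df = sqlContext.read.format("com.databricks.spark.csv").load("InputFolder/batch-*")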

Spark Application Log

2016-09-21 Thread Divya Gehlot
Hi, I have initialised the logging in my Spark app:
/* Initialize Logging */
val log = Logger.getLogger(getClass.getName)
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
log.warn("Some text" + Somemap.size)
When I run my Spark job using spark-submit like

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Divya Gehlot
Spark version plz ? On 21 September 2016 at 09:46, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Yeah I can do all operations on that folder > > On Sep 21, 2016 12:15 AM, "Kevin Mellott" > wrote: > >> Are you able to manually delete the folder below?

Re: 1TB shuffle failed with executor lost failure

2016-09-19 Thread Divya Gehlot
The exit code 52 comes from org.apache.spark.util.SparkExitCode, and it is val OOM=52 - i.e. an OutOfMemoryError Refer https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala On 19 September 2016 at 14:57,

how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Divya Gehlot
Hi, I am on an EMR cluster and my cluster configuration is as below:
Number of nodes (including master node): 3
Memory: 22.50 GB
VCores Total: 16
Active Nodes: 2
Spark version: 1.6.1
Parameters set in spark-defaults.conf:
> spark.executor.instances 2
> spark.executor.cores 8
>
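As a rough sketch of the idea behind running jobs simultaneously: each application should request only a slice of the cluster so others can be scheduled alongside it. The numbers below are illustrative, not a recommendation for this cluster:

import org.apache.spark.{SparkConf, SparkContext}

// leave cores and memory free for a second concurrent application
val conf = new SparkConf()
  .setAppName("job-1")
  .set("spark.executor.instances", "1")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)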

Re: [Error:] viewing Web UI on EMR cluster

2016-09-13 Thread Divya Gehlot
appreciate the help. Thanks, Divya On 13 September 2016 at 15:09, Divya Gehlot <divya.htco...@gmail.com> wrote: > Hi, > Thanks all for your prompt response. > I followed the instruction in the docs EMR SSH tunnel > <https://docs.aws.amazon.com/ElasticMapReduce/latest/Ma

Ways to check Spark submit running

2016-09-13 Thread Divya Gehlot
Hi, Somehow, for the time being, I am unable to view the Spark Web UI and the Hadoop Web UI. I am looking for other ways I can check that my job is running fine, apart from repeatedly checking the current YARN logs. Thanks, Divya

[Error:] viewing Web UI on EMR cluster

2016-09-12 Thread Divya Gehlot
Hi, I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2. When I try to view any of the web UIs of the cluster, either Hadoop or Spark, I get the error "This site can’t be reached". Is anybody using EMR and able to view the Web UI? Could you please share the steps. Would really

Error while calling udf Spark submit

2016-09-08 Thread Divya Gehlot
Hi, I am on Spark 1.6.1. I am getting the below error when I try to call a UDF on my Spark DataFrame column. UDF:
/* get the train line */
val deriveLineFunc: (String => String) = (str: String) => {
  val build_key = str.split(",").toList
  val getValue = if (build_key.length > 1)

Calling udf in Spark

2016-09-08 Thread Divya Gehlot
Hi, Is it necessary to import sqlContext.implicits._ whenever we define and call a UDF in Spark? Thanks, Divya
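For what it's worth, a minimal sketch suggesting the implicits are only needed for things like $"col" and .toDF, not for the UDF itself; the column names here are hypothetical:

import org.apache.spark.sql.functions.{udf, col}

// define and apply a UDF without importing sqlContext.implicits._
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
val result = df.withColumn("name_upper", toUpper(col("name")))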

Getting memory error when starting spark shell but not often

2016-09-06 Thread Divya Gehlot
Hi, I am using EMR 4.7 with Spark 1.6 Sometimes when I start the spark shell I get below error OpenJDK 64-Bit Server VM warning: INFO: > os::commit_memory(0x0005662c, 10632822784, 0) failed; error='Cannot > allocate memory' (errno=12) > # > # There is insufficient memory for the Java

Re: [Spark submit] getting error when using properties file parameter in spark submit

2016-09-06 Thread Divya Gehlot
park Summit 2015 > <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/> > > <http://in.linkedin.com/in/sonalgoyal> > > > > On Tue, Sep 6, 2016 at 4:45 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >&

[Spark submit] getting error when using properties file parameter in spark submit

2016-09-06 Thread Divya Gehlot
Hi, I am getting the below error if I try to use the properties file parameter in spark-submit: Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated at

[Spark-Submit:]Error while reading from s3n

2016-09-06 Thread Divya Gehlot
Hi, I am on EMR 4.7 with Spark 1.6.1. I am trying to read from s3n buckets in Spark. Option 1: If I set up
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3.awsAccessKeyId",

Re: difference between package and jar Option in Spark

2016-09-04 Thread Divya Gehlot
n Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > > Hi, > > > > Would like to know the difference between the --package and --jars > option in > > Spark . > > > > > > > > Thanks, > > Divya >

Re: [Error:] while reading s3 buckets in Spark 1.6 in spark-submit

2016-09-02 Thread Divya Gehlot
385981/how-to-access-s3a-files-from-apache-spark Is it really the issue ? Could somebody help me validate the above ? Thanks, Divya On 1 September 2016 at 16:59, Steve Loughran <ste...@hortonworks.com> wrote: > > On 1 Sep 2016, at 03:45, Divya Gehlot <divya.htco...@gmail.com> wrote

difference between package and jar Option in Spark

2016-09-01 Thread Divya Gehlot
Hi, Would like to know the difference between the --package and --jars option in Spark . Thanks, Divya

Re: Window Functions with SQLContext

2016-09-01 Thread Divya Gehlot
Hi Saurabh, I am also using a Spark 1.6+ version, and when I didn't create a hiveContext it threw the same error. So you have to create a HiveContext to access window functions. Thanks, Divya On 1 September 2016 at 13:16, saurabh3d wrote: > Hi All, > > As per SPARK-11001

Re: Spark build 1.6.2 error

2016-08-31 Thread Divya Gehlot
Which java version are you using ? On 31 August 2016 at 04:30, Diwakar Dhanuskodi wrote: > Hi, > > While building Spark 1.6.2 , getting below error in spark-sql. Much > appreciate for any help. > > ERROR] missing or invalid dependency detected while loading class

[Error:] while reading s3 buckets in Spark 1.6 in spark-submit

2016-08-31 Thread Divya Gehlot
Hi, I am using Spark 1.6.1 on an EMR machine. I am trying to read s3 buckets in my Spark job. When I read through the Spark shell I am able to read them, but when I package the job and run it with spark-submit I get the below error: 16/08/31 07:36:38 INFO ApplicationMaster: Registered signal

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread Divya Gehlot
Can you please check the column order of all the data sets in the unionAll operations? Are they in the same order? On 9 August 2016 at 02:47, max square wrote: > Hey guys, > > I'm trying to save Dataframe in CSV format after performing unionAll > operations on it. > But I get this
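A minimal sketch of aligning the column order before a unionAll (Spark 1.x resolves unions by position); df1 and df2 are hypothetical:

import org.apache.spark.sql.functions.col

// reorder df2's columns to match df1, then union
val aligned = df2.select(df1.columns.map(col): _*)
val unioned = df1.unionAll(aligned)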

[Spark 1.6]-increment value column based on condition + Dataframe

2016-08-09 Thread Divya Gehlot
Hi, I have a column having values like:
Value
30 12 56 23 12 16 12 89 12 5 6 4 8
I need to create another column (if col("value") > 30 then 1 else col("value") < 30):
newColValue
0 1 0 1 2 3 4 0 1 2 3 4 5
How can I create such an increment column? The grouping is happening based on some other cols

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
; https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> http://talebzadehmich.wordpress.com >> >> *Disclaimer:* Use it at your own risk. An

[Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
Hi, I have a use case where I need to use the or (||) operator in a filter condition. It seems it's not working: it takes the condition before the operator and ignores the other filter condition after the or operator. Has anybody faced a similar issue? Pseudo code:
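A minimal sketch of the usual fix, parenthesising each comparison so that === binds before ||; the column names and values are hypothetical:

import org.apache.spark.sql.functions.col

// without the inner parentheses, operator precedence can combine the predicates incorrectly
val filtered = df.filter((col("city") === "SG") || (col("city") === "MY"))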

Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
ri, Aug 5, 2016 at 12:16 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >> Hi, >> I am working with Spark 1.6 with scala and using Dataframe API . >> I have a use case where I need to compare two rows and add entry in the >> new column based on the lo

[Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
Hi, I am working with Spark 1.6 with Scala and using the DataFrame API. I have a use case where I need to compare two rows and add an entry in the new column based on a lookup table. For example, my DF looks like:
col1      col2      newCol1
street1   person1
street2   person1

Spark GraphFrames

2016-08-01 Thread Divya Gehlot
Hi, Has anybody worked with GraphFrames? Please let me know, as I need to know the real-world scenarios where it can be used. Thanks, Divya

Re: FileUtil.fullyDelete does ?

2016-07-26 Thread Divya Gehlot
eUtil.html#fullyDelete(java.io.File) > > On Tue, Jul 26, 2016 at 12:09 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >> Resending to right list >> ------ Forwarded message -- >> From: "Divya Gehlot" <divya.htco...@gmail.c

FileUtil.fullyDelete does ?

2016-07-26 Thread Divya Gehlot
Hi, when I am using the FileUtil.copyMerge function:
val file = "/tmp/primaryTypes.csv"
FileUtil.fullyDelete(new File(file))
val destinationFile = "/tmp/singlePrimaryTypes.csv"
FileUtil.fullyDelete(new File(destinationFile))
val counts = partitions.reduceByKey { case (x, y) => x +

[Error] : Save dataframe to csv using Spark-csv in Spark 1.6

2016-07-24 Thread Divya Gehlot
Hi, I am getting the below error when I try to save a dataframe using Spark-CSV:
> final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path)
> java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at

Create dataframe column from list

2016-07-22 Thread Divya Gehlot
Hi, Can somebody help me with creating a dataframe column from a Scala list? Would really appreciate the help. Thanks, Divya
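Not from the thread, but a minimal sketch of one way to turn a Scala list into a one-column DataFrame, assuming a spark-shell style sc/sqlContext; the values are illustrative:

import sqlContext.implicits._

val cities = List("SIN", "KUL", "BKK")
val citiesDF = sc.parallelize(cities).toDF("city")   // single-column DataFrame built from the list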

add hours to from_unixtimestamp

2016-07-21 Thread Divya Gehlot
Hi, I need to add 8 hours to from_unixtime:
df.withColumn(from_unixtime(col("unix_timestamp"), fmt)) as "date_time"
I am trying a Joda-Time function:
def unixToDateTime(unix_timestamp: String): DateTime = {
  val utcTS = new DateTime(unix_timestamp.toLong * 1000L) + 8.hours
  return utcTS
}
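Not from the thread, but a sketch of doing the shift purely in Spark SQL by adding the seconds before formatting; fmt is assumed to be a pattern such as "yyyy-MM-dd HH:mm:ss":

import org.apache.spark.sql.functions.{from_unixtime, col}

// shift the epoch seconds by 8 hours, then format
val withLocalTime = df.withColumn("date_time", from_unixtime(col("unix_timestamp") + 8 * 3600, fmt))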

calculate time difference between consecutive rows

2016-07-20 Thread Divya Gehlot
I have a dataset of times as shown below:
Time1
07:30:23
07:34:34
07:38:23
07:39:12
07:45:20
I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *Spark* helps in finding it, but it's giving me *null* in the result set. Would really appreciate the help.
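A minimal sketch of the lag approach, assuming the Time1 column above and, on Spark 1.6, a HiveContext-backed DataFrame; note the first row has no previous row, so its diff is null by design:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, unix_timestamp, col}

val w = Window.orderBy("Time1")
val secs = unix_timestamp(col("Time1"), "HH:mm:ss")
val withDiff = df.withColumn("diff_seconds", secs - lag(secs, 1).over(w))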

getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-20 Thread Divya Gehlot
Hi,
val lags = sqlContext.sql("select *, (unix_timestamp(time1,'$timeFmt') - lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table")
Instead of the time difference in seconds I am getting null. Would really appreciate the help. Thanks, Divya

Re: write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
2 AM > To: Rabin Banerjee <dev.rabin.baner...@gmail.com> > Cc: Divya Gehlot <divya.htco...@gmail.com>, "user @spark" < > user@spark.apache.org> > Subject: Re: write and call UDF in spark dataframe > > Hi Divya, > > There is already "from_unixtime&qu

difference between two consecutive rows of same column + spark + dataframe

2016-07-20 Thread Divya Gehlot
Hi, I have a dataset of time as shown below:
Time1
07:30:23
07:34:34
07:38:23
07:39:12
07:45:20
I need to find the diff between two consecutive rows. I googled and found the *lag* function in *spark* helps in finding it, but its not giving me *null* in the result set. Would really appreciate

write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, Could somebody share an example of writing and calling a UDF which converts a unix time stamp to a date time? Thanks, Divya
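A minimal sketch; the built-in from_unixtime, mentioned later in the thread, already covers this, but a hand-rolled UDF would look roughly like the below, with a hypothetical column name:

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.functions.{udf, col}

// epoch seconds -> "yyyy-MM-dd HH:mm:ss"
val tsToDateTime = udf((ts: Long) =>
  new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts * 1000L)))
val result = df.withColumn("date_time", tsToDateTime(col("unix_timestamp")))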

Re: Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
hanks, Divya On 18 July 2016 at 23:06, Jacek Laskowski <ja...@japila.pl> wrote: > See broadcast variable. > > Or (just a thought) do join between DataFrames. > > Jacek > > On 18 Jul 2016 9:24 a.m., "Divya Gehlot" <divya.htco...@gmail.com> wrote: &g

Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
Hi, I have created a map by reading a text file:
val keyValueMap = file_read.map(t => t.getString(0) -> t.getString(4)).collect().toMap
Now I have another dataframe where I need to dynamically replace all the keys of the map with their values:
val df_input = reading the file as dataframe
val df_replacekeys
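A minimal sketch of the broadcast-variable suggestion from the thread, applied through a UDF; "key_col" is a hypothetical column name:

import org.apache.spark.sql.functions.{udf, col}

// broadcast the lookup map once, then substitute values for keys
val bMap = sc.broadcast(keyValueMap)
val replaceKey = udf((k: String) => bMap.value.getOrElse(k, k))
val df_replacekeys = df_input.withColumn("key_col", replaceKey(col("key_col")))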

find two consecutive points

2016-07-15 Thread Divya Gehlot
Hi, I have a huge data set similar to the below:
timestamp,fieldid,point_id
1468564189,89,1
1468564090,76,4
1468304090,89,9
1468304090,54,6
1468304090,54,4
I have a configuration file of consecutive points:
1,9
4,6
i.e. 1 and 9 are consecutive points, and similarly 4,6 are consecutive points. Now I need

Re: Error joining dataframes

2016-05-18 Thread Divya Gehlot
Can you try var df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter").drop(df1("Id")) On May 18, 2016 2:16 PM, "ram kumar" wrote: I tried scala> var df_join = df1.join(df2, "Id", "fullouter") :27: error: type mismatch; found : String("Id") required:

[Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Apache Phoenix 4.4. As Spark 1.5.2 doesn't support subqueries in where conditions (https://issues.apache.org/jira/browse/SPARK-4226), is there any alternative way to check foreign key constraints? Would really appreciate the help. Thanks, Divya
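Not from the thread, but one minimal sketch of a foreign-key check without a subquery: find child rows whose key has no match in the parent table. The DataFrames and column names are hypothetical:

// rows in the child table that violate the foreign key
val orphans = childDF.join(parentDF, childDF("parent_id") === parentDF("id"), "left_outer")
  .filter(parentDF("id").isNull)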

best fit - Dataframe and spark sql use cases

2016-05-09 Thread Divya Gehlot
Hi, I would like to know the use cases where DataFrames are the best fit and the use cases where Spark SQL is the best fit, based on one's experience. Thanks, Divya

Found Data Quality check package for Spark

2016-05-06 Thread Divya Gehlot
Hi, I just stumbled upon a data quality check package for Spark: https://github.com/FRosner/drunken-data-quality Has anybody used it? Would really appreciate the feedback. Thanks, Divya

Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
, Ted Yu <yuzhih...@gmail.com> wrote: > I am afraid there is no such API. > > When persisting, you can specify StorageLevel : > > def persist(newLevel: StorageLevel): this.type = { > > Can you tell us your use case ? > > Thanks > > On Thu, May 5, 20

[Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
Hi, How can I get and set the storage level for DataFrames, like for RDDs, as mentioned in the following book link: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html Thanks, Divya
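For the "set" half, a minimal sketch (as the quoted reply above notes, there is no direct getter on a 1.5.2 DataFrame):

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK_SER)   // choose an explicit storage level
// ... run the queries that reuse df ...
df.unpersist()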

Fwd: package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ I am looking for something similar to above solution . -- Forwarded message -- From: "Divya Gehlot" <divya.htco...@gmail.com> Date: May 5, 2016 6:51 PM Subject:

package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
Hi, Is there any package or project in Spark/Scala which supports data quality checks, for instance checking null values or foreign key constraints? Would really appreciate it if somebody has already done it and is happy to share, or knows of any open source package. Thanks, Divya

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread Divya Gehlot
Divya On 4 May 2016 at 21:31, sunday2000 <2314476...@qq.com> wrote: > Check your javac version, and update it. > > > -- Original Message ------ > *From:* "Divya Gehlot";<divya.htco...@gmail.com>; > *Sent:* Wednesday, 4 May 2016, 11:25 AM >

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Divya Gehlot
Hi, Even I am getting a similar error: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile when I try to build the Phoenix project using Maven. Maven version: 3.3, Java version: 1.7_67, Phoenix: downloaded latest master from GitHub. If anybody finds the resolution

[Spark 1.5.2] Spark dataframes vs sql query -performance parameter ?

2016-05-03 Thread Divya Gehlot
Hi, I am interested to know on which parameters we can say Spark DataFrames are better than SQL queries. Would be grateful if somebody can explain this to me with use cases. Thanks, Divya

Re: [Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-29 Thread Divya Gehlot
more evenly. > > 2016-04-25 9:34 GMT+07:00 Divya Gehlot <divya.htco...@gmail.com>: > >> Hi, >> >> After joining two dataframes, saving dataframe using Spark CSV. >> But all the result data is being written to only one part file whereas >> there are 200 p

Re: Can't join same dataframe twice ?

2016-04-27 Thread Divya Gehlot
;a", "b") >> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") >> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), >> df2("b").as("2-b")) >> val df4 = df3.join(df2,

getting ClassCastException when calling UDF

2016-04-27 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 and defined the below UDF:
import org.apache.spark.sql.functions.udf
> val myUdf = (wgts : Int , amnt :Float) => {
>   (wgts*amnt)/100.asInstanceOf[Float]
> }
> val df2 = df1.withColumn("WEIGHTED_AMOUNT",callUDF(udfcalWghts, FloatType,col("RATE"),col("AMOUNT")))
In my
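Not the thread's resolution, but a minimal sketch of the typed udf() form, which avoids passing a return type to callUDF and divides by a Float literal so the result really is a Float:

import org.apache.spark.sql.functions.{udf, col}

val weighted = udf((wgts: Int, amnt: Float) => (wgts * amnt) / 100f)
val df2 = df1.withColumn("WEIGHTED_AMOUNT", weighted(col("RATE"), col("AMOUNT")))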

Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the header by removing the first row; you can use first() or head() to do that. Thanks, Divya On 27 April 2016 at 13:24, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header > and processing of csv files. But it
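A minimal sketch of that approach for a plain RDD (for DataFrames, spark-csv's option("header", "true") handles it); the path is a placeholder:

val lines = sc.textFile("data.csv")
val header = lines.first()                 // grab the header row
val data = lines.filter(_ != header)       // keep everything else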

Re: Can't join same dataframe twice ?

2016-04-26 Thread Divya Gehlot
d this is clear. > > Thought? > > // maropu > > > > > > > > On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com> > wrote: > >> Also, check the column names of df1 ( after joining df2 and df3 ). >> >> Prasad. >>

Can't join same dataframe twice ?

2016-04-25 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. I have a use case where I need to join the same dataframe twice on two different columns, and I am getting a missing-columns error. For instance:
val df1 = df2.join(df3, "Column1")
The below throws the missing-columns error:
val df4 = df1.join(df3, "Column2")
Is this a bug or valid
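One minimal sketch of a workaround, aliasing the right-hand side so its columns can be disambiguated in the second join (not necessarily what the thread settled on):

import org.apache.spark.sql.functions.col

val df3b = df3.as("rhs")
val df4 = df1.join(df3b, df1("Column2") === col("rhs.Column2"))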

[Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-24 Thread Divya Gehlot
Hi, After joining two dataframes, I am saving the result using Spark CSV. But all the result data is being written to only one part file, even though 200 part files are created; the remaining 199 part files are empty. What is the cause of the uneven partitioning? How can I evenly distribute the data?
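Not from the thread, but a minimal sketch of repartitioning the joined result before the write so the output is spread over several files; joinedDF, outputPath and the partition count are illustrative:

// redistribute the rows, then write
val balanced = joinedDF.repartition(50)
balanced.write.format("com.databricks.spark.csv").option("header", "true").save(outputPath)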

Re: Spark DataFrame sum of multiple columns

2016-04-22 Thread Divya Gehlot
An easy way of doing it: newdf = df.withColumn('total', sum(df[col] for col in df.columns)) On 22 April 2016 at 11:51, Naveen Kumar Pokala wrote: > Hi, > > > > Do we have any way to perform Row level operations in spark dataframes. > > > > > > For example, > > > > I have
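The reply above is PySpark; a rough Scala equivalent (assuming all columns are numeric) would be:

import org.apache.spark.sql.functions.col

// fold every column into a single sum expression
val total = df.columns.map(col).reduce(_ + _)
val newdf = df.withColumn("total", total)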

[Ask :]Best Practices - Application logging in Spark 1.5.2 + Scala 2.10

2016-04-21 Thread Divya Gehlot
Hi, I am using Spark with a Hadoop 2.7 cluster. I need to write all my print statements and any errors to a file, for instance some info if a step passed some level, or an error if something is missing in my Spark Scala script. Can somebody help me or point me to a tutorial, blog, or book? What's the best way to

[Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Divya Gehlot
Hi, I tried configuring the logs to be written to file for the Spark driver and executors. I have two separate log4j properties files, for the Spark driver and executor respectively. It is writing the log for the Spark driver, but for the executor logs I am getting the below error: java.io.FileNotFoundException:

Fwd: [Help]:Strange Issue :Debug Spark Dataframe code

2016-04-17 Thread Divya Gehlot
Reposting again as I am unable to find the root cause of where things are going wrong. Experts, please help. -- Forwarded message -- From: Divya Gehlot <divya.htco...@gmail.com> Date: 15 April 2016 at 19:13 Subject: [Help]:Strange Issue :Debug Spark Dataframe code To: "

[Help]:Strange Issue :Debug Spark Dataframe code

2016-04-15 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10. Is there any option other than "explain(true)" to debug Spark DataFrame code? I am facing a strange issue: I have a lookup dataframe and I use it to join another dataframe on different columns. I am getting an *Analysis exception* in the third join. When

Memory needs when using expensive operations like groupBy

2016-04-13 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10, and my Spark job keeps failing with exit code 143; this happens only in one job, where I am using unionAll and a groupBy operation on multiple columns. Please advise me on the options to optimize it. The one option which I am using now is --conf

[ASK]:Dataframe number of column limit in Spark 1.5.2

2016-04-12 Thread Divya Gehlot
Hi, I would like to know: does the Spark DataFrame API have a limit on the number of columns that can be created? Thanks, Divya

[HELP:]Save Spark Dataframe in Phoenix Table

2016-04-07 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop cluster with the below configuration: Spark 1.5.2, HBase 1.1.x, Phoenix 4.4. I am able to connect to Phoenix through a JDBC connection and able to read the Phoenix tables. But while writing the data back to a Phoenix table I am getting the below error:
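For reference, a minimal sketch of the phoenix-spark write path (not necessarily what the thread was using); the table name and zkUrl are placeholders, and the connector only supports overwrite mode:

import org.apache.spark.sql.SaveMode

df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_PHOENIX_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()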

[Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Phoenix both

2016-04-01 Thread Divya Gehlot
Forgot to mention: I am using the DataFrame API for all the operations instead of SQL. -- Forwarded message -- From: Divya Gehlot <divya.htco...@gmail.com> Date: 1 April 2016 at 18:35 Subject: [Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Phoenix both To: "user @s

[Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Phoenix both

2016-04-01 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop 3-node cluster on EC2 with:
*Hadoop* version 2.7.x
*Spark* version 1.5.2
*Phoenix* version 4.4
*HBase* version 1.1.x
*Cluster Statistics*
Data Node 1: OS: redhat7 (x86_64), Cores (CPU): 2 (2), Disk: 20.69GB/99.99GB (20.69% used), Memory: 7.39GB
Data

Change TimeZone Setting in Spark 1.5.2

2016-03-29 Thread Divya Gehlot
Hi, The Spark setup is on a Hadoop cluster. How can I set the Spark timezone to sync with the server timezone? Any idea? Thanks, Divya

Re: [SQL] Two columns in output vs one when joining DataFrames?

2016-03-28 Thread Divya Gehlot
Hi Jacek, The difference is mentioned in the Spark doc itself: "Note that if you perform a self-join using this function without aliasing the input [[DataFrame]]s, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you

[Spark -1.5.2]Dynamic creation of caseWhen expression

2016-03-23 Thread Divya Gehlot
Hi, I have a map collection. I am trying to build a when condition based on the key values, like df.withColumn("ID", when(<condition with map keys>, <values of map>)). How can I do that dynamically? Currently I am iterating over the keysIterator and getting the values:
val keys = myMap.keysIterator.toArray
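A minimal sketch of building the chained when(...) expression by folding over the map; the column name "NoOfDays" is hypothetical:

import org.apache.spark.sql.functions.{when, lit, col}

// start from a null default and wrap one when(...) per map entry around it
val caseExpr = myMap.foldLeft(lit(null).cast("string")) { case (acc, (k, v)) =>
  when(col("NoOfDays") === k, v).otherwise(acc)
}
val withId = df.withColumn("ID", caseExpr)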

find the matching and get the value

2016-03-22 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. My requirement is as below:
df.withColumn("NoOfDays", lit(datediff(df("Start_date"), df("end_date"
Now I have to add one more column where my datediff(Start_date, end_date)) should match the map keys. The map looks like MyMap(1->1D, 2->2D, 3->3M, 4->4W). I want to do

Re: declare constant as date

2016-03-21 Thread Divya Gehlot
Oh my, I am so silly: I can declare it as a string and cast it to a date. My apologies for spamming the mailing list. Thanks, Divya On 21 March 2016 at 14:51, Divya Gehlot <divya.htco...@gmail.com> wrote: > Hi, > In Spark 1.5.2 > Do we have any utility which converts a constant

declare constant as date

2016-03-21 Thread Divya Gehlot
Hi, In Spark 1.5.2, do we have any utility which converts a constant value as shown below, or can we declare a date variable like val start_date: Date = "2015-03-02", i.e. val start_date = "2015-03-02".toDate, like how we convert with toInt or toString? I searched for it but couldn't find it. Thanks, Divya
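Not from the thread, but a minimal sketch of two common ways to get a date "constant":

import org.apache.spark.sql.functions.{to_date, lit}

val startDate = java.sql.Date.valueOf("2015-03-02")     // plain JVM date value
val startDateCol = to_date(lit("2015-03-02"))           // Spark SQL Column of DateType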

Get the number of days dynamically in withColumn

2016-03-20 Thread Divya Gehlot
I have a time stamping table which has data like:
No of Days   ID
1            1D
2            2D
and so on till 30 days. I have another dataframe with a start date and end date. I need to get the difference between these two dates, get the ID from the time stamping table, and do a withColumn.

[Spark-1.5.2]Column renaming with withColumnRenamed has no effect

2016-03-19 Thread Divya Gehlot
Hi, I am adding a new column and renaming it at the same time, but the renaming doesn't have any effect. dffiltered = >

[Error] : dynamically union All + adding new column

2016-03-19 Thread Divya Gehlot
Hi, I am dynamically doing a unionAll and adding a new column too:
val dfresult = dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
val schemaL = dfresult.schema
var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL)
for ((key,values) <- lcrMap)

convert row to map of key as int and values as arrays

2016-03-15 Thread Divya Gehlot
Hi, As I can't add columns from another Dataframe, I am planning to convert my row columns to a map of keys and arrays. As I am new to Scala and Spark, I am trying it like below:
// create an empty map
import scala.collection.mutable.{ArrayBuffer => mArrayBuffer}
var map = Map[Int, mArrayBuffer[Any]]()
def

[How To :]Custom Logging of Spark Scala scripts

2016-03-14 Thread Divya Gehlot
Hi, Can somebody point out how I can configure custom logs for my Spark (Scala) scripts, so that I can see at which level my script failed and why? Thanks, Divya
