Unsubscribe

2021-08-30 Thread Dhaval Patel

Re: Future timeout

2020-07-21 Thread Dhaval Patel
Just a suggestion: it looks like it is timing out when you are broadcasting a big object. Generally it is not advisable to do so; if you can get rid of that, the program may behave more consistently. On Tue, Jul 21, 2020 at 3:17 AM Piyush Acharya wrote: > spark.conf.set("spark.sql.broadcastTimeout", ##) > >
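
A minimal sketch of the two knobs involved here, assuming a Spark 2.x+ session named spark and that the broadcast comes from an automatic broadcast join (the values are only examples):

// Raise the broadcast timeout (in seconds)...
spark.conf.set("spark.sql.broadcastTimeout", "1200")
// ...or stop broadcasting the large relation altogether by disabling automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

If the object is broadcast explicitly (e.g. via a broadcast() hint), removing that hint is the equivalent of the second line.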

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-14 Thread Dhaval Patel
Hi Abhinesh, Since dropDuplicates keeps the first record, you can add an id column to the 1st and 2nd df, then Union -> sort on that id -> drop duplicates. This will ensure records from the 1st df are kept and those from the 2nd are dropped. Regards Dhaval On Sat, Sep 14, 2019 at 4:41 PM Abhinesh Hada wrote: > Hey
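
A minimal sketch of that recipe, assuming two DataFrames df1 and df2 with the same schema and a business key column named "key"; note that dropDuplicates keeping the first row after a sort is observed behaviour rather than a documented guarantee:

import org.apache.spark.sql.functions.lit

val tagged1 = df1.withColumn("src", lit(1))   // rows to keep on conflict
val tagged2 = df2.withColumn("src", lit(2))   // rows to drop on conflict

val deduped = tagged1.union(tagged2)
  .orderBy("src")              // 1st df's rows come first
  .dropDuplicates("key")       // keeps the first row per key
  .drop("src")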

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Dhaval Patel
Hi Charles, Can you check whether any of the cases related to the output directory and checkpoint location mentioned in the link below apply to your case? https://kb.databricks.com/streaming/file-sink-streaming.html Regards Dhaval On Wed, Sep 11, 2019 at 9:29 PM Burak Yavuz wrote: > Hey Charles, >
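
For reference, a minimal sketch of a Kafka-to-file-sink query with explicit output and checkpoint locations, assuming Structured Streaming and a SparkSession named spark; broker, topic and paths are assumptions. The KB article above is largely about stale or shared checkpoint/output directories, so each query should get its own pair:

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val query = stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/data/out/events")               // output directory
  .option("checkpointLocation", "/data/chk/events")  // unique per query
  .start()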

Re: Split RDD by key and save to different files

2016-09-07 Thread Dhaval Patel
In order to do that, first of all you need to key the RDD by key, and then use saveAsHadoopFile in this way: saveAsHadoopFile(location, classOf[KeyClass], classOf[ValueClass], classOf[PartitionOutputFormat]), where PartitionOutputFormat extends MultipleTextOutputFormat. Sample for
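
A minimal sketch of that pattern, assuming an RDD of comma-separated lines named rdd; the class name and output path are only placeholders:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Writes each record into a sub-directory named after its key.
class PartitionOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  // Suppress the key in the output line so only the value is written.
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

val keyed = rdd.map(line => (line.split(",")(0), line))  // key the RDD by the split field
keyed.saveAsHadoopFile("/tmp/split-output",
  classOf[String], classOf[String], classOf[PartitionOutputFormat])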

Error while storing datetime read from MySQL back to MySQL

2016-09-07 Thread Dhaval Patel
I am facing an error while trying to save a DataFrame containing a datetime field into a MySQL table. What I am doing is: 1. Reading data from a MySQL table which has fields of type datetime. 2. Processing the DataFrame. 3. Storing/saving the DataFrame back into another MySQL table. While creating the table, spark
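
The full message is truncated here, but a hedged sketch of that read/process/write flow with Spark 1.4-era APIs follows; the URL, table names and the "dt" column are assumptions. One common way to keep control over the column types is to pre-create the target table in MySQL and append into it, rather than letting Spark create it:

import java.util.Properties
import org.apache.spark.sql.functions.col

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")

val url = "jdbc:mysql://db-host:3306/mydb"
val src = sqlContext.read.jdbc(url, "source_table", props)        // 1. read
val processed = src.withColumn("dt", col("dt").cast("timestamp")) // 2. process
processed.write.mode("append").jdbc(url, "target_table", props)   // 3. write back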

Re: How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-28 Thread Dhaval Patel
Did you try implementing MultipleTextOutputFormat and using saveAsHadoopFile with keyClass, valueClass and OutputFormat instead of the default parameters? You need to implement toString for your keyClass and valueClass in order to get a field separator other than the defaults. Regards Dhaval On Tue, Jun
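
A hedged sketch of the toString-based approach for a custom field separator, assuming a DataFrame df and a "|" delimiter. TextOutputFormat calls toString on non-Writable values, but the \n record separator itself is baked into its RecordWriter, so replacing that would additionally need a custom OutputFormat/RecordWriter:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.TextOutputFormat

// Wrapper whose toString controls how each record is rendered.
class DelimitedRow(fields: Seq[Any]) {
  override def toString: String = fields.mkString("|")
}

df.rdd
  .map(row => (NullWritable.get(), new DelimitedRow(row.toSeq)))
  .saveAsHadoopFile("/tmp/out", classOf[NullWritable], classOf[DelimitedRow],
    classOf[TextOutputFormat[NullWritable, DelimitedRow]])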

[sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-06 Thread Dhaval Patel
I have been struggling with this error for the past 3 days and have tried all possible ways/suggestions people have provided on stackoverflow and here in this group. I am trying to read a parquet file using sparkR and convert it into an R dataframe for further usage. The file size is not that

Resource allocation issue - is it possible to submit a new job in existing application under a different user?

2015-09-03 Thread Dhaval Patel
I am accessing a shared cluster mode Spark environment. However, there is an existing application (SparkSQL/Thrift Server), running under a different user, that occupies all available cores. Please see the attached screenshot to get an idea of the current resource utilization. Is there a way I can use

Re: Resource allocation issue - is it possible to submit a new job in existing application under a different user?

2015-09-03 Thread Dhaval Patel
> If it's running the thrift server from hive, it's got a SQL API for you to > connect to... > > On 3 Sep 2015, at 17:03, Dhaval Patel <dhaval1...@gmail.com> wrote: > > I am accessing a shared cluster mode Spark environment. However, there is > an existing application (Spark
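
On the "SQL API" point: the Thrift server speaks the HiveServer2 JDBC protocol, so a plain JDBC client can run queries against the already-running application. A minimal sketch, where the host, port and table are assumptions and the hive-jdbc driver must be on the classpath:

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "someuser", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
while (rs.next()) println(rs.getLong(1))
conn.close()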

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-26 Thread Dhaval Patel
.getTime)).getDays) df.withColumn(diff, dateDiff(df(day2), df(day1))).show() or you can write a SQL query using HiveQL's datediff function. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel dhaval1...@gmail.com wrote
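
A hedged sketch of both approaches in Scala, assuming a DataFrame df with date columns day1 and day2:

import org.apache.spark.sql.functions.{datediff, udf}

// Built-in datediff (Spark 1.5+):
df.withColumn("diff", datediff(df("day2"), df("day1"))).show()

// UDF alternative for older versions (simple millisecond arithmetic, ignores DST edge cases):
val dateDiff = udf((d2: java.sql.Date, d1: java.sql.Date) =>
  ((d2.getTime - d1.getTime) / (24L * 60 * 60 * 1000)).toInt)
df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()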

Re: DataFrame/JDBC very slow performance

2015-08-26 Thread Dhaval Patel
Thanks Michael, much appreciated! Nothing should be held in memory for a query like this (other than a single count per partition), so I don't think that is the problem. There is likely an error buried somewhere. For your above comments - I don't get any error but just get NULL as the return

DataFrame/JDBC very slow performance

2015-08-24 Thread Dhaval Patel
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). When I tried with a BIG table (5B records), no results were returned upon completion of the query. I am using Spark 1.4.1, and it is set up on a very powerful machine (2 CPUs, 24 cores,
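
For reference, a hedged sketch of a partitioned JDBC read for Spark 1.4, so the scan is spread over several connections instead of funnelled through a single one; the URL, table, partition column and bounds are assumptions:

val df = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:teradata://td-host/DATABASE=mydb",
  "dbtable"         -> "big_table",
  "partitionColumn" -> "id",          // numeric column to split on
  "lowerBound"      -> "1",
  "upperBound"      -> "100000000",
  "numPartitions"   -> "24"
)).load()

df.count()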

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
* that has been created in the current session. Does anyone know if any such command is available? Something similar to SparkSQL listing all temp tables: show tables; Thanks, Dhaval On Thu, Aug 20, 2015 at 12:49 PM, Dhaval Patel dhaval1...@gmail.com wrote: Hi: I have been working on a few

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
(e.g. like in pandas or R. df$new_col = 'new col value')? Thanks, Dhaval On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com wrote: new_df.withColumn('SVCDATE2', (new_df.next_diag_date-new_df.SVCDATE).days).show() +---+--+--+ | PATID| SVCDATE

How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
new_df.withColumn('SVCDATE2', (new_df.next_diag_date-new_df.SVCDATE).days).show()
+-----------+----------+--------------+
|      PATID|   SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
#pyspark.sql.functions.datediff Returns the number of days from start to end. df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2']) df.select(datediff(df.d2, df.d1).alias('diff')).collect()[Row(diff=32)] New in version 1.5. On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel dhaval1

Re: SparkSQL concerning materials

2015-08-20 Thread Dhaval Patel
Or if you're a python lover then this is a good place - https://spark.apache.org/docs/1.4.1/api/python/pyspark.sql.html# On Thu, Aug 20, 2015 at 10:58 AM, Ted Yu yuzhih...@gmail.com wrote: See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package Cheers

How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Hi: I have been working on a few examples using zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that have been created in the current session. Does anyone know if any such command is available? Something similar to SparkSQL listing all temp tables: show
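
As far as I know, Spark only tracks registered temp tables, not every DataFrame/RDD variable in a session, so the closest equivalents are the catalog calls below (Spark 1.3+):

sqlContext.tables().show()               // tableName / isTemporary as a DataFrame
sqlContext.tableNames().foreach(println) // just the names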