Collecting large dataset

2019-09-05 Thread Rishikesh Gawade
Hi. I have been trying to collect a large dataset (about 2 GB in size, 30 columns, more than a million rows) onto the driver side. I am aware that collecting such a huge dataset isn't suggested; however, the application within which the Spark driver is running requires that data. While collecting
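One way to bring a large result to the driver without materializing all of it at once is to iterate over it partition by partition instead of calling `collect()`. A minimal sketch (the processing logic is a placeholder; requires a running Spark environment):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.JavaConverters._

object CollectLarge {
  // Sketch: stream a large DataFrame to the driver one partition at a time.
  // toLocalIterator() holds only one partition's rows in driver memory,
  // unlike collect(), which materializes the entire dataset at once.
  def process(df: DataFrame): Unit = {
    val rows: Iterator[Row] = df.toLocalIterator().asScala // Java -> Scala iterator
    rows.foreach { row =>
      // placeholder: handle each row here (e.g. serialize, forward, write out)
      println(row)
    }
  }
}
```

If a full `collect()` is unavoidable, `spark.driver.maxResultSize` and the driver's heap (`spark.driver.memory`) usually both need to be raised above the dataset's in-memory size.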

How to combine all rows into a single row in DataFrame

2019-08-19 Thread Rishikesh Gawade
Hi All, I have been trying to serialize a DataFrame in protobuf format. So far, I have been able to serialize every row of the DataFrame by using the map function, with the serialization logic inside the lambda function. The resultant DataFrame consists of rows in serialized
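Given a DataFrame where each record already holds one serialized row, the per-row results can be gathered into a single row with a global aggregation. A sketch, assuming a binary column named "bytes" (the column name is hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list}

// Sketch: `serialized` is assumed to have one binary column "bytes",
// holding the protobuf-serialized form of each original row.
// agg() without groupBy() performs a global aggregation, so the result
// is a single row whose "all_bytes" column is an array of byte arrays.
def combineRows(serialized: DataFrame): DataFrame =
  serialized.agg(collect_list(col("bytes")).as("all_bytes"))
```

Note that `collect_list` pulls all values into a single task, so this only makes sense when the combined payload comfortably fits in one executor's memory.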

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Do you configure the same options there? Can you share some code? On 07.08.2019 at 08:50, Rishikesh Gawade wrote: Hi. I am using Spark 2.3.2 and Hive 3.1.0. Even if I use parquet files the result would be the same, because after all sparkSQL isn't able to d

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade wrote: H

Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Rishikesh Gawade
Hi. I have built a Hive external table on top of a directory 'A' which has data stored in ORC format. This directory has several subdirectories inside it, each of which contains the actual ORC files. These subdirectories are actually created by spark jobs which ingest data from other sources and
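A common cause here is that the reader does not recurse into subdirectories by default. A sketch of settings that often help in this situation; behavior varies by Spark/Hive version, and the table name below is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: make Spark SQL read ORC files that live in subdirectories
// under a Hive external table's location. Settings are version-dependent;
// treat this as a starting point, not a guaranteed fix.
val spark = SparkSession.builder()
  .appName("hive-subdirs")
  .config("spark.sql.hive.convertMetastoreOrc", "false") // use Hive's ORC reader
  .enableHiveSupport()
  .getOrCreate()

// Hive/MapReduce settings that enable recursive directory listing.
spark.sql("SET mapred.input.dir.recursive=true")
spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
spark.sql("SET hive.mapred.supports.subdirectories=true")

spark.sql("SELECT COUNT(*) FROM mydb.table_a").show() // placeholder table
```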

Re: Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
To put it simply, what are the configurations that need to be done on the client machine so that it can run the driver on itself and executors on the Spark-on-YARN cluster nodes? On Mon, Apr 22, 2019, 8:22 PM Rishikesh Gawade wrote: Hi. I have been experiencing trouble while trying to c

Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
Hi. I have been experiencing trouble while trying to connect to a Spark cluster remotely. This Spark cluster is configured to run using YARN. Can anyone guide me or provide any step-by-step instructions for connecting remotely via spark-shell? Here's the setup that I am using: The Spark cluster is
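The usual client-side requirement for a remote YARN-backed cluster is a copy of the cluster's Hadoop configuration plus a matching Spark distribution. A sketch; all paths and values are placeholders:

```shell
# Sketch: minimal client-machine setup for spark-shell against a remote
# YARN cluster. Assumes the client can reach the ResourceManager, NameNode,
# and NodeManager hosts over the network.

# 1) Copy the cluster's Hadoop config files (core-site.xml, hdfs-site.xml,
#    yarn-site.xml) to the client machine and point Spark at them.
export HADOOP_CONF_DIR=/path/to/cluster-conf
export YARN_CONF_DIR=$HADOOP_CONF_DIR

# 2) Launch in YARN client mode: the driver runs on this machine,
#    executors run on the cluster's NodeManagers.
spark-shell --master yarn --deploy-mode client \
  --num-executors 4 --executor-memory 2g
```

The client's Spark version should match the cluster's, and for client mode the executors must be able to connect back to the driver's host, so firewalls/NAT between the two are a frequent source of trouble.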

How to use same SparkSession in another app?

2019-04-16 Thread Rishikesh Gawade
Hi. I wish to use a SparkSession created by one app in another app so that I can use the DataFrames belonging to that session. Is it possible to use the same SparkSession in another app? Thanks, Rishikesh
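A SparkSession and its DataFrames live inside a single JVM, so a second application cannot attach to them directly; the usual pattern is to persist the data somewhere both applications can read. A sketch, with placeholder paths and variable names:

```scala
// Sketch: sharing data between two Spark applications by persisting it,
// since a SparkSession cannot be handed across JVM/application boundaries.
// `dfA`, `sparkB`, and the HDFS path are placeholders.

// Application A: write the DataFrame to shared storage.
dfA.write.mode("overwrite").parquet("hdfs:///shared/my_df")

// Application B: create its own SparkSession and read the same data.
val dfB = sparkB.read.parquet("hdfs:///shared/my_df")
```

Other common options for this problem are registering the data as a Hive table both apps can query, or serving it through the Spark Thrift Server.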

Error: NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT while running a Spark-Hive Job

2018-04-16 Thread Rishikesh Gawade
gest me the required changes. Also, if it's the case that I might have misconfigured Spark and Hive, please suggest the changes in configuration; a link guiding through all necessary configs would also be appreciated. Thank you in anticipation. Regards, Rishikesh Gawade
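`NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT` typically indicates a Hive version mismatch: that field existed in the Hive 1.2.x classes Spark bundles but was removed in later Hive releases, so putting newer Hive jars on Spark's classpath triggers it. A sketch of the usual remedy, pointing Spark at a compatible metastore client instead (version number and paths are examples; check the metastore versions supported by your Spark release):

```
# Sketch (spark-defaults.conf): tell Spark which Hive metastore version to
# talk to, rather than mixing newer Hive jars into Spark's own classpath.
spark.sql.hive.metastore.version  3.1.0
spark.sql.hive.metastore.jars     /opt/hive/lib/*:/opt/hadoop/share/hadoop/common/*
```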

ERROR: Hive on Spark

2018-04-15 Thread Rishikesh Gawade
then please suggest an ideal way to read Hive tables on Hadoop in Spark using Java. A link to a webpage having relevant info would also be appreciated. Thank you in anticipation. Regards, Rishikesh Gawade

Accessing Hive Database (On Hadoop) using Spark

2018-04-15 Thread Rishikesh Gawade
pache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) I request you to please check this and if anything is wrong then please suggest an ideal way to read Hive tables on Hadoop in Spark using Java. A link to a webpage having relevant info would also be appreciated. Thank you in anticipation. Regards, Rishikesh Gawade
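For reading Hive tables on Hadoop from a Spark application, the standard approach is a Hive-enabled SparkSession. The thread asks for Java; the sketch below is Scala, and the Java builder API is nearly identical. Database/table names are placeholders, and `hive-site.xml` must be visible on the application's classpath (or in `$SPARK_HOME/conf`):

```scala
import org.apache.spark.sql.SparkSession

object ReadHiveTable {
  def main(args: Array[String]): Unit = {
    // Sketch: a SparkSession wired to the Hive metastore.
    val spark = SparkSession.builder()
      .appName("read-hive")
      .enableHiveSupport() // requires hive-site.xml to be discoverable
      .getOrCreate()

    // Query a Hive table through Spark SQL (names are placeholders).
    val df = spark.sql("SELECT * FROM mydb.mytable")
    df.show(10)

    spark.stop()
  }
}
```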