Collecting large dataset

2019-09-05 Thread Rishikesh Gawade
Hi.
I have been trying to collect a large dataset (about 2 GB in size, 30
columns, more than a million rows) onto the driver side. I am aware that
collecting such a huge dataset isn't recommended; however, the application
within which the Spark driver is running requires that data.
While collecting the dataframe, the Spark job throws an error:
TaskResultLost (result lost from block manager).
I searched for solutions and set spark.driver.maxResultSize to 0
(unlimited), and also configured spark.blockManager.port and
spark.driver.blockManager.port explicitly. The application within which the
Spark driver is running has a max heap size of 28 GB.
And yet the error still occurs.
There are 22 executors running in my cluster.
Is there any config or necessary step that I am missing before collecting
such a large dataset?
Or is there any other effective approach that would reliably collect data of
this size without failure?
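
For reference, the collect and the relevant settings look roughly like the
sketch below (simplified; the input path and app name are placeholders, and
the property value is the one I currently use):

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("CollectExample") // placeholder app name
        // 0 removes the cap on the total serialized result size returned to the driver
        .config("spark.driver.maxResultSize", "0")
        .getOrCreate();

// Placeholder source; the real dataframe has ~30 columns and over a million rows (~2 GB).
Dataset<Row> df = spark.read().parquet("/path/to/data");

// This is the step that fails with TaskResultLost on my setup.
List<Row> rows = df.collectAsList();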

Thanks,
Rishikesh


How to combine all rows into a single row in DataFrame

2019-08-19 Thread Rishikesh Gawade
Hi All,
I have been trying to serialize a dataframe in protobuf format. So far, I
have been able to serialize every row of the dataframe by using a map
function, with the serialization logic inside the lambda. The resulting
dataframe consists of rows in serialized form (1 row = 1 serialized message).
I now wish to form a single serialized protobuf message for the whole
dataframe, and to do that I need to combine all the serialized rows using
some custom logic, very similar to the logic used in the map operation.
I am assuming this should be possible with a reduce operation on the
dataframe, but I am not sure how to go about it.
Any suggestions or approaches would be much appreciated.
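
To make the intent concrete, here is a rough sketch of what I have and what I
am trying to do (toProtoBytes stands for my existing per-row serialization
logic and mergeProtoBytes for the custom combine logic; both are
placeholders):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Step 1 (already working): one serialized protobuf message per row.
Dataset<byte[]> serialized = df.map(
        (MapFunction<Row, byte[]>) row -> toProtoBytes(row),
        Encoders.BINARY());

// Step 2 (what I am asking about): fold all serialized rows into a single
// message using custom combine logic, analogous to the per-row logic above.
byte[] combined = serialized.reduce(
        (ReduceFunction<byte[]>) (a, b) -> mergeProtoBytes(a, b));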

Thanks,
Rishikesh


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi,
I did not explicitly create a HiveContext. I have been using the
spark.sqlContext that gets created upon launching the spark-shell.
Isn't this sqlContext the same as the hiveContext?
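
For reference, in a standalone application I would enable Hive support
explicitly as below; my assumption is that spark-shell does the equivalent
when Spark is built with Hive support, which is part of what I want to
confirm:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("HiveSupportCheck") // placeholder app name
        .enableHiveSupport()         // makes spark.sql / sqlContext use the Hive metastore catalog
        .getOrCreate();

spark.sql("SHOW TABLES").show();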
Thanks,
Rishikesh

On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke wrote:

> Do you use the HiveContext in Spark? Do you configure the same options
> there? Can you share some code?
>
> On 07.08.2019 at 08:50, Rishikesh Gawade wrote:
>
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0.
> Even if i use parquet files the result would be same, because after all
> sparkSQL isn't able to descend into the subdirectories over which the table
> is created. Could there be any other way?
> Thanks,
> Rishikesh
>
> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
> wrote:
>
>> which versions of Spark and Hive are you using.
>>
>> what will happen if you use parquet tables instead?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>> wrote:
>>
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has
>>> data stored in ORC format. This directory has several subdirectories inside
>>> it, each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest
>>> data from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as
>>> *hive.mapred.supports.subdirectories=TRUE* and
>>> *mapred.input.dir.recursive**=TRUE*.
>>> As a result of this, when i fire the simplest query of *select count(*)
>>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>>> count of records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>
>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>> getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that
>>> this works as it does via hive cli?
>>> I am using Spark on YARN.
>>>
>>> Thanks,
>>> Rishikesh
>>>
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>>> table, orc, sparksql, yarn
>>>
>>


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Rishikesh Gawade
Hi.
I am using Spark 2.3.2 and Hive 3.1.0.
Even if I use parquet files the result would be the same, because sparkSQL
isn't able to descend into the subdirectories over which the table is
created. Could there be any other way?
Thanks,
Rishikesh

On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
wrote:

> which versions of Spark and Hive are you using.
>
> what will happen if you use parquet tables instead?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
> wrote:
>
>> Hi.
>> I have built a Hive external table on top of a directory 'A' which has
>> data stored in ORC format. This directory has several subdirectories inside
>> it, each of which contains the actual ORC files.
>> These subdirectories are actually created by spark jobs which ingest data
>> from other sources and write it into this directory.
>> I tried creating a table and setting the table properties of the same as
>> *hive.mapred.supports.subdirectories=TRUE* and
>> *mapred.input.dir.recursive**=TRUE*.
>> As a result of this, when i fire the simplest query of *select count(*)
>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>> count of records in the table.
>> However, when i fire the same query via sparkSQL, i get count = 0.
>>
>> I think the sparkSQL isn't able to descend into the subdirectories for
>> getting the data while hive is able to do so.
>> Are there any configurations needed to be set on the spark side so that
>> this works as it does via hive cli?
>> I am using Spark on YARN.
>>
>> Thanks,
>> Rishikesh
>>
>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>> table, orc, sparksql, yarn
>>
>


Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Rishikesh Gawade
Hi.
I have built a Hive external table on top of a directory 'A' which has data
stored in ORC format. This directory has several subdirectories inside it,
each of which contains the actual ORC files.
These subdirectories are created by Spark jobs which ingest data from other
sources and write it into this directory.
I created the table and set its table properties
*hive.mapred.supports.subdirectories=TRUE* and *mapred.input.dir.recursive*
*=TRUE*.
As a result, when I fire the simplest query, *select count(*) from ExtTable*,
via the Hive CLI, it successfully gives me the expected count of records in
the table.
However, when I fire the same query via sparkSQL, I get count = 0.

I think sparkSQL isn't able to descend into the subdirectories to read the
data, while Hive is able to do so.
Are there any configurations that need to be set on the Spark side so that
this works as it does via the Hive CLI?
I am using Spark on YARN.
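
For what it's worth, the equivalent settings issued from the Spark side would
look roughly like the sketch below; I have not verified whether Spark's ORC
reader actually honours them the way Hive does, which is essentially my
question:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("SubdirectoryTest") // placeholder app name
        .enableHiveSupport()
        .getOrCreate();

// The same properties that were set on the Hive side, applied to the Spark session.
spark.sql("SET hive.mapred.supports.subdirectories=true");
spark.sql("SET mapred.input.dir.recursive=true");

// Returns the expected count via the Hive CLI, but 0 via sparkSQL.
spark.sql("SELECT COUNT(*) FROM ExtTable").show();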

Thanks,
Rishikesh

Tags: subdirectories, subdirectory, recursive, recursion, hive external
table, orc, sparksql, yarn


Re: Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
To put it simply, what configurations need to be done on the client machine
so that it can run the driver on itself and the executors on the
Spark-on-YARN cluster nodes?

On Mon, Apr 22, 2019, 8:22 PM Rishikesh Gawade 
wrote:

> Hi.
> I have been experiencing trouble while trying to connect to a Spark
> cluster remotely. This Spark cluster is configured to run using YARN.
> Can anyone guide me or provide any step-by-step instructions for
> connecting remotely via spark-shell?
> Here's the setup that I am using:
> The Spark cluster is running with each node as a docker container hosted
> on a VM. It is using YARN for scheduling resources for computations.
> I have a dedicated docker container acting as a spark client, on which i
> have the spark-shell installed(spark binary in standalone setup) and also
> the Hadoop and Yarn config directories set so that spark-shell can
> coordinate with the RM for resources.
> With all of this set, i tried using the following command:
>
> spark-shell --master yarn --deploy-mode client
>
> This results in the spark-shell giving me a scala-based console, however,
> when I check the Resource Manager UI on the cluster, there seems to be no
> application/spark session running.
> I have been expecting the driver to be running on the client machine and
> the executors running in the cluster. But that doesn't seem to happen.
>
> How can I achieve this?
> Is whatever I am trying feasible, and if so, a good practice?
>
> Thanks & Regards,
> Rishikesh
>


Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
Hi.
I have been having trouble trying to connect to a Spark cluster remotely.
This Spark cluster is configured to run using YARN.
Can anyone guide me or provide step-by-step instructions for connecting
remotely via spark-shell?
Here's the setup that I am using:
The Spark cluster runs with each node as a Docker container hosted on a VM.
It uses YARN for scheduling resources for computations.
I have a dedicated Docker container acting as a Spark client, on which I have
spark-shell installed (a standalone Spark binary distribution), and the
Hadoop and YARN config directories set so that spark-shell can coordinate
with the ResourceManager for resources.
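
Concretely, the client container's environment points at copies of the
cluster's Hadoop/YARN configuration, roughly like this (paths are
placeholders):

export HADOOP_CONF_DIR=/opt/hadoop/conf   # placeholder; holds core-site.xml and hdfs-site.xml copied from the cluster
export YARN_CONF_DIR=/opt/hadoop/conf     # placeholder; holds yarn-site.xml pointing at the cluster's ResourceManager
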
With all of this set, I tried using the following command:

spark-shell --master yarn --deploy-mode client

This results in spark-shell giving me a Scala console; however, when I check
the ResourceManager UI on the cluster, there seems to be no application/Spark
session running.
I have been expecting the driver to run on the client machine and the
executors to run in the cluster, but that doesn't seem to happen.

How can I achieve this?
Is what I am trying feasible, and if so, is it good practice?

Thanks & Regards,
Rishikesh


How to use same SparkSession in another app?

2019-04-16 Thread Rishikesh Gawade
Hi.
I wish to use a SparkSession created by one app in another app so that I can
use the dataframes belonging to that session. Is it possible to use the same
SparkSession in another app?
Thanks,
Rishikesh


Error: NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT while running a Spark-Hive Job

2018-04-16 Thread Rishikesh Gawade
Hello there,
I am using *spark-2.3.0*, compiled using the following Maven command:
*mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean install*
I have configured it to run with *Hive v2.3.3*. Also, all the Hive-related
JARs (*v1.2.1*) in Spark's jars folder have been replaced by the JARs
available in Hive's *lib* folder, and I have also configured Hive to use
*spark* as its *execution engine*.

After that, in my Java application I have written the following lines:

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
        .config("hive.metastore.uris", "thrift://hadoopmaster:9083")
        .enableHiveSupport()
        .getOrCreate();
HiveContext hc = new HiveContext(spark);
hc.sql("SHOW DATABASES").show();

I built the project thereafter (no compilation errors) and tried running the
job using the spark-submit command as follows:

spark-submit --master yarn --class org.adbms.SpamFilter
IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

On doing so, I received the following error:
Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
and more...


I have no idea what this is about. Is this because Spark v2.3.0 and Hive
v2.3.3 aren't compatible with each other?
If I have done anything wrong, I request you to point it out and suggest the
required changes. Also, if it's the case that I have misconfigured Spark and
Hive, please suggest the changes in configuration; a link to a guide covering
all the necessary configs would also be appreciated.
Thank you in anticipation.
Regards,
Rishikesh Gawade


ERROR: Hive on Spark

2018-04-15 Thread Rishikesh Gawade
Hello there. I am a newbie in the world of Spark. I have been working on a
Spark project using Java.
I have configured Hive and Spark to run on Hadoop.
As of now I have created a Hive (Derby) database on Hadoop HDFS at the given
warehouse location */user/hive/warehouse*, with the database name *spam*
(saved as *spam.db* at the aforementioned location).
I have been trying to read the tables in this database in Spark to create
RDDs/DataFrames.
Could anybody please guide me on how I can achieve this?
I used the following statements in my Java code:

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("yarn")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate();
spark.sql("USE spam");
spark.sql("SELECT * FROM spamdataset").show();

After this I built the project using Maven (mvn clean package -DskipTests)
and a JAR was generated.

After this, I tried running the project via the spark-submit CLI using:

spark-submit --class com.adbms.SpamFilter --master yarn
~/IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

and got the following error:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spam' not found;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:174)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:256)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
at com.adbms.SpamFilter.main(SpamFilter.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Also, I replaced the SQL query with "SHOW DATABASES", and it showed only one
database, namely "default". The databases stored in the HDFS warehouse dir
weren't shown.

I request you to check this and, if anything is wrong, to suggest an ideal
way to read Hive tables on Hadoop in Spark using Java. A link to a webpage
with the relevant info would also be appreciated.
Thank you in anticipation.
Regards,
Rishikesh Gawade


Accessing Hive Database (On Hadoop) using Spark

2018-04-15 Thread Rishikesh Gawade
Hello there. I am a newbie in the world of Spark. I have been working on a
Spark project using Java.
I have configured Hive and Spark to run on Hadoop.
As of now I have created a Hive (Derby) database on Hadoop HDFS at the given
warehouse location */user/hive/warehouse*, with the database name *spam*
(saved as *spam.db* at the aforementioned location).
I have been trying to read the tables in this database in Spark to create
RDDs/DataFrames.
Could anybody please guide me on how I can achieve this?
I used the following statements in my Java code:

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("yarn")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate();
spark.sql("USE spam");
spark.sql("SELECT * FROM spamdataset").show();

After this I built the project using Maven (mvn clean package -DskipTests)
and a JAR was generated.

After this, I tried running the project via the spark-submit CLI using:

spark-submit --class com.adbms.SpamFilter --master yarn
~/IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

and got the following error:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spam' not found;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:174)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:256)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
at com.adbms.SpamFilter.main(SpamFilter.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


I request you to check this and, if anything is wrong, to suggest an ideal
way to read Hive tables on Hadoop in Spark using Java. A link to a webpage
with the relevant info would also be appreciated.
Thank you in anticipation.
Regards,
Rishikesh Gawade