Re: Advanced GC Tuning

2021-07-20 Thread Sean Owen
You're right, I think storageFraction is somewhat better to control this,
although some things 'counted' in spark.memory.fraction will also be
long-lived and in the OldGen.
You can also increase the OldGen size if you're pretty sure that's the
issue - 'old' objects in the YoungGen.

I'm not sure how much these will affect performance with modern JVMs; this
advice is 5-9 years old.
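
For concreteness, a minimal sketch of how those knobs could be set from
PySpark (the storageFraction and NewRatio values here are illustrative
assumptions, not recommendations):

from pyspark.sql import SparkSession

# Sketch only: shrink the storage share of the unified region rather
# than the region itself, and grow the OldGen via NewRatio (the OldGen
# then occupies NewRatio/(NewRatio+1) of the heap).
spark = (SparkSession.builder
         .appName("gc-tuning-sketch")
         .config("spark.memory.storageFraction", "0.3")   # default is 0.5
         .config("spark.executor.extraJavaOptions", "-XX:NewRatio=3")
         .getOrCreate())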



Advanced GC Tuning

2021-07-20 Thread Kuznetsov, Oleksandr
Hello,

I was reading the Garbage Collection Tuning guide here: Tuning - Spark 3.1.2
Documentation (apache.org), specifically the section on "Advanced GC Tuning".
It is stated that if the OldGen region is getting full, it is recommended to
lower spark.memory.fraction. I am wondering if this would lower the overall
amount of memory available for both storage and execution, slowing down
execution. Isn't it better to lower spark.memory.storageFraction instead? In
this case there is less memory available for caching objects, while execution
is not affected. Please see below the copy of the passage I am referring to:


*   "In the GC stats that are printed, if the OldGen is close to being 
full, reduce the amount of memory used for caching by lowering 
spark.memory.fraction; it is better to cache fewer objects than to slow down 
task execution. Alternatively, consider decreasing the size of the Young 
generation. This means lowering -Xmn if you've set it as above. If not, try 
changing the value of the JVM's NewRatio parameter. Many JVMs default this to 
2, meaning that the Old generation occupies 2/3 of the heap. It should be large 
enough such that this fraction exceeds spark.memory.fraction."
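
As a quick sanity check of the arithmetic in that passage (a sketch using
the JVM default NewRatio = 2 and Spark's default spark.memory.fraction of
0.6):

new_ratio = 2
oldgen_fraction = new_ratio / (new_ratio + 1)   # OldGen = 2/3 of the heap

spark_memory_fraction = 0.6                     # Spark's default

# The documented condition: the OldGen share of the heap should exceed
# spark.memory.fraction. With the defaults, 0.67 > 0.6, so it holds.
assert oldgen_fraction > spark_memory_fraction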

I would greatly appreciate it if you could clarify this for me.

Thank you in advance.

Best regards,
Oleksandr (Alex) Kuznetsov
DC Solution Specialist | Consulting/Strategy and Analytics
Deloitte Consulting LLP
1001 Heathrow Park Ln, Lake Mary, FL 32746
Mobile: 281-384-1331
olkuznet...@deloitte.com | 
www.deloitte.com




Unpacking and using external modules with PySpark inside k8s

2021-07-20 Thread Mich Talebzadeh
I have been struggling with this.


Kubernetes (not that it matters, it's minikube) is working fine. In one of
the modules, called configure.py, I am importing the yaml module


import yaml


This throws the following error:


import yaml
ModuleNotFoundError: No module named 'yaml'


I have been through a number of loops.


First I created a virtual environment, pyspark_venv.tar.gz, that includes the
yaml module, and passed it to spark-submit as follows:


+ spark-submit --verbose --master k8s://192.168.49.2:8443 \
    '--archives=hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv' \
    --deploy-mode cluster --name pytest \
    --conf 'spark.kubernetes.namespace=spark' \
    --conf 'spark.executor.instances=1' \
    --conf 'spark.kubernetes.driver.limit.cores=1' \
    --conf 'spark.executor.cores=1' \
    --conf 'spark.executor.memory=500m' \
    --conf 'spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1' \
    --conf 'spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount' \
    --py-files hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip \
    hdfs://50.140.197.220:9000/minikube/codes/testyml.py


Parsed arguments:
  master                  k8s://192.168.49.2:8443
  deployMode              cluster
  executorMemory          500m
  executorCores           1
  totalExecutorCores      null
  propertiesFile          /opt/spark/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    $SPARK_HOME/jars/*.jar
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            1
  files                   null
  pyFiles                 hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
  archives                hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv
  mainClass               null
  primaryResource         hdfs://50.140.197.220:9000/minikube/codes/testyml.py
  name                    pytest
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true


Unpacking an archive
hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv
from /tmp/spark-d339a76e-090c-4670-89aa-da723d6e9fbc/pyspark_venv.tar.gz
to /opt/spark/work-dir/./pyspark_venv


printing sys.path
/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc
/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages

 Printing user_paths
['/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip',
'/opt/spark/python/lib/pyspark.zip',
'/opt/spark/python/lib/py4j-0.10.9-src.zip',
'/opt/spark/jars/spark-core_2.12-3.1.1.jar']
checking yaml
Traceback (most recent call last):
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line
18, in <module>
main()
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line
15, in main
import yaml
ModuleNotFoundError: No module named 'yaml'


Well, it does not matter whether it is yaml or numpy: it just cannot find the
modules. How can I find out if the gz file was unpacked OK?
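
One way to check from inside the driver or an executor is to look at the
working directory and the interpreter in use. A minimal sketch, assuming
the archive was submitted with the #pyspark_venv fragment and so unpacks
into ./pyspark_venv under the container's working directory:

import os
import sys

venv_dir = os.path.join(os.getcwd(), "pyspark_venv")
print("venv unpacked:", os.path.isdir(venv_dir))
if os.path.isdir(venv_dir):
    print(sorted(os.listdir(venv_dir)))   # expect bin/, lib/, ...

# The venv's site-packages are only picked up if the job runs the venv's
# interpreter; the Spark docs on Python package management pair --archives
# with PYSPARK_PYTHON=./pyspark_venv/bin/python for exactly this reason.
print("interpreter in use:", sys.executable)
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))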


Thanks




Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Mich Talebzadeh
BTW, what is the basis for the assumption that the thread owner is writing to
the cluster? The Thrift server is running locally on localhost:1. I concur
that JDBC is needed for a remote Hive. However, this is not the impression I
get here.

df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")

There is some confusion somewhere!





unsubscribe

2021-07-20 Thread Du Li



Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
From the Cloudera Documentation:
https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf

UseNativeQuery
 1: The driver does not transform the queries emitted by applications, so
the native query is used.
 0: The driver transforms the queries emitted by applications and converts
them into an equivalent form in HiveQL.


Try changing the "UseNativeQuery" parameter and see if it works :)
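
A minimal PySpark sketch with that parameter flipped, reusing the
placeholder URL and credentials from this thread (none of these values are
verified):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-jdbc-test").getOrCreate()

df = spark.createDataFrame(
    [("John", "Smith", "London"), ("David", "Jones", "India")],
    ["first_name", "last_name", "country"])

# UseNativeQuery=1: per the Cloudera documentation quoted above, the
# driver passes queries through as-is instead of rewriting them.
(df.write
   .format("jdbc")
   .option("url", "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=1")
   .option("dbtable", "test.test")
   .option("user", "admin")
   .option("password", "admin")
   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
   .mode("overwrite")
   .save())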

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
Insert mode is "overwrite", so it shouldn't matter whether the table already
exists or not. The JDBC driver should be based on the Cloudera Hive version;
we can't know the CDH version he's using.

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Mich Talebzadeh
The driver is fine and up to date, and it should work.

I have asked the thread owner to send the DDL of the table and how the
table is created. In this case JDBC from Spark expects the table to be
there.

The error below

java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR processing
query/statement. Error Code: 4, SQL state:
TStatus(statusCode:ERROR_STATUS,
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
compiling statement: FAILED: ParseException line 1:39 cannot recognize
input near '"first_name"' 'TEXT' ',' in column name or primary key or
foreign key:28:27,
org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329

Sounds like a mismatch between the columns in the Spark DataFrame and the
underlying table.
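
If the generated DDL is the mismatch (the ParseException trips on
'"first_name"' 'TEXT', i.e. ANSI-quoted identifiers and a TEXT column type
that HiveQL does not accept), one knob worth trying is the JDBC writer's
createTableColumnTypes option, which overrides the column types Spark puts
in its CREATE TABLE. A hedged PySpark sketch, reusing the thread's
placeholder URL and credentials (identifier quoting is still chosen by the
JDBC dialect, so this alone may not be enough):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-jdbc-types").getOrCreate()

df = spark.createDataFrame(
    [("John", "Smith", "London")],
    ["first_name", "last_name", "country"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
   .option("dbtable", "test.test")
   .option("user", "admin")
   .option("password", "admin")
   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
   # Emit Hive-friendly column types instead of Spark's TEXT default.
   .option("createTableColumnTypes",
           "first_name STRING, last_name STRING, country STRING")
   .mode("overwrite")
   .save())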

HTH




Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
Badrinath is trying to write to Hive in a cluster where he doesn't have
permission to submit Spark jobs; he doesn't have Hive/Spark metadata
access. The only way to communicate with this third-party Hive cluster is
through the JDBC protocol.

[ Cloudera Data Hub - Hive Server ] <-> [ Spark Standalone ]

It's Spark that creates the table, because he's using "overwrite" in order
to test it.

df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .mode("overwrite")
  .save

This error is weird; it looks like the third-party Hive server isn't able
to recognize the SQL dialect coming from the [Spark Standalone] server's
JDBC driver.

1) I would try to execute the create statement manually on this server
2) if it works, try to run again with the "append" option

I would open a case with Cloudera and ask which driver you should use.

Thanks



On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
wrote:

> As Mich mentioned, there is no need to use the JDBC API; using the
> DataFrameWriter's saveAsTable method is the way to go. The JDBC driver is
> for a JDBC client (a Java client, for instance) to access the Hive tables
> in Spark via the Thrift server interface.
>
> -- ND
>
> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>
> I have been trying to create a table in Hive from Spark itself.
>
> Using local mode it works. What I am trying here is: from Spark
> standalone, I want to create a managed table in Hive (another Spark
> cluster, basically CDH) using JDBC mode.
>
> When I try that, below is the error I am facing.
>
> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
> wrote:
>
>> Have you created that table in Hive, or are you trying to create it from
>> Spark itself?
>>
>> Your Hive is local. In this case you don't need a JDBC connection. Have
>> you tried:
>>
>> df2.write.mode("overwrite").saveAsTable("mydb.mytable")
>>
>> HTH
>>
>> On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
>> pbadrinath1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Trying to write data from Spark to Hive in JDBC mode; below is the
>>> sample code:
>>>
>>> Spark standalone, version 2.4.7
>>>
>>> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>> Spark context Web UI available at http://localhost:4040
>>> Spark context available as 'sc' (master = spark://localhost:7077, app id
>>> = app-20210715080414-0817).
>>> Spark session available as 'spark'.
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>>>       /_/
>>>
>>> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>>
>>> scala> :paste
>>> // Entering paste mode (ctrl-D to finish)
>>>
>>> val df = Seq(
>>> ("John", "Smith", "London"),
>>> ("David", "Jones", "India"),
>>> ("Michael", "Johnson", "Indonesia"),
>>> ("Chris", "Lee", "Brazil"),
>>> ("Mike", "Brown", "Russia")
>>>   ).toDF("first_name", "last_name", "country")
>>>
>>>
>>>  df.write
>>>   .format("jdbc")
>>>   .option("url",
>>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>>   .option("dbtable", "test.test")
>>>   .option("user", "admin")
>>>   .option("password", "admin")
>>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>>   .mode("overwrite")
>>>   .save
>>>
>>>
>>> // Exiting paste mode, now interpreting.
>>>
>>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>>> processing query/statement. Error Code: 4, SQL state:
>>> TStatus(statusCode:ERROR_STATUS,
>>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>>> foreign key:28:27,
>>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
>>> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
>>> org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,