Re: spark-jdbc impala with kerberos using yarn-client

2017-09-05 Thread morfious902002
I was able to query data from the Impala table. Here is my Git repo for anyone who would like to check it out: https://github.com/morfious902002/impala-spark-jdbc-kerberos
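
For reference, a minimal sketch of the kind of read the repo demonstrates, assuming Spark 2.x and the Cloudera Impala JDBC 4.1 driver. The host, realm, and table names are placeholders; AuthMech=1 selects Kerberos authentication in the Cloudera driver, and the Krb* properties must match the Impala service principal.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ImpalaJdbcKerberosRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("impala-jdbc-kerberos")
                    .getOrCreate();

            // Placeholder host, realm, and service principal values.
            String url = "jdbc:impala://impala-host.example.com:21050/default;"
                    + "AuthMech=1;KrbRealm=EXAMPLE.COM;"
                    + "KrbHostFQDN=impala-host.example.com;KrbServiceName=impala";

            Dataset<Row> df = spark.read()
                    .format("jdbc")
                    .option("driver", "com.cloudera.impala.jdbc41.Driver")
                    .option("url", url)
                    .option("dbtable", "my_table") // placeholder table name
                    .load();

            df.show();
        }
    }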

Re: spark-jdbc impala with kerberos using yarn-client

2017-07-03 Thread morfious902002
Did you ever find a solution to this? If so, can you share it? I am running into a similar issue in YARN cluster mode connecting to an Impala table.

Re: Creating Dataframe by querying Impala

2017-06-01 Thread morfious902002
The issue seems to be with the primordial class loader. I cannot place the drivers on all the nodes at the same location, but I have loaded the jars to HDFS. I have tried SPARK_YARN_DIST_FILES as well as SPARK_CLASSPATH on the edge node with no luck. Is there another way to load these jars through
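
One commonly suggested approach for third-party JDBC drivers (an assumption here, not confirmed as the fix in this thread) is to hand the jar to spark-submit explicitly, so it reaches both the driver and executor classpaths rather than relying on SPARK_CLASSPATH:

    # Placeholder paths, main class, and jar names. --jars ships the driver
    # jar to the executors; --driver-class-path puts it on the driver's
    # system classpath, where DriverManager can see it.
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode client \
      --jars /opt/jdbc/ImpalaJDBC41.jar \
      --driver-class-path /opt/jdbc/ImpalaJDBC41.jar \
      myapp.jar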

Creating Dataframe by querying Impala

2017-05-31 Thread morfious902002
Hi, I am trying to create a DataFrame by querying an Impala table. It works fine in my local environment, but when I try to run it on the cluster I get either Error: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver or "No suitable driver found". Can someone help me or direct me to

Saving parquet file in Spark giving error when Encryption at Rest is implemented

2017-01-30 Thread morfious902002
We are using Spark 1.6.1 on a CDH 5.5 cluster. The job worked fine with Kerberos, but when we implemented Encryption at Rest we ran into the following issue: Df.write().mode(SaveMode.Append).partitionBy("Partition").parquet(path); I have already tried setting these values with no success:
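
For context, a self-contained version of the write in question, assuming the Spark 1.6 Java API; the DataFrame, partition column, and output path are placeholders:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SaveMode;

    public class EncryptedParquetWrite {
        public static void write(DataFrame df, String path) {
            df.write()
              .mode(SaveMode.Append)      // append to existing partitions
              .partitionBy("Partition")   // partition column from the post
              .parquet(path);             // target path inside the encryption zone
        }
    }

One thing worth checking with HDFS Transparent Encryption (an assumption, not a confirmed fix for this thread) is that any temporary or staging directories the job uses live in the same encryption zone as the target path, since HDFS cannot rename files across encryption-zone boundaries.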

Slow Parquet write to HDFS using Spark

2016-11-03 Thread morfious902002
I am using Spark 1.6.1 and writing to HDFS. In some cases it seems like all the work is being done by one thread. Why is that? Also, I need parquet.enable.summary-metadata to register the Parquet files with Impala. Df.write().partitionBy("COLUMN").parquet(outputFileLocation); It also seems
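
A sketch of how the summary-metadata flag is typically set, assuming the Spark 1.6 Java API; the partition column and output location are the placeholders from the post:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;

    public class ParquetSummaryMetadataWrite {
        public static void write(JavaSparkContext sc, DataFrame df, String outputFileLocation) {
            // Ask Parquet to emit _metadata/_common_metadata summary files,
            // which are needed here to register the files with Impala.
            sc.hadoopConfiguration().set("parquet.enable.summary-metadata", "true");

            df.write()
              .partitionBy("COLUMN")      // placeholder partition column
              .parquet(outputFileLocation);
        }
    }

On the single-busy-thread symptom: if one value of the partition column dominates the data, a single task ends up writing most of the rows, which can look like one thread doing all the work; checking how rows are distributed across tasks in the final stage is one place to start.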

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread morfious902002
I have a Spark job that creates 6 million rows in RDDs. I convert the RDDs into DataFrames and write them to HDFS. Currently it takes 3 minutes to write to HDFS. I am using Spark 1.5.1 with YARN. Here is the snippet: RDDList.parallelStream().forEach(mapJavaRDD -> { if
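
Since the snippet is cut off, here is a hedged reconstruction of the pattern described, assuming the Spark 1.5 Java API; MyRecord, the guard condition, and the output path are hypothetical stand-ins. The "spark.sql.execution.id is already set" error was a known problem with concurrent DataFrame actions issued from multiple threads (SPARK-10548), fixed in later releases.

    import java.io.Serializable;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class ParallelParquetWriter {
        // Hypothetical JavaBean describing the rows in each RDD.
        public static class MyRecord implements Serializable {
            private String value;
            public String getValue() { return value; }
            public void setValue(String value) { this.value = value; }
        }

        public static void writeAll(SQLContext sqlContext, List<JavaRDD<MyRecord>> rddList) {
            rddList.parallelStream().forEach(javaRdd -> {
                if (!javaRdd.isEmpty()) { // hypothetical guard; the original condition is truncated
                    DataFrame df = sqlContext.createDataFrame(javaRdd, MyRecord.class);
                    df.write().parquet("/data/output/" + System.nanoTime()); // placeholder path
                }
            });
        }
    }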

EC2 cluster created by spark using old HDFS 1.0

2015-03-20 Thread morfious902002
Hi, I created a cluster using the spark-ec2 script, but it installs HDFS version 1.0. I would like to use this cluster to connect to Hive installed on a Cloudera CDH 5.3 cluster, but I am getting the following error: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate

Create a Spark cluster with cloudera CDH 5.2 support

2015-03-20 Thread morfious902002
Hi, I am trying to create a Spark cluster using the spark-ec2 script which will support 2.5.0-cdh5.3.2 for HDFS as well as Hive. I created a cluster by adding --hadoop-major-version=2.5.0, which solved some of the errors I was getting. But now when I run a SELECT query on Hive I get the following
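
For reference, a sketch of the launch command with placeholder key-pair, identity-file, and cluster names. One caveat worth checking: the spark-ec2 script documents "1", "2", and "yarn" as the accepted values for --hadoop-major-version, so a full version string such as 2.5.0 may not select the Hadoop build you expect.

    # Placeholder key pair, identity file, and cluster name.
    ./spark-ec2 \
      --key-pair=my-key \
      --identity-file=my-key.pem \
      --hadoop-major-version=2 \
      launch my-cluster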