Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread ayan guha
Try setting the following parameter: conf.set("spark.sql.hive.convertMetastoreParquet", "false") On Tue, Jun 13, 2017 at 3:34 PM, Angel Francisco Orta < angel.francisco.o...@gmail.com> wrote: > Hello, > > Do you use df.write, or do you use hivecontext.sql("insert into ...")? > > Angel. > > On 12 Jun.
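
A minimal sketch of that suggestion, assuming the thread's Spark 1.6 setup (app and table names are illustrative; on Spark 2.x the same key would go through spark.conf.set):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-parquet-compat"))
val hiveContext = new HiveContext(sc)

// With this off, Spark uses the Hive SerDe for Hive Parquet tables instead of
// its own Parquet support, which can avoid reader/writer incompatibilities.
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
```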

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Angel Francisco Orta
Hello, Do you use df.write, or do you use hivecontext.sql("insert into ...")? Angel. On 12 Jun. 2017 at 11:07 p.m., "Yong Zhang" wrote: > We are using Spark *1.6.2* as ETL to generate parquet file for one > dataset, and partitioned by "brand" (which is a string to
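
For reference, a sketch of the two write paths being asked about, in Spark 1.6 style (table, view, and path names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("write-paths"))
val hiveContext = new HiveContext(sc)
val df = hiveContext.table("staging_table")   // illustrative source

// 1) DataFrame writer: Spark itself lays out the Parquet files.
df.write.partitionBy("brand").parquet("hdfs:///warehouse/my_dataset")

// 2) Hive insert: the write goes through the Hive table definition.
df.registerTempTable("source_view")
hiveContext.sql("INSERT INTO TABLE my_dataset SELECT * FROM source_view")
```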

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Benjamin Kim
Hi Bo, +1 for your project. I come from the world of data warehouses, ETL, and reporting analytics. There are many individuals who do not know how to code or do not want to do any coding. They are content with ANSI SQL and stick to it. ETL workflows are also done without any coding, using a drag-and-drop user

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread bo yang
Hi Aakash, Thanks for your willingness to help :) It would be great if I could get more feedback on my project. For example, do other people also feel the need for a script-based way to write Spark jobs easily? Also, I would explore whether it is possible that the Spark project takes some work to

Re: Deciphering spark warning "Truncated the string representation of a plan since it was too large."

2017-06-12 Thread lucas.g...@gmail.com
AFAIK the process a Spark program follows is: 1. A set of transformations is defined on a given input dataset. 2. At some point an action is called (in your case, writing to your parquet file). 3. When that happens Spark creates a logical plan and then a physical plan
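
A small sketch of those steps, assuming Spark 2.x (the tiny dataset and output path are just for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-demo").getOrCreate()
import spark.implicits._

// 1. Transformations only: nothing executes yet.
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
val transformed = df.filter($"id" > 1).select($"label")

// Inspect the logical and physical plans Spark will use.
transformed.explain(true)

// 2./3. The action: the plans are built and the job actually runs here.
transformed.write.mode("overwrite").parquet("/tmp/plan_demo")
```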

Deciphering spark warning "Truncated the string representation of a plan since it was too large."

2017-06-12 Thread Henry M
I am trying to understand if I should be concerned about this warning: "WARN Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf" It occurs while writing a data frame to
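
A minimal sketch of how that setting can be raised, assuming Spark 2.x where the key is read from the Spark conf at startup (the value 200 is just an example):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Raising the limit keeps long plan strings from being truncated in the logs.
val conf = new SparkConf().set("spark.debug.maxToStringFields", "200")
val spark = SparkSession.builder()
  .config(conf)
  .appName("maxToStringFields-demo")
  .getOrCreate()
```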

broadcast() multiple times the same df. Is it cached ?

2017-06-12 Thread matd
Hi spark folks, In our application, we have to join a dataframe with several other dfs (not always on the same joining column). This left-hand side df is not very large, so a broadcast hint may be beneficial. My questions: - if the same df gets broadcast multiple times, will the transfer occur once
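
For concreteness, a sketch of the pattern in question (paths and join columns are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-demo").getOrCreate()

val small = spark.read.parquet("/data/small_df")   // the left-hand side df
val big1  = spark.read.parquet("/data/big_one")
val big2  = spark.read.parquet("/data/big_two")

// The same small df is hinted for broadcast in two joins on different columns;
// whether the second join reuses the already-transferred copy is the question above.
val joined1 = big1.join(broadcast(small), Seq("key_a"))
val joined2 = big2.join(broadcast(small), Seq("key_b"))
```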

Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Yong Zhang
We are using Spark 1.6.2 as ETL to generate the parquet files for one dataset, partitioned by "brand" (which is a string representing the brand in this dataset). After the partition folders such as "brand=a" are generated in HDFS, we add the partitions in Hive. The Hive version is 1.2.1 (In
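
A sketch of the workflow being described, with illustrative table and path names (the Hive statement is shown as a comment since it runs on the Hive side):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("brand-etl"))
val hiveContext = new HiveContext(sc)
val dataset = hiveContext.table("staging_table")   // illustrative source

// Spark writes one folder per brand value, e.g. .../my_dataset/brand=a
dataset.write.partitionBy("brand").parquet("hdfs:///warehouse/my_dataset")

// Each new folder is then registered on the Hive (1.2.1) side, e.g.:
// ALTER TABLE my_dataset ADD IF NOT EXISTS PARTITION (brand='a')
//   LOCATION 'hdfs:///warehouse/my_dataset/brand=a';
```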

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-12 Thread Thakrar, Jayesh
Could this be due to https://issues.apache.org/jira/browse/HIVE-6 ? From: Patrik Medvedev Date: Monday, June 12, 2017 at 2:31 AM To: Jörn Franke , vaquar khan Cc: Jean Georges Perrin , User

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Aakash Basu
Hey, I work on Spark SQL and would pretty much be able to help you in this. Let me know your requirement. Thanks, Aakash. On 12-Jun-2017 11:00 AM, "bo yang" wrote: > Hi Guys, > > I am writing a small open source project > to use

RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-12 Thread Mohammed Guller
Regarding Spark scheduler – if you are referring to the ability to distribute workload and scale, Kafka Streaming also provides that capability. It is deceptively simple in that regard if you already have a Kafka cluster. You can launch multiple instances of your Kafka streaming application and
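
As a rough illustration of that scale-out model (Kafka Streams used from Scala, assuming a Kafka 1.0+ API; topic and application names are made up): every instance started with the same application.id joins the same group and takes over a share of the input partitions.

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.KStream
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
// All instances sharing this application.id form one group and split the
// input topic's partitions among themselves.
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streaming-app")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

val builder = new StreamsBuilder()
val source: KStream[String, String] = builder.stream("input-topic")
source.to("output-topic")   // trivial pass-through topology

val streams = new KafkaStreams(builder.build(), props)
streams.start()             // run the same jar on more machines to scale out
```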

Re: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-12 Thread Rastogi, Pankaj
Please make sure that you have enough memory available on the driver node. If there is not enough free memory on the driver node, then your application won't start. Pankaj From: vaquar khan > Date: Saturday, June 10, 2017 at 5:02 PM To:
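
For example (a sketch only; cluster manager, class, and sizes are made up), the driver memory requested at submit time has to fit on whichever node ends up hosting the driver:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  --driver-memory 8g \
  --executor-memory 4g \
  my-job.jar
```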

Re: [How-To] Custom file format as source

2017-06-12 Thread Vadim Semenov
It should be easy to start with a custom Hadoop InputFormat that reads the file and creates an `RDD[Row]`. Since you know the record size, it should be pretty easy to make the InputFormat produce splits, so you could then read the file in parallel. On Mon, Jun 12, 2017 at 6:01 AM, OBones
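
A sketch along those lines, using Hadoop's built-in FixedLengthInputFormat for the known-record-size case (path, record length, and the decoding step are illustrative; a fully custom InputFormat would replace it for anything richer):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("custom-binary-format").getOrCreate()
val sc = spark.sparkContext

val conf = new Configuration(sc.hadoopConfiguration)
FixedLengthInputFormat.setRecordLength(conf, 32)   // bytes per record, known up front

// Splits are produced per record boundary, so the file is read in parallel.
val records = sc.newAPIHadoopFile(
  "hdfs:///data/file.bin",
  classOf[FixedLengthInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable],
  conf
)

// Decode each fixed-length record into a Row (decoding depends on the format).
val rows = records.map { case (_, bytes) => Row(bytes.copyBytes()) }
```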

Re: SPARK environment settings issue when deploying a custom distribution

2017-06-12 Thread Chanh Le
Just to add more information on how I build the custom distribution: I clone the Spark repo, switch to branch-2.2, and then make the distribution as follows. λ ~/workspace/big_data/spark/ branch-2.2* λ ~/workspace/big_data/spark/ ./dev/make-distribution.sh --name custom --tgz -Phadoop-2.7

SPARK environment settings issue when deploying a custom distribution

2017-06-12 Thread Chanh Le
Hi everyone, Recently I discovered an issue when processing CSV in Spark, so I decided to fix it following https://issues.apache.org/jira/browse/SPARK-21024 and built a custom distribution for internal use. I built it on my local machine, then uploaded the distribution to the server. The server's

RE: [How-To] Custom file format as source

2017-06-12 Thread Mendelson, Assaf
Try https://mapr.com/blog/spark-data-source-api-extending-our-spark-sql-query-engine/ Thanks, Assaf. -Original Message- From: OBones [mailto:obo...@free.fr] Sent: Monday, June 12, 2017 1:01 PM To: user@spark.apache.org Subject: [How-To] Custom file format as source
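
The linked post is about the Data Source API; a bare-bones sketch of that route (class names, the stubbed scan, and the single-column schema are all illustrative, not taken from the post):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class CustomBinaryRelationProvider extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new CustomBinaryRelation(parameters("path"))(sqlContext)
}

class CustomBinaryRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // The schema would normally be decoded from the file's own column list.
  override def schema: StructType = StructType(Seq(StructField("value", LongType)))

  // buildScan would decode the binary file at `path` into rows; stubbed here.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}

// Usage (hypothetical package name):
// spark.read.format("com.example.CustomBinaryRelationProvider").load("/data/file.bin")
```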

[How-To] Custom file format as source

2017-06-12 Thread OBones
Hello, I have an application here that generates data files in a custom binary format that provides the following information: a column list, where each column has a data type (64-bit integer, 32-bit string index, 64-bit IEEE float, 1-byte boolean); catalogs that give modalities for some columns (i.e.,

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-12 Thread Patrik Medvedev
Hello, All security checks are disabled, but I still don't have any info in the result. Sun, 11 Jun 2017 at 14:24, Jörn Franke : > Is Sentry preventing the access? > > On 11 Jun 2017, at 01:55, vaquar khan wrote: > > Hi, > Please check your firewall