[SPARK SQL] Difference between 'Hive on spark' and Spark SQL

2018-12-19 Thread luby
Hi, All, We are starting to migrate our data to the Hadoop platform, hoping to use 'Big Data' technologies to improve our business. We are new to the area and want to get some help from you. Currently all our data is loaded into Hive and some complicated SQL query statements are run daily. We
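
A minimal sketch, assuming the existing Hive tables should be queried through Spark SQL's own engine (the table and query are hypothetical):

    import org.apache.spark.sql.SparkSession

    // With Hive support enabled, Spark SQL reads the Hive metastore and
    // warehouse directly, so existing daily queries can run on Spark's
    // engine without needing Hive-on-Spark.
    val spark = SparkSession.builder()
      .appName("hive-tables-via-spark-sql")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT cust_id, SUM(amount) FROM sales GROUP BY cust_id").show()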

Spark not working with Hadoop 4mc compression

2018-12-19 Thread Abhijeet Kumar
Hello, I’m using 4mc compression in my Hadoop setup, and when I’m reading a file from HDFS it throws an error. https://github.com/carlomedas/4mc I’m running a simple query: sc.textFile("/store.csv").getNumPartitions Error: java.lang.RuntimeException: Error in configuring object at
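
One common cause is that the codec is not registered with Hadoop; a minimal sketch, assuming the codec class name from the 4mc README and that the 4mc jar is shipped with --jars (both are assumptions):

    import org.apache.spark.sql.SparkSession

    // Register the 4mc codec with the Hadoop configuration; the jar must be
    // on both the driver and executor classpaths.
    val spark = SparkSession.builder()
      .appName("read-4mc")
      .config("spark.hadoop.io.compression.codecs",
        "com.hadoop.compression.fourmc.FourMcCodec")
      .getOrCreate()

    println(spark.sparkContext.textFile("/store.csv").getNumPartitions)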

Re: Read Time from a remote data source

2018-12-19 Thread Jiaan Geng
First, a Spark worker does not itself have the ability to compute; in fact, the executor is responsible for computation, and the tasks an executor runs are distributed by the driver. Normally each task reads only a section of the data, but the stage has only one partition. If your operators do not contain the operator that will
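
To illustrate the single-partition point, a small sketch (the path is hypothetical): repartitioning lets the driver distribute tasks across the executors instead of running one task.

    // A stage with one partition runs as a single task on one executor.
    val rdd = sc.textFile("hdfs:///data/input.txt")
    println(rdd.getNumPartitions)      // may be 1 for a small remote source

    val parallel = rdd.repartition(8)  // now 8 tasks, spread over executors
    println(parallel.count())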

[Spark SQL]use zstd, No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD

2018-12-19 Thread 李斌松
The parquet-hadoop-bundle jar is imported into the Spark Hive project. When you compress data using zstd, the codec class may be loaded preferentially from parquet-hadoop-bundle, and then you can't find the enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD. > 18/12/20 10:35:28 ERROR Executor:
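
For reference, a sketch of the setting that triggers this path; the versions involved are an assumption, but the ZSTD constant only exists in newer parquet-mr releases, so an older parquet-hadoop-bundle on the classpath must not shadow them:

    // Writing Parquet with zstd; fails with "No enum constant ... ZSTD" when
    // an older parquet-hadoop-bundle (without ZSTD) wins on the classpath.
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

    spark.range(1000).toDF("id")
      .write
      .mode("overwrite")
      .parquet("/tmp/zstd_test")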

Fwd: Train multiple machine learning models in parallel

2018-12-19 Thread Pola Yao
Hi Community, I have a 1T dataset which contains records for 50 users. Each user has about 20G of data on average. I wanted to use Spark to train a machine learning model (e.g., an XGBoost tree model) for each user. Ideally, the result should be 50 models. However, it'd be infeasible to submit 50 Spark jobs
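
One pattern that avoids 50 separate submissions is to run the trainings concurrently inside a single application; a sketch, assuming the per-user data can be filtered from the full dataset (the path and the trainXGBoost helper are hypothetical):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val df = spark.read.parquet("/data/all_users")     // hypothetical path

    // Each future submits an independent Spark job; the scheduler runs them
    // concurrently, so one application produces all 50 models.
    val futures = (1 to 50).map { id =>
      Future {
        val userData = df.filter(df("user_id") === id)
        (id, trainXGBoost(userData))                   // hypothetical helper
      }
    }
    val models = Await.result(Future.sequence(futures), Duration.Inf)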

Re: question about barrier execution mode in Spark 2.4.0

2018-12-19 Thread Xiangrui Meng
On Mon, Nov 12, 2018 at 7:33 AM Joe wrote: > Hello, I was reading the Spark 2.4.0 release docs and I'd like to find out more about barrier execution mode. In particular I'd like to know what happens when the number of partitions exceeds the number of nodes (which I think is allowed, Spark tuning doc
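
For context, a minimal sketch of barrier mode as documented in 2.4.0; every task of a barrier stage must be scheduled at the same time, which is exactly where the partitions-versus-slots question matters:

    import org.apache.spark.BarrierTaskContext

    // All 4 tasks launch together; if there are more partitions (tasks) than
    // free slots, the barrier stage cannot start.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    val doubled = rdd.barrier().mapPartitions { iter =>
      BarrierTaskContext.get().barrier()   // wait for all tasks to reach here
      iter.map(_ * 2)
    }
    doubled.collect()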

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-19 Thread Mich Talebzadeh
Thanks Sam. Looks interesting. I will have a look at it in detail and let you know. Best, Dr Mich Talebzadeh, LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Read Time from a remote data source

2018-12-19 Thread swastik mittal
I am running a model where the workers should not have the data stored on them; they are only for execution purposes. The other cluster (it's just a single node) from which I am receiving data is just acting as a file server, for which I could have used any other means like NFS or FTP. So I went with

[no subject]

2018-12-19 Thread Daniel O' Shaughnessy
unsubscribe

[Spark Core] Support for parquet column indexes

2018-12-19 Thread Kamil Krzysztof Krynicki
Hello, Recently there has been an addition to Parquet files, namely column indexes. See: https://stackoverflow.com/questions/26909543/index-in-parquet/40714337#40714337 Available since parquet encoder 1.11, parquet format 2.5. It seems to improve the IO performance by an order of
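
The gain shows up on selective filters; a sketch (assuming a build whose Parquet writer actually emits column indexes, i.e. parquet-mr 1.11+, and a hypothetical path) of a read that could benefit through pushdown:

    // Column indexes keep per-page min/max values, so a pushed-down filter
    // can skip individual pages instead of only whole row groups.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    val events = spark.read.parquet("/data/events")
    events.filter(events("ts") > "2018-12-01").count()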

Spark Kafka Streaming with Offset Gaps

2018-12-19 Thread Rishabh Pugalia
I have an app that uses Kafka Streams to pull data from the `input` topic and push to the `output` topic with `processing.guarantee=exactly_once`. Due to `exactly_once`, gaps (transaction markers) are created in Kafka. Let's call this app `kafka-streamer`. Now I have another app that listens to this
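
For reference, a minimal sketch of the downstream reader as Structured Streaming, where offset gaps from transaction markers are expected and handled by the Kafka consumer (broker address and topic names are placeholders):

    // Read the topic produced with exactly_once; committed offsets will not
    // be contiguous because of the transaction markers.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "output")
      .option("startingOffsets", "latest")
      .load()

    stream.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()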

Re: Spark 2.2.1 - Operation not allowed: alter table replace columns

2018-12-19 Thread Jiaan Geng
This SQL syntax is not supported at the moment! Please use ALTER TABLE ... CHANGE COLUMN instead.
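
A sketch of the suggested alternative (table and column names are hypothetical; in Spark 2.x, CHANGE COLUMN is limited, e.g. to updating a column's comment):

    // REPLACE COLUMNS is rejected; CHANGE COLUMN is the supported form.
    spark.sql(
      "ALTER TABLE mydb.events CHANGE COLUMN cnt cnt BIGINT COMMENT 'row count'")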

Multiple sessions in one application?

2018-12-19 Thread Jean Georges Perrin
Hi there, I was curious about what use cases would drive the use of newSession() (as in https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#newSession-- ). I
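
One use case, as a sketch: isolating temporary views and SQL configuration between two logical tenants inside one application (names are made up):

    // newSession() shares the SparkContext and cached data, but each session
    // gets its own SQL configuration and temporary-view namespace.
    val s1 = spark.newSession()
    val s2 = spark.newSession()

    s1.range(10).createOrReplaceTempView("t")      // visible only in s1
    s2.sql("SET spark.sql.shuffle.partitions=4")   // does not affect s1

    s1.sql("SELECT COUNT(*) FROM t").show()
    // s2.sql("SELECT * FROM t") would fail: t is not defined in s2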

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-19 Thread Sam Elamin
Hi Mich, I wrote a connector to make it easier to connect BigQuery and Spark. Have a look here: https://github.com/samelamin/spark-bigquery/ Your feedback is always welcome. Kind Regards, Sam. On Tue, Dec 18, 2018 at 7:46 PM Mich Talebzadeh wrote: > Thanks Jorn. I will try that. Requires