common logging in spark

2019-05-01 Thread rajat kumar
Hi All, I have heard that log4j will not be able to work properly. I have been told to use a logger in the Scala code. Is there any pointer for that? Thanks in advance for the help. rajat

Spark SQL LIMIT Gets Stuck

2019-05-01 Thread Shahab Yunus
Hi there. I have a Hive external table (storage format is ORC, data stored on S3, partitioned on one bigint-type column) that I am trying to query through the pyspark (or spark-shell) shell. df.count() fails for lower values of the LIMIT clause with the following exception (seen in the Spark UI.) df.show()

Re: Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-05-01 Thread Jeff Zhang
You can configure Zeppelin to store your notes in S3 http://zeppelin.apache.org/docs/0.8.1/setup/storage/storage.html#notebook-storage-in-s3 V0lleyBallJunki3 wrote on Wed, May 1, 2019 at 5:26 AM: > Hello. I am using Zeppelin on an Amazon EMR cluster while developing Apache > Spark programs in Scala. The
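The linked page boils down to a few environment settings. A minimal sketch of `conf/zeppelin-env.sh` for Zeppelin 0.8.x, with placeholder bucket and user names (adjust to your environment):

```shell
# Hypothetical values; replace the bucket and user with your own.
# Switches Zeppelin 0.8.x from the local notebook/ directory to S3 storage.
export ZEPPELIN_NOTEBOOK_S3_BUCKET=my-zeppelin-bucket
export ZEPPELIN_NOTEBOOK_S3_USER=zeppelin
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
```

With this, notes are kept under the configured bucket/user prefix in S3; on EMR, AWS credentials are typically picked up from the instance profile, so notebooks survive cluster termination.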

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi Anastasios, Thanks for this. I have a few doubts about this approach. The dropDuplicates operation will keep all the data across triggers. 1. Where is this data stored? - IN_MEMORY state means the data is not persisted during job resubmit. - Persistence on disk, like HDFS, has
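For intuition about what state dropDuplicates actually keeps between triggers, here is a pure-Python sketch (not Spark code; all names are made up) of watermark-bounded de-duplication: each key is remembered with its event time, and entries older than the watermark are evicted, which is what keeps the state from growing forever.

```python
def dedup_stream(events, watermark_delay):
    """De-duplicate (key, event_time) pairs, evicting state behind a watermark.

    Illustrative only: a toy model of the bookkeeping Spark's stateful
    dropDuplicates does when combined with withWatermark.
    """
    seen = {}            # key -> event time: the per-key state Spark keeps in its state store
    max_event_time = 0   # high-water mark of observed event times
    out = []
    for key, t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - watermark_delay
        # Evict state older than the watermark, as Spark does between triggers.
        seen = {k: v for k, v in seen.items() if v >= watermark}
        # Emit only events that are not too late and not already seen.
        if t >= watermark and key not in seen:
            seen[key] = t
            out.append((key, t))
    return out

print(dedup_stream([("a", 1), ("b", 2), ("a", 3), ("c", 100), ("a", 101)], 10))
# → [('a', 1), ('b', 2), ('c', 100), ('a', 101)]
```

In Spark itself this corresponds roughly to `df.withWatermark("eventTime", "10 minutes").dropDuplicates(["uuid", "eventTime"])`; the real state lives in the executors' state store and is written to the checkpoint location, so whether it survives a job resubmit depends on checkpointing to a durable store such as HDFS or S3.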

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Anastasios Zouzias
Hi, Have you checked the docs, i.e., https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication You can generate a uuid column in your streaming DataFrame and drop duplicate messages with a single line of code. Best, Anastasios On Wed, May 1, 2019

Re: Update / Delete records in Parquet

2019-05-01 Thread Vitaliy Pisarev
Ankit, you should take a look at delta.io, which was recently open-sourced by Databricks. Full DML support is on the way. From: "Khare, Ankit" Date: Tuesday, 23 April 2019 at 11:35 To: Chetan Khatri , Jason Nerothin Cc: user Subject: Re: Update / Delete records in Parquet

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi All, Floating this again. Any suggestions? Akshay Bhardwaj +91-97111-33849 On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Experts, > > I am using spark structured streaming to read message from Kafka, with a > producer that works with at-least

Error while using spark-avro module in pyspark 2.4

2019-05-01 Thread kanchan tewary
Hi All, Greetings! I am facing an error while trying to write my dataframe in Avro format, using the spark-avro package ( https://spark.apache.org/docs/latest/sql-data-sources-avro.html#deploying). I have added the package while running spark-submit as follows. Do I need to add any additional
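For Spark 2.4, the deploying section of that page amounts to passing the external spark-avro artifact to spark-submit. A sketch, where the job script name and the exact patch version are placeholders, and the `_2.11`/`_2.12` suffix must match the Scala version of your Spark build:

```shell
# Hypothetical job script; pick the spark-avro version matching your Spark/Scala build.
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 my_avro_job.py
```

Inside the job, writing is then `df.write.format("avro").save(path)`. If the package is not on the classpath, Spark 2.4 typically fails with "Failed to find data source: avro", since Avro support is not bundled with the core distribution.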