Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Sorry, I meant Spark 2.4 in my previous email. On Tue, Mar 6, 2018 at 9:15 PM, kant kodali wrote: > Hi TD, > > I agree; I think we are better off either with a full fix or no fix. I am > ok with the complete fix being available in master or some branch. I guess > the solution

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Hi TD, I agree; I think we are better off either with a full fix or no fix. I am ok with the complete fix being available in master or some branch. I guess the solution for me is to just build from source. On a similar note, I am not finding any JIRA tickets related to full outer joins and

Spark StreamingContext Question

2018-03-06 Thread रविशंकर नायर
Hi all, I understand from the documentation that only one StreamingContext can be active in a JVM at the same time. Hence, in an enterprise cluster, how can we manage/handle multiple users having many different streaming applications, where one may be ingesting data from Flume, another from Twitter

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread Tathagata Das
I thought about it. I am not 100% sure whether this fix should go into 2.3.1. There are two parts to this bug fix to enable self-joins. 1. Enabling deduping of leaf logical nodes by extending MultiInstanceRelation - This is safe to backport into the 2.3 branch as it does not touch

[Spark CSV DataframeWriter] Quote options for columns on write

2018-03-06 Thread Brandon Geise
My problem is related to the need to have all records in a specific column quoted when writing a CSV. I assumed that by setting the escapeQuotes option to false, fields would not have any type of quoting applied, even when the delimiter exists. Unless I am

Re: CachedKafkaConsumer: CachedKafkaConsumer is not running in UninterruptibleThread warning

2018-03-06 Thread Junfeng Chen
Spark 2.1.1. Actually it is a warning rather than an exception, so there is no stack trace. Just many lines like this: > CachedKafkaConsumer: CachedKafkaConsumer is not running in > UninterruptibleThread. It may hang when CachedKafkaConsumer's methods are > interrupted because of KAFKA-1894. Regards,

dependencies conflict in oozie spark action for spark 2

2018-03-06 Thread Lian Jiang
I am using HDP 2.6.4 and have followed https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spark-action.html to make oozie use spark2. After this, I found there are still a bunch of issues: 1. oozie and spark tries to add the same jars multiple time

Dynamic allocation Spark Streaming

2018-03-06 Thread KhajaAsmath Mohammed
Hi, I have enabled dynamic allocation for my Spark streaming application, but the number of containers always shows as 2. Is there a way to get more while the job is running? Thanks, Asmath
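A config sketch of the settings usually involved, assuming YARN with the external shuffle service; the min/max values are illustrative. Note that for DStream jobs Spark also ships a separate, streaming-specific switch (`spark.streaming.dynamicAllocation.enabled`), which is often recommended over core dynamic allocation for streaming workloads:

```python
# Hedged config sketch; property names are standard core
# dynamic-allocation settings, values are placeholders to tune.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")  # illustrative; normally set by spark-submit
         .appName("streaming-dyn-alloc")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "10")
         .getOrCreate())
```

If `minExecutors` is 2 and the micro-batches finish quickly, staying at 2 containers is expected behavior rather than a failure of dynamic allocation.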

Re: CachedKafkaConsumer: CachedKafkaConsumer is not running in UninterruptibleThread warning

2018-03-06 Thread Tathagata Das
Which version of Spark are you using? And can you give us the full stack trace of the exception? On Tue, Mar 6, 2018 at 1:53 AM, Junfeng Chen wrote: > I am trying to read kafka and save the data as parquet file on hdfs > according to this

Re: OutOfDirectMemoryError for Spark 2.2

2018-03-06 Thread Chawla,Sumit
No, this is the only stack trace I get. I have tried DEBUG but didn't notice much of a log change. Yes, I have tried bumping MaxDirectMemorySize to get rid of this error. It does work if I throw 4G+ memory at it. However, I am trying to understand this behavior so that I can set up this

Re: OutOfDirectMemoryError for Spark 2.2

2018-03-06 Thread Vadim Semenov
Do you have a trace? i.e. what's the source of `io.netty.*` calls? And have you tried bumping `-XX:MaxDirectMemorySize`? On Tue, Mar 6, 2018 at 12:45 AM, Chawla,Sumit wrote: > Hi All > > I have a job which processes a large dataset. All items in the dataset > are
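For readers following this thread: Netty's direct buffers (the `io.netty.*` allocations Vadim asks about) live outside the JVM heap and are capped by `-XX:MaxDirectMemorySize`. A config sketch of how that flag is typically passed to Spark processes; the `2g` value is purely illustrative:

```python
# Hedged sketch: raise the direct-memory ceiling on executors/driver.
# The right value depends on workload; the thread reports 4g+ sufficed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")  # illustrative; normally set by spark-submit
         .appName("direct-memory-tuning")
         .config("spark.executor.extraJavaOptions",
                 "-XX:MaxDirectMemorySize=2g")
         .config("spark.driver.extraJavaOptions",
                 "-XX:MaxDirectMemorySize=2g")
         .getOrCreate())
```

The same flags can be given on the `spark-submit` command line via `--conf`; tuning the ceiling treats the symptom, while the trace of the `io.netty.*` caller identifies the actual consumer.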

Distributed Nature of Spark and Time Series Temporal Dependence

2018-03-06 Thread arshanvit
Hi All, I am new to Spark and I am trying to use forecasting models on time-series data. As per my understanding, Spark DataFrames are a distributed collection of data. This distributed nature implies that chunks of data will not be dependent on each other and are possibly treated separately

CachedKafkaConsumer: CachedKafkaConsumer is not running in UninterruptibleThread warning

2018-03-06 Thread Junfeng Chen
I am trying to read from Kafka and save the data as parquet files on HDFS according to this: https://stackoverflow.com/questions/45827664/read-from-kafka-and-write-to-hdfs-in-parquet The code is similar to:
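The pipeline being described looks roughly like the following sketch, assuming the spark-sql-kafka connector is on the classpath and that broker address, topic name, and HDFS paths are placeholders:

```python
# Hedged sketch of a Kafka -> parquet structured-streaming job.
# Requires a reachable broker and the Kafka connector package;
# not runnable standalone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
       .option("subscribe", "events")                        # placeholder
       .load())

query = (raw.selectExpr("CAST(value AS STRING) AS json")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")              # placeholder
         .option("checkpointLocation", "hdfs:///chk/events") # placeholder
         .start())
```

The `CachedKafkaConsumer`/`UninterruptibleThread` warning discussed elsewhere in this digest is emitted by the Kafka source used in jobs of exactly this shape.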

Re: Properly stop applications or jobs within the application

2018-03-06 Thread bsikander
It seems to be related to this issue from Kafka: https://issues.apache.org/jira/browse/KAFKA-1894