Re: Dataset - withColumn and withColumnRenamed that accept Column type

2018-07-17 Thread Tathagata Das
Yes. Yes you can. On Tue, Jul 17, 2018 at 11:42 AM, Sathi Chowdhury wrote: > Hi, > My question is about the ability to integrate Spark Streaming with multiple > clusters. Is it a supported use case? An example of that is that two topics > owned by different groups, and they have their own kafka infra

Re: Dataframe from partitioned parquet table missing partition columns from schema

2018-07-17 Thread Nirav Patel
Just found out that I need the following option while reading: .option("basePath", "hdfs://localhost:9000/ptest/") https://stackoverflow.com/questions/43192940/why-is-partition-key-column-missing-from-dataframe On Tue, Jul 17, 2018 at 3:48 PM, Nirav Patel wrote: > I created a hive table with
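A minimal PySpark sketch of the fix Nirav describes (the HDFS path matches his message; the partition directory `date=2018-07-17` and app name are made-up placeholders). Setting basePath tells the parquet reader where partition discovery starts, so the partition columns stay in the inferred schema. This needs a live SparkSession and HDFS, so it is illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("basepath-demo").getOrCreate()

# Without basePath, reading a single partition directory drops the
# partition column from the inferred schema. Pointing basePath at the
# table root keeps partition discovery (and the partition columns) intact.
df = (
    spark.read
    .option("basePath", "hdfs://localhost:9000/ptest/")
    .parquet("hdfs://localhost:9000/ptest/date=2018-07-17")
)
df.printSchema()  # the partition column now appears alongside data columns
```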

Dataframe from partitioned parquet table missing partition columns from schema

2018-07-17 Thread Nirav Patel
I created a hive table with parquet storage using sparkSql. Now in the hive cli, when I do describe and select, I can see the partition columns in both, as regular columns as well as partition columns. However, if I try to do the same in sparkSql (Dataframe) I don't see the partition columns. I need to do projection

Re: Pyspark access to scala/java libraries

2018-07-17 Thread Mohit Jaggi
Thanks 0xF0F0F0 and Ashutosh for the pointers. Holden, I am trying to look into sparklingml... what am I looking for? Also, which chapter/page of your book should I look at? Mohit. On Sun, Jul 15, 2018 at 3:02 AM Holden Karau wrote: > If you want to see some examples in a library shows a way to

Re: Heap Memory in Spark 2.3.0

2018-07-17 Thread Imran Rashid
Perhaps this is https://issues.apache.org/jira/browse/SPARK-24578? That was reported as a performance issue, not OOMs, but it's in the exact same part of the code, and the change was to reduce the memory pressure significantly. On Mon, Jul 16, 2018 at 1:43 PM, Bryan Jeffrey wrote: > Hello. > > I

joining streams from multiple kafka clusters

2018-07-17 Thread sathich
Hi, My question is about the ability to integrate Spark Streaming with multiple clusters. Is it a supported use case? An example of that is that two topics owned by different groups, and they have their own kafka infra. Can I have two dataframes as a result of spark.readStream listening to different

Re: Dataset - withColumn and withColumnRenamed that accept Column type

2018-07-17 Thread sathich
this may work:

val df_post = listCustomCols.foldLeft(df_pre) { (tempDF, listValue) =>
  tempDF.withColumn(
    listValue.name,
    new Column(listValue.name.toString + funcUDF(listValue.name))
  )
}

and outsource the renaming to a udf, or you can rename the

Spark streaming connecting to two kafka clusters

2018-07-17 Thread Sathi Chowdhury
Hi, My question is about the ability to integrate Spark Streaming with multiple clusters. Is it a supported use case? An example of that is that two topics owned by different groups, and they have their own kafka infra. Can I have two dataframes as a result of spark.readStream listening to different

Re: Dataset - withColumn and withColumnRenamed that accept Column type

2018-07-17 Thread Sathi Chowdhury
Hi, My question is about the ability to integrate Spark Streaming with multiple clusters. Is it a supported use case? An example of that is that two topics owned by different groups, and they have their own kafka infra. Can I have two dataframes as a result of spark.readStream listening to different

Re: Running Production ML Pipelines

2018-07-17 Thread Shmuel Blitz
Hi, This is a very general question. It's hard to answer your question without fully understanding your business and technological needs. You might want to watch this video: https://www.youtube.com/watch?v=2UKSLHDH5vc&t=8s Shmuel On Tue, Jul 17, 2018 at 12:11 AM Gautam Singaraju <

Re:Re: spark sql data skew

2018-07-17 Thread 崔苗
30G of user data; how to get the distinct user count after creating a composite key based on company and userid? On 2018-07-13 18:24:52, Jean Georges Perrin wrote: Just thinking out loud… repartition by key? create a composite key based on company and userid? How big is your dataset? On Jul 13, 2018,
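The composite-key idea can be sketched without Spark: build one key per (company, userid) pair and count the distinct keys. The records below are made up; on the real 30G dataset the Spark equivalent would be along the lines of df.select(countDistinct(concat_ws("|", "company", "userid"))):

```python
# Toy records: (company, userid) pairs with duplicate events.
records = [
    ("acme", "u1"),
    ("acme", "u1"),    # same user seen twice -> one distinct user
    ("acme", "u2"),
    ("globex", "u1"),  # same userid but a different company -> distinct user
]

# Composite key: company + userid, so users are counted per company.
composite_keys = {f"{company}|{userid}" for company, userid in records}
distinct_users = len(composite_keys)
print(distinct_users)  # 3
```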

Query on Profiling Spark Code

2018-07-17 Thread Aakash Basu
Hi guys, I'm trying to profile my Spark code with cProfiler and check where the most time is taken. I found that the most time is taken by some socket object, which I'm quite clueless about, as to where it is used. Can anyone shed some light on this?
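For context, a minimal cProfile run looks like the sketch below (the profiled function is a stand-in, not Aakash's job). In PySpark driver profiles, time attributed to socket methods often means the Python process is blocked waiting on the JVM over its local socket rather than doing work itself, which is worth ruling out before optimizing the Python code:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the job being profiled.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the top entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)  # per-function call counts and cumulative times
```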