Yes. Yes you can.
On Tue, Jul 17, 2018 at 11:42 AM, Sathi Chowdhury wrote:
> Hi,
> My question is about the ability to integrate Spark streaming with multiple
> clusters. Is it a supported use case? An example: two topics owned by
> different groups, each with its own Kafka infra.
Just found out that I need following option while reading:
.option("basePath", "hdfs://localhost:9000/ptest/")
https://stackoverflow.com/questions/43192940/why-is-partition-key-column-missing-from-dataframe
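A minimal sketch of the basePath trick, assuming a partitioned Parquet layout under hdfs://localhost:9000/ptest/ (the partition column name `company` below is made up for illustration):

```scala
// Reading one partition directory directly drops the partition column
// from the schema:
val dfNoPartCol = spark.read
  .parquet("hdfs://localhost:9000/ptest/company=acme")

// Passing basePath tells Spark where partition discovery starts, so the
// partition column (company) is kept as a regular column in the DataFrame:
val dfWithPartCol = spark.read
  .option("basePath", "hdfs://localhost:9000/ptest/")
  .parquet("hdfs://localhost:9000/ptest/company=acme")
```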
On Tue, Jul 17, 2018 at 3:48 PM, Nirav Patel wrote:
> I created a hive table with pa
I created a Hive table with Parquet storage using SparkSQL. Now, in the Hive
CLI, when I do DESCRIBE and SELECT I can see the partition columns both as
regular columns and as partition columns. However, if I try the same in
SparkSQL (DataFrame) I don't see the partition columns.
I need to do projection o
Thanks 0xF0F0F0 and Ashutosh for the pointers.
Holden,
I am trying to look into sparklingml... what am I looking for? Also, which
chapter/page of your book should I look at?
Mohit.
On Sun, Jul 15, 2018 at 3:02 AM Holden Karau wrote:
> If you want to see some examples in a library shows a way to
Perhaps this is https://issues.apache.org/jira/browse/SPARK-24578?
That was reported as a performance issue, not OOMs, but it's in the exact
same part of the code, and the change reduced the memory pressure
significantly.
On Mon, Jul 16, 2018 at 1:43 PM, Bryan Jeffrey
wrote:
> Hello.
>
> I
Hi,
My question is about the ability to integrate Spark streaming with multiple
clusters. Is it a supported use case? An example: two topics owned by
different groups, each with its own Kafka infra.
Can I have two DataFrames as a result of spark.readStream listening to
different Kafka clusters?
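Nothing stops each readStream from pointing at its own broker list, since kafka.bootstrap.servers is set per source. A hedged sketch (broker addresses and topic names are placeholders, not from the thread):

```scala
// Source 1: first team's Kafka cluster
val df1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "cluster-a:9092")
  .option("subscribe", "topicA")
  .load()

// Source 2: second team's Kafka cluster, different brokers
val df2 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "cluster-b:9092")
  .option("subscribe", "topicB")
  .load()
```

Each source carries its own connection options, so the two clusters never need to know about each other.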
this may work
val df_post = listCustomCols
  .foldLeft(df_pre) { (tempDF, listValue) =>
    tempDF.withColumn(
      listValue.name,
      funcUDF(col(listValue.name)) // apply the UDF to the existing column
    )
  }
and outsource the renaming to a UDF, or you can rename the c
Hi,
This is a very general question. It's hard to answer your question without
fully understanding your business and technological needs.
You might want to watch this video:
https://www.youtube.com/watch?v=2UKSLHDH5vc&t=8s
Shmuel
On Tue, Jul 17, 2018 at 12:11 AM Gautam Singaraju <
gautam.singa
30G of user data: how to get a distinct user count after creating a composite
key based on company and userid?
On 2018-07-13 18:24:52, Jean Georges Perrin wrote:
Just thinking out loud… repartition by key? Create a composite key based on
company and userid?
How big is your dataset?
On Jul 13, 2018, a
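The composite-key idea above can be sketched like this (the column names company and userid come from the question; the rest is assumption):

```scala
import org.apache.spark.sql.functions.{col, concat_ws, countDistinct, approx_count_distinct}

// Build a composite key from company and userid, then count distinct keys
val exact = df.agg(
  countDistinct(concat_ws("|", col("company"), col("userid"))).as("users"))

// For ~30G of data, an approximate count is often much cheaper
val approx = df.agg(
  approx_count_distinct(concat_ws("|", col("company"), col("userid"))).as("users"))
```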
Hi guys,
I'm trying to profile my Spark code with cProfile to check where the most time
is spent. I found that the most time is taken by some socket object, which I'm
quite clueless about, as to where it is used.
Can anyone shed some light on this?
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)