strange behavior of joining dataframes

2018-03-20 Thread Shiyuan
Hi Spark users: I have a dataframe "df_t" which was generated from other dataframes by several transformations. Then I did something very simple, just counting the rows, with the following code: (A) df_t_1 = df_t.groupby(["Id","key"]).count().withColumnRenamed("count", "cnt1") df_t_2 =
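
A minimal Scala sketch of the pattern in the snippet above. The original is PySpark, and everything after df_t_1 was truncated, so the second aggregation and the join are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-count-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Stand-in for the df_t built from several upstream transformations.
    val df_t = Seq(("a", "k1"), ("a", "k1"), ("b", "k2")).toDF("Id", "key")

    // Count rows per (Id, key) and rename the count column, as in the post.
    val df_t_1 = df_t.groupBy("Id", "key").count().withColumnRenamed("count", "cnt1")

    // Hypothetical second aggregate joined back on the grouping keys.
    val df_t_2 = df_t.groupBy("Id", "key").count().withColumnRenamed("count", "cnt2")
    df_t_1.join(df_t_2, Seq("Id", "key")).show()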

Re: REST API for Spark 2.3 submit on Kubernetes (version 1.8.*) cluster

2018-03-20 Thread Yinan Li
One option is the Spark Operator. It allows specifying and running Spark applications on Kubernetes using Kubernetes custom resource objects. It takes SparkApplication CRD objects and automatically submits the applications to run on a

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Gurusamy Thirupathy
Hi Jörn, thanks for sharing the different options. Yes, we are trying to build a generic tool for Hive-to-Spark export. FYI, we currently use Sqoop and are trying to migrate from Sqoop to Spark. Thanks -G On Tue, Mar 20, 2018 at 2:17 AM, Jörn Franke wrote: > Write

REST API for Spark 2.3 submit on Kubernetes (version 1.8.*) cluster

2018-03-20 Thread purna pradeep
I'm using a Kubernetes cluster on AWS to run Spark jobs with Spark 2.3. Now I want to run spark-submit from an AWS Lambda function against the k8s master, and would like to know if there is any REST interface to run spark-submit on the k8s master.

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
Thanks Michael! That works! On Tue, Mar 20, 2018 at 5:00 PM, Michael Armbrust wrote: > Those options will not affect structured streaming. You are looking for > > .option("maxOffsetsPerTrigger", "1000") > > We are working on improving this by building a generic

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Michael Armbrust
Those options will not affect structured streaming. You are looking for:

.option("maxOffsetsPerTrigger", "1000")

We are working on improving this by building a generic mechanism into the Streaming DataSource V2 so that the engine can do admission control on the amount of data returned in a
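
For reference, a minimal sketch of where that option goes on a Kafka source; broker address and topic name are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rate-limit-sketch").getOrCreate()

    // Kafka source with per-trigger admission control: at most 1000 offsets
    // are consumed per micro-batch. Broker and topic are placeholders.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", "1000")
      .load()

    // A running count in update mode, matching the thread's "select count *".
    stream.groupBy().count()
      .writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()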

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
OK, this one works: .withColumn("hour", hour(from_unixtime(typedDataset.col("ts") / 1000))) 2018-03-20 22:43 GMT+01:00 Serega Sheypak : > Hi, any updates? Looks like some API inconsistency or bug? > > 2018-03-17 13:09 GMT+01:00 Serega Sheypak
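
A self-contained sketch of that fix; the sample millisecond timestamps are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_unixtime, hour}

    val spark = SparkSession.builder().appName("hour-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Epoch timestamps in milliseconds, as in the thread.
    val df = Seq(1521571199000L, 1521574800000L).toDF("ts")

    // from_unixtime expects seconds, so divide the millisecond value by 1000
    // before extracting the hour.
    df.withColumn("hour", hour(from_unixtime(col("ts") / 1000))).show()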

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
I am using Spark 2.3.0 and Kafka 0.10.2.0, so I assume structured streaming uses the Direct APIs, although I am not sure. If it is the Direct APIs, the only parameters that are relevant are below, according to this

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Hi, any updates? Looks like some API inconsistency or bug? 2018-03-17 13:09 GMT+01:00 Serega Sheypak : > > Not sure why you are dividing by 1000. from_unixtime expects a long type. > It expects seconds; I have milliseconds. > > > > 2018-03-12 6:16 GMT+01:00 vermanurag

Re: [Structured Streaming] Query Metrics to MetricsSink

2018-03-20 Thread lucas-vsco
It actually looks like I might have the answers via the following links: [Design] Metrics in Structured Streaming; JIRA - Structured Streaming - Metrics

Re: Access Table with Spark Dataframe

2018-03-20 Thread hemant singh
See if this helps - https://stackoverflow.com/questions/42852659/makiing-sql-request-on-columns-containing-dot : enclosing column names in backticks ("`"). On Tue, Mar 20, 2018 at 6:47 PM, SNEHASISH DUTTA wrote: > Hi, > > I am using Spark 2.2 , a table fetched from database contains
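
A minimal Scala sketch of the backtick fix; the dotted column name here is hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dotted-column-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical column whose name contains a dot.
    val df = Seq((1, "x")).toDF("id", "some.field")

    // Without backticks the analyzer reads "some.field" as struct field
    // access and fails; backticks quote the whole name as one identifier.
    df.select("`some.field`").show()
    df.createOrReplaceTempView("t")
    spark.sql("SELECT `some.field` FROM t").show()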

[Structured Streaming] Query Metrics to MetricsSink

2018-03-20 Thread lucas-vsco
I am looking to take the metrics exposed in the logs from MicroBatchExecution below and submit them as stats to the implemented MetricsSinks. 2018-03-20 10:28:48 INFO MicroBatchExecution:54 - Streaming query made progress: { "id" : "42bb5c95-980d-480d-9dee-72e1baf6a5b3", "runId" :
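
One way to get at the same numbers programmatically, rather than scraping the log line, is a StreamingQueryListener; Spark also has a spark.sql.streaming.metricsEnabled option that reports streaming query metrics through the regular metrics system. A minimal listener sketch, with the forwarding target left as an assumption:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{
      QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    val spark = SparkSession.builder().appName("listener-sketch").getOrCreate()

    // The same progress object that MicroBatchExecution logs is delivered
    // to registered listeners; pushing it to a metrics backend is up to you.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // e.g. forward p.numInputRows / p.processedRowsPerSecond to a sink
        println(s"query=${p.id} inputRows=${p.numInputRows}")
      }
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    })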

Re: [Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-03-20 Thread dcam
I'm just circling back to this now. Is the commit protocol an acceptable way of making this configurable? I could make the temp path (currently "_temporary") configurable if that is what you are referring to. Michael Armbrust wrote > We didn't go this way initially because it doesn't work on

Access Table with Spark Dataframe

2018-03-20 Thread SNEHASISH DUTTA
Hi, I am using Spark 2.2; a table fetched from a database contains a dot (.) in one of the column names. Whenever I try to select that particular column I get a query analysis exception. I have tried creating a temporary table using createOrReplaceTempView() and fetching the column's

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Geoff Von Allmen
The following settings may be what you’re looking for:
- spark.streaming.backpressure.enabled
- spark.streaming.backpressure.initialRate
- spark.streaming.receiver.maxRate
-
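
A sketch of setting these on a SparkConf, with illustrative values. Note that, per Michael Armbrust's reply elsewhere in this thread, these keys apply to the old DStreams API and do not affect Structured Streaming, which uses maxOffsetsPerTrigger instead:

    import org.apache.spark.SparkConf

    // DStreams-era rate limiting; the values here are illustrative.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "1000")
      .set("spark.streaming.receiver.maxRate", "10000")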

the meaning of "samplePointsPerPartitionHint" in RangePartitioner

2018-03-20 Thread 1427357...@qq.com
Hi all, below is the code of RangePartitioner:

class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true,
    val samplePointsPerPartitionHint: Int = 20)

I feel puzzled about the
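
As far as I can tell from the source, the hint scales the sampling pass used to pick the range bounds: roughly samplePointsPerPartitionHint * partitions keys are sampled in total (capped internally), so a larger hint yields better-balanced partitions at the cost of a bigger sample. A usage sketch, relying on the four-argument constructor quoted above:

    import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("range-sketch").setMaster("local[*]"))
    val pairs = sc.parallelize(1 to 100000).map(k => (k, k.toString))

    // Double the default hint of 20: more sample points per output
    // partition, hence more accurate range bounds.
    val partitioner = new RangePartitioner(8, pairs, ascending = true,
      samplePointsPerPartitionHint = 40)
    val ranged = pairs.partitionBy(partitioner)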

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Jörn Franke
Write your own Spark UDF and apply it to all varchar columns. Within this UDF you can use the SimpleDateFormat parse method: if it returns null, you return the content as varchar; if not, you return a date. If the content is null, you return null. Alternatively you can define an insert
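
A minimal Scala sketch of such a UDF. Two caveats as assumptions: a Spark UDF needs a single return type, so this version returns a date or null rather than mixing date and varchar, and it uses parse(String, ParsePosition), which returns null on failure, since parse(String) throws ParseException instead. The format pattern is also an assumption:

    import java.text.{ParsePosition, SimpleDateFormat}
    import org.apache.spark.sql.functions.{col, udf}

    // Returns a java.sql.Date when the string parses as yyyy-MM-dd (an
    // assumed pattern), and null otherwise -- including null input.
    val toDateIfPossible = udf { (s: String) =>
      if (s == null) null
      else {
        val fmt = new SimpleDateFormat("yyyy-MM-dd")
        fmt.setLenient(false)
        val parsed = fmt.parse(s, new ParsePosition(0))
        if (parsed == null) null else new java.sql.Date(parsed.getTime)
      }
    }

    // Applied to a hypothetical varchar column before writing to Oracle:
    // df.withColumn("maybe_date", toDateIfPossible(col("some_varchar_col")))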