strange behavior of joining dataframes

2018-03-20 Thread Shiyuan
Hi Spark-users:
I have a dataframe "df_t" which was generated from other dataframes by
several transformations. I then did something very simple, just counting
rows, as in the following code:

(A)
df_t_1 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt1")
df_t_2 = df_t.groupby("Id").count().withColumnRenamed("count", "cnt2")
df_t_3 = df_t_1.join(df_t_2, ["Id"])
df_t.join(df_t_3, ["Id","key"])

When I run this query, I get an error that "key" is missing during the
join. However, the column "key" is clearly in the dataframe df_t. What is
strange is that if I first do this:

 data = df_t.collect(); df_t = spark.createDataFrame(data);  (B)

then (A) runs without error. However, the code in (B) should not change
the dataframe df_t at all. Why does snippet (A) run with (B) but fail
without (B)? Also, a different join order completes without error:

(C)
df_t_1 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt1")
df_t_2 = df_t.groupby("Id").count().withColumnRenamed("count", "cnt2")
df_t.join(df_t_1, ["Id","key"]).join(df_t_2, ["Id"])

But (A) and (C) are conceptually the same and should produce the same
result. What could possibly go wrong here? Any hints to track down the
problem are appreciated. I am using Spark 2.1.
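
A workaround that is sometimes suggested for this kind of self-join
resolution problem is to rename the join keys on the aggregated side, so
that the final join condition never has to resolve "Id"/"key" against two
branches of the same lineage. A rough sketch (in Scala, though the same
approach works from PySpark; only the column names come from the question,
the rest is illustrative):

import org.apache.spark.sql.functions.col

val df_t_1 = df_t.groupBy("Id", "key").count().withColumnRenamed("count", "cnt1")
val df_t_2 = df_t.groupBy("Id").count().withColumnRenamed("count", "cnt2")

// Rename the keys on the aggregated side so each name in the final join
// condition has exactly one candidate column.
val df_t_3 = df_t_1.join(df_t_2, Seq("Id"))
  .withColumnRenamed("Id", "agg_Id")
  .withColumnRenamed("key", "agg_key")

val result = df_t
  .join(df_t_3, col("Id") === col("agg_Id") && col("key") === col("agg_key"))
  .drop("agg_Id", "agg_key")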


Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-20 Thread Yinan Li
One option is the Spark Operator. It allows specifying and running Spark
applications on Kubernetes using Kubernetes custom resource objects. It
takes SparkApplication CRD objects and automatically submits the
applications to run on a Kubernetes cluster.

Yinan

On Tue, Mar 20, 2018 at 7:47 PM, purna pradeep 
wrote:

> I'm using a Kubernetes cluster on AWS to run Spark jobs, with Spark 2.3.
> Now I want to run spark-submit from an AWS Lambda function against the k8s
> master, and would like to know if there is any REST interface to run Spark
> submit on the k8s master.


Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Gurusamy Thirupathy
Hi Jörn,

Thanks for sharing the different options. Yes, we are trying to build a
generic Hive-to-Oracle export tool using Spark.
FYI, we are currently using Sqoop and are trying to migrate from Sqoop to
Spark.

Thanks
-G

On Tue, Mar 20, 2018 at 2:17 AM, Jörn Franke  wrote:

> Write your own Spark UDF. Apply it to all varchar columns.
>
> Within this UDF you can use the SimpleDateFormat parse method. If this
> method returns null, you return the content as varchar; if not, you return
> a date. If the content is null, you return null.
>
> Alternatively you can define an insert function as pl/sql on Oracle side.
>
> Another alternative is to read the Oracle metadata for the table at
> runtime and then adapt your conversion based on this.
>
> However, this may not be perfect depending on your use case. Can you
> please provide more details/examples? Do you aim at a generic hive to
> Oracle import tool using Spark? Sqoop would not be an alternative?
>
> On 20. Mar 2018, at 03:45, Gurusamy Thirupathy 
> wrote:
>
> Hi guha,
>
> Thanks for your quick response, option a and b are in our table already.
> For option b, again the same problem, we don't know which column is date.
>
>
> Thanks,
> -G
>
> On Sun, Mar 18, 2018 at 9:36 PM, Deepak Sharma 
> wrote:
>
>> The other approach would be to write to a temp table and then merge the
>> data. But this may be an expensive solution.
>>
>> Thanks
>> Deepak
>>
>> On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to read data from Hive as a DataFrame and then write the DF
>>> into an Oracle database. In this case, the date field/column in Hive has
>>> type Varchar(20), but the corresponding column type in Oracle is Date.
>>> While reading from Hive, the Hive table names are decided dynamically
>>> (read from another table) based on some job condition (e.g. Job1). There
>>> are multiple tables like this, so the column and table names are known
>>> only at run time, and I can't do the type conversion explicitly when
>>> reading from Hive.
>>>
>>> So is there any utility/API available in Spark to handle this
>>> conversion?
>>>
>>>
>>> Thanks,
>>> Guru
>>>
>>
>
>
> --
> Thanks,
> Guru
>
>


-- 
Thanks,
Guru


Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-20 Thread purna pradeep
I'm using a Kubernetes cluster on AWS to run Spark jobs, with Spark 2.3.
Now I want to run spark-submit from an AWS Lambda function against the k8s
master, and would like to know if there is any REST interface to run Spark
submit on the k8s master.


Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
Thanks Michael! that works!

On Tue, Mar 20, 2018 at 5:00 PM, Michael Armbrust 
wrote:

> Those options will not affect structured streaming.  You are looking for
>
> .option("maxOffsetsPerTrigger", "1000")
>
> We are working on improving this by building a generic mechanism into the
> Streaming DataSource V2 so that the engine can do admission control on the
> amount of data returned in a source independent way.
>
> On Tue, Mar 20, 2018 at 2:58 PM, kant kodali  wrote:
>
>> I am using Spark 2.3.0 and Kafka 0.10.2.0, so I assume Structured
>> Streaming uses the direct APIs, although I am not sure. If it is the
>> direct APIs, the only parameters that are relevant are the ones below,
>> according to this article:
>>
>>    - spark.conf("spark.streaming.backpressure.enabled", "true")
>>    - spark.conf("spark.streaming.kafka.maxRatePerPartition", "1")
>>
>> I set both of these and ran select count * on my 10M records. I still
>> don't see any output until the initial batch of 10M finishes, and that
>> takes a while, so I am wondering if I am missing something here.
>>
>> On Tue, Mar 20, 2018 at 6:09 AM, Geoff Von Allmen > > wrote:
>>
>>> The following settings may be what you’re looking for:
>>>
>>>- spark.streaming.backpressure.enabled
>>>- spark.streaming.backpressure.initialRate
>>>- spark.streaming.receiver.maxRate
>>>- spark.streaming.kafka.maxRatePerPartition
>>>
>>> ​
>>>
>>> On Mon, Mar 19, 2018 at 5:27 PM, kant kodali  wrote:
>>>
 Yes it indeed makes sense! Is there a way to get incremental counts
 when I start from 0 and go through 10M records? perhaps count for every
 micro batch or something?

 On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen <
 ge...@ibleducation.com> wrote:

> Trigger does not mean report the current solution every 'trigger
> seconds'. It means it will attempt to fetch new data and process it no
> faster than trigger seconds intervals.
>
> If you're reading from the beginning and you've got 10M entries in
> kafka, it's likely pulling everything down then processing it completely
> and giving you an initial output. From here on out, it will check kafka
> every 1 second for new data and process it, showing you only the updated
> rows. So the initial read will give you the entire output since there is
> nothing to be 'updating' from. If you add data to kafka now that the
> streaming job has completed its first batch (and leave it running), it
> will then show you the new/updated rows since the last batch every 1 
> second
> (assuming it can fetch + process in that time span).
>
> If the combined fetch + processing time is > the trigger time, you
> will notice warnings that it is 'falling behind' (I forget the exact
> verbiage, but something to the effect of the calculation took XX time and
> is falling behind). In that case, it will immediately check kafka for new
> messages and begin processing the next batch (if new messages exist).
>
> Hope that makes sense -
>
>
> On Mon, Mar 19, 2018 at 13:36 kant kodali  wrote:
>
>> Hi All,
>>
>> I have 10 million records in my Kafka and I am just trying to
>> spark.sql(select count(*) from kafka_view). I am reading from kafka and
>> writing to kafka.
>>
>> My writeStream is set to "update" mode and trigger interval of one
>> second (Trigger.ProcessingTime(1000)). I expect the counts to be
>> printed every second but looks like it would print after going through 
>> all
>> 10M. why?
>>
>> Also, it seems to take forever whereas Linux wc of 10M rows would
>> take 30 seconds.
>>
>> Thanks!
>>
>

>>>
>>
>


Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Michael Armbrust
Those options will not affect structured streaming.  You are looking for

.option("maxOffsetsPerTrigger", "1000")

We are working on improving this by building a generic mechanism into the
Streaming DataSource V2 so that the engine can do admission control on the
amount of data returned in a source independent way.
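
For reference, a minimal sketch of where that option goes on the Kafka
source (the broker address and topic name are placeholders):

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "my_topic")                   // placeholder
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "1000")  // cap offsets consumed per micro-batch
  .load()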

On Tue, Mar 20, 2018 at 2:58 PM, kant kodali  wrote:

> I am using Spark 2.3.0 and Kafka 0.10.2.0, so I assume Structured Streaming
> uses the direct APIs, although I am not sure. If it is the direct APIs, the
> only parameters that are relevant are the ones below, according to this
> article:
>
>    - spark.conf("spark.streaming.backpressure.enabled", "true")
>    - spark.conf("spark.streaming.kafka.maxRatePerPartition", "1")
>
> I set both of these and ran select count * on my 10M records. I still don't
> see any output until the initial batch of 10M finishes, and that takes a
> while, so I am wondering if I am missing something here.
>
> On Tue, Mar 20, 2018 at 6:09 AM, Geoff Von Allmen 
> wrote:
>
>> The following settings may be what you’re looking for:
>>
>>- spark.streaming.backpressure.enabled
>>- spark.streaming.backpressure.initialRate
>>- spark.streaming.receiver.maxRate
>>- spark.streaming.kafka.maxRatePerPartition
>>
>> ​
>>
>> On Mon, Mar 19, 2018 at 5:27 PM, kant kodali  wrote:
>>
>>> Yes it indeed makes sense! Is there a way to get incremental counts when
>>> I start from 0 and go through 10M records? perhaps count for every micro
>>> batch or something?
>>>
>>> On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen <
>>> ge...@ibleducation.com> wrote:
>>>
 Trigger does not mean report the current solution every 'trigger
 seconds'. It means it will attempt to fetch new data and process it no
 faster than trigger seconds intervals.

 If you're reading from the beginning and you've got 10M entries in
 kafka, it's likely pulling everything down then processing it completely
 and giving you an initial output. From here on out, it will check kafka
 every 1 second for new data and process it, showing you only the updated
 rows. So the initial read will give you the entire output since there is
 nothing to be 'updating' from. If you add data to kafka now that the
 streaming job has completed its first batch (and leave it running), it
 will then show you the new/updated rows since the last batch every 1 second
 (assuming it can fetch + process in that time span).

 If the combined fetch + processing time is > the trigger time, you will
 notice warnings that it is 'falling behind' (I forget the exact verbiage,
 but something to the effect of the calculation took XX time and is falling
 behind). In that case, it will immediately check kafka for new messages and
 begin processing the next batch (if new messages exist).

 Hope that makes sense -


 On Mon, Mar 19, 2018 at 13:36 kant kodali  wrote:

> Hi All,
>
> I have 10 million records in my Kafka and I am just trying to
> spark.sql(select count(*) from kafka_view). I am reading from kafka and
> writing to kafka.
>
> My writeStream is set to "update" mode and trigger interval of one
> second (Trigger.ProcessingTime(1000)). I expect the counts to be
> printed every second but looks like it would print after going through all
> 10M. why?
>
> Also, it seems to take forever whereas Linux wc of 10M rows would take
> 30 seconds.
>
> Thanks!
>

>>>
>>
>


Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Ok, this one works:

.withColumn("hour", hour(from_unixtime(typedDataset.col("ts") / 1000)))



2018-03-20 22:43 GMT+01:00 Serega Sheypak :

> Hi, any updates? Looks like some API inconsistency or bug..?
>
> 2018-03-17 13:09 GMT+01:00 Serega Sheypak :
>
>> > Not sure why you are dividing by 1000. from_unixtime expects a long type
>> It expects seconds, I have milliseconds.
>>
>>
>>
>> 2018-03-12 6:16 GMT+01:00 vermanurag :
>>
>>> Not sure why you are dividing by 1000. from_unixtime expects a long type
>>> which is time in milliseconds from reference date.
>>>
>>> The following should work:
>>>
>>> val ds = dataset.withColumn("hour", hour(from_unixtime(dataset.col("ts"))))
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
I am using Spark 2.3.0 and Kafka 0.10.2.0, so I assume Structured Streaming
uses the direct APIs, although I am not sure. If it is the direct APIs, the
only parameters that are relevant are the ones below, according to this
article:

   - spark.conf("spark.streaming.backpressure.enabled", "true")
   - spark.conf("spark.streaming.kafka.maxRatePerPartition", "1")

I set both of these and ran select count * on my 10M records. I still don't
see any output until the initial batch of 10M finishes, and that takes a
while, so I am wondering if I am missing something here.

On Tue, Mar 20, 2018 at 6:09 AM, Geoff Von Allmen 
wrote:

> The following settings may be what you’re looking for:
>
>- spark.streaming.backpressure.enabled
>- spark.streaming.backpressure.initialRate
>- spark.streaming.receiver.maxRate
>- spark.streaming.kafka.maxRatePerPartition
>
> ​
>
> On Mon, Mar 19, 2018 at 5:27 PM, kant kodali  wrote:
>
>> Yes it indeed makes sense! Is there a way to get incremental counts when
>> I start from 0 and go through 10M records? perhaps count for every micro
>> batch or something?
>>
>> On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen > > wrote:
>>
>>> Trigger does not mean report the current solution every 'trigger
>>> seconds'. It means it will attempt to fetch new data and process it no
>>> faster than trigger seconds intervals.
>>>
>>> If you're reading from the beginning and you've got 10M entries in
>>> kafka, it's likely pulling everything down then processing it completely
>>> and giving you an initial output. From here on out, it will check kafka
>>> every 1 second for new data and process it, showing you only the updated
>>> rows. So the initial read will give you the entire output since there is
>>> nothing to be 'updating' from. If you add data to kafka now that the
>>> streaming job has completed its first batch (and leave it running), it
>>> will then show you the new/updated rows since the last batch every 1 second
>>> (assuming it can fetch + process in that time span).
>>>
>>> If the combined fetch + processing time is > the trigger time, you will
>>> notice warnings that it is 'falling behind' (I forget the exact verbiage,
>>> but something to the effect of the calculation took XX time and is falling
>>> behind). In that case, it will immediately check kafka for new messages and
>>> begin processing the next batch (if new messages exist).
>>>
>>> Hope that makes sense -
>>>
>>>
>>> On Mon, Mar 19, 2018 at 13:36 kant kodali  wrote:
>>>
 Hi All,

 I have 10 million records in my Kafka and I am just trying to
 spark.sql(select count(*) from kafka_view). I am reading from kafka and
 writing to kafka.

 My writeStream is set to "update" mode and trigger interval of one
 second (Trigger.ProcessingTime(1000)). I expect the counts to be
 printed every second but looks like it would print after going through all
 10M. why?

 Also, it seems to take forever whereas Linux wc of 10M rows would take
 30 seconds.

 Thanks!

>>>
>>
>


Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Hi, any updates? Looks like some API inconsistency or bug..?

2018-03-17 13:09 GMT+01:00 Serega Sheypak :

> > Not sure why you are dividing by 1000. from_unixtime expects a long type
> It expects seconds, I have milliseconds.
>
>
>
> 2018-03-12 6:16 GMT+01:00 vermanurag :
>
>> Not sure why you are dividing by 1000. from_unixtime expects a long type
>> which is time in milliseconds from reference date.
>>
>> The following should work:
>>
>> val ds = dataset.withColumn("hour", hour(from_unixtime(dataset.col("ts"))))
>>
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: [Structured Streaming] Query Metrics to MetricsSink

2018-03-20 Thread lucas-vsco
It actually looks like I might have the answers via the following links:

[Design] Metrics in Structured Streaming

JIRA - Structured Streaming - Metrics

Thanks.







--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Access Table with Spark Dataframe

2018-03-20 Thread hemant singh
See if this helps -
https://stackoverflow.com/questions/42852659/makiing-sql-request-on-columns-containing-dot
i.e. enclosing column names in backticks (`).
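
For example, assuming a column literally named "a.b" and a registered view
"my_view" (both names illustrative):

import org.apache.spark.sql.functions.col

df.select(col("`a.b`"))                  // DataFrame API
spark.sql("SELECT `a.b` FROM my_view")   // SQL on the temp view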

On Tue, Mar 20, 2018 at 6:47 PM, SNEHASISH DUTTA 
wrote:

> Hi,
>
> I am using Spark 2.2, and a table fetched from the database contains a dot
> (.) in one of the column names.
> Whenever I try to select that particular column I get a query analysis
> exception.
>
>
> I have tried creating a temporary table using createOrReplaceTempView()
> and fetching the column's data, but the outcome was the same.
>
> How can this ('.') be escaped while querying?
>
>
> Thanks and Regards,
> Snehasish
>


[Structured Streaming] Query Metrics to MetricsSink

2018-03-20 Thread lucas-vsco
I am looking to take the metrics exposed in the logs from MicroBatchExecution
below and submit them as stats to implemented MetricsSinks.

2018-03-20 10:28:48 INFO  MicroBatchExecution:54 - Streaming query made
progress: {
  "id" : "42bb5c95-980d-480d-9dee-72e1baf6a5b3",
  "runId" : "d9209988-7715-47f6-8845-1ea8208ecd9e",
  "name" : null,
  "timestamp" : "2018-03-20T14:28:45.074Z",
  "batchId" : 1,
  "numInputRows" : 10,
  "inputRowsPerSecond" : 909.0909090909091,
  "processedRowsPerSecond" : 2.8019052956010086,
  "durationMs" : {
"addBatch" : 3462,
"getBatch" : 4,
"getOffset" : 2,
"queryPlanning" : 33,
"triggerExecution" : 3569,
"walCommit" : 64
  },
  "stateOperators" : [ {
"numRowsTotal" : 10,
"numRowsUpdated" : 8,
"memoryUsedBytes" : 27191
  } ],
  "sources" : [ {
"description" : "KafkaSource[Subscribe[image-enrichment.test]]",
"startOffset" : {
  "image-enrichment.test" : {
"0" : 7970
  }
},
"endOffset" : {
  "image-enrichment.test" : {
"0" : 7980
  }
},
"numInputRows" : 10,
"inputRowsPerSecond" : 909.0909090909091,
"processedRowsPerSecond" : 2.8019052956010086
  } ],
  "sink" : {
"description" :
"org.apache.spark.sql.kafka010.KafkaSourceProvider@3263da34"
  }
}

Is there a way to write a custom wrapper to handle this?
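
One approach is to register a StreamingQueryListener (Scala/Java API) and
forward fields from the progress object to whatever sink client you use;
the printlns below are placeholders for that client. A sketch, assuming
spark is the active SparkSession:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress carries the same fields as the JSON logged above.
    val p = event.progress
    // Replace these printlns with calls into your metrics client of choice.
    println(s"numInputRows=${p.numInputRows}")
    println(s"processedRowsPerSecond=${p.processedRowsPerSecond}")
    println(s"triggerExecutionMs=${p.durationMs.get("triggerExecution")}")
  }
})

The listener callbacks run on the driver, so this covers query-level
progress rather than per-executor metrics.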



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-03-20 Thread dcam
I'm just circling back to this now. Is the commit protocol an acceptable way
of making this configurable? I could make the temp path (currently
"_temporary") configurable if that is what you are referring to.


Michael Armbrust wrote
> We didn't go this way initially because it doesn't work on storage systems
> that have weaker guarantees than HDFS with respect to rename.  That said,
> I'm happy to look at other options if we want to make this configurable.
> 
>> After hesitating for a while, I wrote a custom commit protocol to solve
>> the problem. It combines HadoopMapReduceCommitProtocol's behavior of
>> writing to a temp file first, with ManifestFileCommitProtocol. From what
>> I can tell ManifestFileCommitProtocol is required for the normal
>> Structured
>> Streaming behavior of being able to resume streams from a known point.





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Access Table with Spark Dataframe

2018-03-20 Thread SNEHASISH DUTTA
Hi,

I am using Spark 2.2, and a table fetched from the database contains a dot
(.) in one of the column names.
Whenever I try to select that particular column I get a query analysis
exception.


I have tried creating a temporary table using createOrReplaceTempView()
and fetching the column's data, but the outcome was the same.

How can this ('.') be escaped while querying?


Thanks and Regards,
Snehasish


Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Geoff Von Allmen
The following settings may be what you’re looking for:

   - spark.streaming.backpressure.enabled
   - spark.streaming.backpressure.initialRate
   - spark.streaming.receiver.maxRate
   - spark.streaming.kafka.maxRatePerPartition

​

On Mon, Mar 19, 2018 at 5:27 PM, kant kodali  wrote:

> Yes it indeed makes sense! Is there a way to get incremental counts when I
> start from 0 and go through 10M records? perhaps count for every micro
> batch or something?
>
> On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen 
> wrote:
>
>> Trigger does not mean report the current solution every 'trigger
>> seconds'. It means it will attempt to fetch new data and process it no
>> faster than trigger seconds intervals.
>>
>> If you're reading from the beginning and you've got 10M entries in kafka,
>> it's likely pulling everything down then processing it completely and
>> giving you an initial output. From here on out, it will check kafka every 1
>> second for new data and process it, showing you only the updated rows. So
>> the initial read will give you the entire output since there is nothing to
>> be 'updating' from. If you add data to kafka now that the streaming job has
>> completed its first batch (and leave it running), it will then show you
>> the new/updated rows since the last batch every 1 second (assuming it can
>> fetch + process in that time span).
>>
>> If the combined fetch + processing time is > the trigger time, you will
>> notice warnings that it is 'falling behind' (I forget the exact verbiage,
>> but something to the effect of the calculation took XX time and is falling
>> behind). In that case, it will immediately check kafka for new messages and
>> begin processing the next batch (if new messages exist).
>>
>> Hope that makes sense -
>>
>>
>> On Mon, Mar 19, 2018 at 13:36 kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> I have 10 million records in my Kafka and I am just trying to
>>> spark.sql(select count(*) from kafka_view). I am reading from kafka and
>>> writing to kafka.
>>>
>>> My writeStream is set to "update" mode and trigger interval of one
>>> second (Trigger.ProcessingTime(1000)). I expect the counts to be
>>> printed every second but looks like it would print after going through all
>>> 10M. why?
>>>
>>> Also, it seems to take forever whereas Linux wc of 10M rows would take
>>> 30 seconds.
>>>
>>> Thanks!
>>>
>>
>


the meaining of "samplePointsPerPartitionHint" in RangePartitioner

2018-03-20 Thread 1427357...@qq.com
Hi all,

The following is the code of RangePartitioner:
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true,
val samplePointsPerPartitionHint: Int = 20)
I am puzzled about samplePointsPerPartitionHint.
My questions are:
What is samplePointsPerPartitionHint used for?
If I set samplePointsPerPartitionHint to 100 or 20, what will happen?

Thanks.

Robin Shao




1427357...@qq.com
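
As far as I can tell from the implementation, samplePointsPerPartitionHint
controls how many sample points per output partition are drawn when
RangePartitioner samples the RDD to compute its range bounds: a larger
value means a bigger sample, so the resulting partitions tend to be more
evenly sized, at the cost of a more expensive sampling pass (20 is the
default). A small sketch, where the RDD is only an illustration and spark
is assumed to be the active SparkSession:

import org.apache.spark.RangePartitioner

val pairRdd = spark.sparkContext.parallelize((1 to 100000).map(i => (i, i.toString)))

// Same target partition count, different sampling accuracy.
val coarse = new RangePartitioner(8, pairRdd, ascending = true, samplePointsPerPartitionHint = 20)
val finer  = new RangePartitioner(8, pairRdd, ascending = true, samplePointsPerPartitionHint = 100)

val repartitioned = pairRdd.partitionBy(finer)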


Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Jörn Franke
Write your own Spark UDF. Apply it to all varchar columns.

Within this UDF you can use the SimpleDateFormat parse method. If this
method returns null, you return the content as varchar; if not, you return a
date. If the content is null, you return null.

Alternatively you can define an insert function as pl/sql on Oracle side.

Another alternative is to read the Oracle metadata for the table at runtime and 
then adapt your conversion based on this. 

However, this may not be perfect depending on your use case. Can you please 
provide more details/examples? Do you aim at a generic hive to Oracle import 
tool using Spark? Sqoop would not be an alternative?
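
A rough sketch of that UDF idea (the date pattern "yyyy-MM-dd", the column
name "order_dt" and the DataFrame name hiveDf are assumptions; note that
SimpleDateFormat.parse throws rather than returning null, so the sketch
catches the exception instead):

import java.sql.Date
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{col, udf}

val pattern = "yyyy-MM-dd"  // assumed format

// Returns a java.sql.Date when the string parses, null otherwise.
val parseDateOrNull = udf { s: String =>
  if (s == null) null
  else {
    val sdf = new SimpleDateFormat(pattern)  // created per call: SimpleDateFormat is not thread-safe
    sdf.setLenient(false)
    try new Date(sdf.parse(s).getTime)
    catch { case _: java.text.ParseException => null }
  }
}

// Applied to a varchar column that should become DATE on the Oracle side:
val converted = hiveDf.withColumn("order_dt", parseDateOrNull(col("order_dt")))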

> On 20. Mar 2018, at 03:45, Gurusamy Thirupathy  wrote:
> 
> Hi guha,
> 
> Thanks for your quick response, option a and b are in our table already. For 
> option b, again the same problem, we don't know which column is date.
> 
> 
> Thanks,
> -G
> 
>> On Sun, Mar 18, 2018 at 9:36 PM, Deepak Sharma  wrote:
>> The other approach would be to write to a temp table and then merge the
>> data. But this may be an expensive solution.
>> 
>> Thanks
>> Deepak
>> 
>>> On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy  wrote:
>>> Hi,
>>> 
>>> I am trying to read data from Hive as a DataFrame and then write the DF
>>> into an Oracle database. In this case, the date field/column in Hive has
>>> type Varchar(20), but the corresponding column type in Oracle is Date.
>>> While reading from Hive, the Hive table names are decided dynamically
>>> (read from another table) based on some job condition (e.g. Job1). There
>>> are multiple tables like this, so the column and table names are known
>>> only at run time, and I can't do the type conversion explicitly when
>>> reading from Hive.
>>> 
>>> So is there any utility/API available in Spark to handle this
>>> conversion?
>>> 
>>> 
>>> Thanks,
>>> Guru
> 
> 
> 
> -- 
> Thanks,
> Guru