Hi Omer,
Here are a couple of solutions you can implement for your use case:
*Option 1:*
You can mount the S3 bucket as a local file system. Here are the details:
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
*Option 2:*
You can use AWS Glue for your use case.
You can also try the dropDuplicates function; a minimal sketch follows the link below:
https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
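A minimal sketch of dropDuplicates usage (Spark 2.x Scala; the SparkSession setup, S3 paths and column names here are illustrative assumptions, not from the original thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dedup").getOrCreate()
val df = spark.read.parquet("s3://my-bucket/input/")      // input location is a placeholder

val dedupedAll  = df.dropDuplicates()                     // drop fully identical rows
val dedupedKeys = df.dropDuplicates(Seq("id", "date"))    // dedupe on selected key columns only

dedupedKeys.write.parquet("s3://my-bucket/output/")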
On 31 May 2018 at 16:34, wrote:
> Hi there!
>
> I have a potentially large dataset (in terms of number of rows and cols),
>
> and I want to find the fastest way
Hi,
Here is an example snippet in Scala:
// Convert to a Date type
val timestamp2datetype: (Column) => Column = (x) => { to_date(x) }
df = df.withColumn("date", timestamp2datetype(col("end_date")))
Hope this helps!
Thanks,
Divya
On 28 March 2018 at 15:16, Junfeng Chen
Hi,
I am getting the below error when creating a DataFrame from a Twitter streaming RDD:
val sparkSession:SparkSession = SparkSession
.builder
.appName("twittertest2")
.master("local[*]")
.enableHiveSupport()
DataSource APIs to build streaming
> sources are not public yet, and are in flux.
>
> 2. Use Kafka/Kinesis as an intermediate system: Write something simple
> that uses Twitter APIs directly to read tweets and write them into
> Kafka/Kinesis. And then just read from Kafka/Kinesis
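To make the suggestion above concrete, a hedged sketch of the read-from-Kafka side in Structured Streaming (Spark 2.x, needs the spark-sql-kafka-0-10 package and an active SparkSession named spark; the broker address and topic name are assumptions):

val tweets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "tweets")
  .load()
  .selectExpr("CAST(value AS STRING) AS tweet_json")

val query = tweets.writeStream
  .format("console")
  .outputMode("append")
  .start()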
Hi,
I see. Does that mean Spark Structured Streaming doesn't work with Twitter
streams?
I could see people used Kafka or other streaming tools and then used Spark to
process the data with Structured Streaming.
So the below doesn't work directly with a Twitter stream until I set up Kafka?
> import
Hi,
I am exploring Spark Structured Streaming.
When I turned to the internet to understand it, I found it is mostly
integrated with Kafka or other streaming tools like Kinesis.
I couldn't find where we can use the Spark Streaming API for Twitter streaming
data.
Would be grateful if anybody has used
Hi,
I have a CDH cluster and am running a PySpark script in client mode.
There are different Python versions installed on the client and worker nodes, and
I was getting a Python version mismatch error.
To resolve this issue I followed the below Cloudera document
Hi,
I have a Spark standalone cluster on AWS EC2, and recently my Spark streaming
jobs have been stopping abruptly.
When I checked the logs I found this:
17/03/07 06:09:39 INFO ProtocolStateActor: No response from remote.
Handshake timed out or transport failure detector triggered.
17/03/07 06:09:39 ERROR
Hi,
I am using an EMR machine and I can see the Spark log directory has grown
to 4 GB.
File name: spark-history-server.out
Need advice on how I can reduce the size of the above-mentioned file.
Is there a config property which can help me?
Thanks,
Divya
Hi Mich,
Have you set SPARK_CLASSPATH in spark-env.sh?
Thanks,
Divya
On 27 December 2016 at 17:33, Mich Talebzadeh
wrote:
> When one runs in Local mode (one JVM) on an edge host (the host user
> accesses the cluster), it is possible to put additional jar file
Hi Mich,
Can you try placing these jars in the Spark classpath?
It should work.
Thanks,
Divya
On 22 December 2016 at 05:40, Mich Talebzadeh
wrote:
> This works with Spark 2 with Oracle jar file added to
>
> $SPARK_HOME/conf/spark-defaults.conf
>
>
>
>
>
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
Hope this helps
Thanks,
Divya
On 15 December 2016 at 12:49, Milin korath
wrote:
> Hi
>
> I have a spark data frame with following structure
>
> id flag price date
> a 0
I am not a PySpark person,
but from the errors I could figure out that your Spark application is
having memory issues.
Are you collecting the results to the driver at any point, or have you
configured too little memory for the nodes?
Also, if you are using DataFrames then there is an issue raised in
You can use UDFs to do it:
http://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe
Hope it will help.
Thanks,
Divya
On 9 December 2016 at 00:53, Anton Kravchenko
wrote:
> Hello,
>
> I wonder if there is a way (preferably
It depends on the use case...
Spark always depends on resource availability.
As long as you have resources to accommodate them, you can run as many Spark/Spark
Streaming applications as you like.
Thanks,
Divya
On 15 December 2016 at 08:42, shyla deshpande
wrote:
> How many Spark
Hi,
You can use the Column functions provided by the Spark API:
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
Hope this helps .
Thanks,
Divya
On 17 November 2016 at 12:08, 颜发才(Yan Facai) wrote:
> Hi,
> I have a sample, like:
>
http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element
Hope this helps
Thanks,
Divya
On 19 October 2016 at 11:35, lk_spark wrote:
> hi,all:
> I want to read a json file and search it by sql .
> the data struct
Can you please elaborate on your use case?
On 18 October 2016 at 15:48, muhammet pakyürek wrote:
>
>
>
>
>
> --
> *From:* muhammet pakyürek
> *Sent:* Monday, October 17, 2016 11:51 AM
> *To:* user@spark.apache.org
> *Subject:*
If my understanding of your query is correct:
in Spark, DataFrames are immutable; you can't update a DataFrame in place.
You have to create a new DataFrame with the updated values (a sketch is below).
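A minimal sketch of what that looks like in practice (the column name and values are placeholders):

import org.apache.spark.sql.functions.{col, lit, when}

val updatedDF = df.withColumn(
  "status",
  when(col("status") === "old_value", lit("new_value")).otherwise(col("status"))
)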
Thanks,
Divya
On 17 October 2016 at 09:50, Mungeol Heo wrote:
> Hello, everyone.
>
> As
Hi Mich,
You can also create a DataFrame from an RDD in the below manner:
val df = sqlContext.createDataFrame(rdd,schema)
val df = sqlContext.createDataFrame(rdd)
The article below may also help you:
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
On 11 October 2016 at
Hi,
The input data files for my Spark job are generated every five minutes; the file
names follow the epoch time convention as below:
InputFolder/batch-147495960
InputFolder/batch-147495990
InputFolder/batch-147496020
InputFolder/batch-147496050
InputFolder/batch-147496080
Hi,
I have initialised the logging in my spark App
/*Initialize Logging */
val log = Logger.getLogger(getClass.getName)
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
log.warn("Some text"+Somemap.size)
When I run my Spark job using spark-submit like
Spark version, please?
On 21 September 2016 at 09:46, Sankar Mittapally <
sankar.mittapa...@creditvidya.com> wrote:
> Yeah I can do all operations on that folder
>
> On Sep 21, 2016 12:15 AM, "Kevin Mellott"
> wrote:
>
>> Are you able to manually delete the folder below?
The exit code 52 comes from org.apache.spark.util.SparkExitCode, and it is
val OOM=52 - i.e. an OutOfMemoryError
Refer
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala
On 19 September 2016 at 14:57,
Hi,
I am on EMR cluster and My cluster configuration is as below:
Number of nodes including master node - 3
Memory:22.50 GB
VCores Total : 16
Active Nodes : 2
Spark version- 1.6.1
Parameters set in spark-defaults.conf:
spark.executor.instances 2
> spark.executor.cores 8
>
appreciate the help.
Thanks,
Divya
On 13 September 2016 at 15:09, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Hi,
> Thanks all for your prompt response.
> I followed the instruction in the docs EMR SSH tunnel
> <https://docs.aws.amazon.com/ElasticMapReduce/latest/Ma
Hi,
Somehow, for the time being, I am unable to view the Spark Web UI and Hadoop Web UI.
I am looking for other ways I can check that my job is running fine, apart from
repeatedly checking the current YARN logs.
Thanks,
Divya
Hi,
I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2
When I try to view any of the cluster's web UIs, either Hadoop or
Spark, I get the below error:
"
This site can’t be reached
"
Has anybody using EMR been able to view the web UI?
Could you please share the steps?
Would really
Hi,
I am on Spark 1.6.1
I am getting the below error when I try to call a UDF on my Spark DataFrame
column.
UDF
/* get the train line */
val deriveLineFunc :(String => String) = (str:String) => {
val build_key = str.split(",").toList
val getValue = if(build_key.length > 1)
Hi,
Is it necessary to import sqlContext.implicits._ whenever we define and
call a UDF in Spark?
Thanks,
Divya
Hi,
I am using EMR 4.7 with Spark 1.6
Sometimes when I start the Spark shell I get the below error:
OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x0005662c, 10632822784, 0) failed; error='Cannot
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java
Spark Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Tue, Sep 6, 2016 at 4:45 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>&
Hi,
I am getting the below error if I try to use the properties-file parameter in
spark-submit:
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
at
Hi,
I am on EMR 4.7 with Spark 1.6.1
I am trying to read from s3n buckets in spark
Option 1 :
If I set up
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3.awsAccessKeyId",
On Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
> > Hi,
> >
> > Would like to know the difference between the --package and --jars
> option in
> > Spark .
> >
> >
> >
> > Thanks,
> > Divya
>
385981/how-to-access-s3a-files-from-apache-spark
Is it really the issue ?
Could somebody help me validate the above ?
Thanks,
Divya
On 1 September 2016 at 16:59, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 1 Sep 2016, at 03:45, Divya Gehlot <divya.htco...@gmail.com> wrote
Hi,
I would like to know the difference between the --packages and --jars options
in Spark.
Thanks,
Divya
Hi Saurabh,
I am also using a Spark 1.6+ version, and when I didn't create a HiveContext it
threw the same error.
So you have to create a HiveContext to access window functions.
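A hedged sketch of that setup for Spark 1.6 (the grouping and ordering columns are placeholders):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val hiveContext = new HiveContext(sc)            // window functions need HiveContext in 1.5/1.6
val w = Window.partitionBy("id").orderBy("date")
val ranked = df.withColumn("rank_in_group", rank().over(w))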
Thanks,
Divya
On 1 September 2016 at 13:16, saurabh3d wrote:
> Hi All,
>
> As per SPARK-11001
Which Java version are you using?
On 31 August 2016 at 04:30, Diwakar Dhanuskodi wrote:
> Hi,
>
> While building Spark 1.6.2 , getting below error in spark-sql. Much
> appreciate for any help.
>
> ERROR] missing or invalid dependency detected while loading class
Hi,
I am using Spark 1.6.1 on an EMR machine.
I am trying to read S3 buckets in my Spark job.
When I read them through the Spark shell I am able to, but when I try to
package the job and run it with spark-submit I get the below error:
16/08/31 07:36:38 INFO ApplicationMaster: Registered signal
Can you please check the column order of all the datasets in the unionAll operations?
Are the columns in the same order?
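A hedged sketch of aligning the column order before the union (Spark 1.x, where unionAll matches columns by position rather than by name; df1 and df2 are placeholders):

import org.apache.spark.sql.functions.col

val ordered = df1.columns.map(col)            // take df1's column order as the reference
val combined = df1.unionAll(df2.select(ordered: _*))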
On 9 August 2016 at 02:47, max square wrote:
> Hey guys,
>
> I'm trying to save Dataframe in CSV format after performing unionAll
> operations on it.
> But I get this
Hi,
I have a column with values like:
Value
30
12
56
23
12
16
12
89
12
5
6
4
8
I need to create another column:
if col("value") > 30 1 else col("value") < 30
newColValue
0
1
0
1
2
3
4
0
1
2
3
4
5
How can I create an incrementing column like this?
The grouping is happening based on some other cols.
; https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. An
Hi,
I have a use case where I need to use the or (||) operator in a filter condition.
It seems it's not working: it's taking the condition before the operator and
ignoring the other filter condition after the or operator.
Has anybody faced a similar issue?
Pseudo code:
On Fri, Aug 5, 2016 at 12:16 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Hi,
>> I am working with Spark 1.6 with scala and using Dataframe API .
>> I have a use case where I need to compare two rows and add entry in the
>> new column based on the lo
Hi,
I am working with Spark 1.6 with Scala and using the DataFrame API.
I have a use case where I need to compare two rows and add an entry in a
new column based on a lookup table.
For example, my DF looks like:
col1      col2      newCol1
street1   person1
street2   person1
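One possible hedged sketch (Spark 1.5/1.6, where window functions require a HiveContext; lookupDF, its columns, and the ordering column are assumptions): bring the previous row alongside the current one with lag(), then join against the lookup table.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val w = Window.orderBy("col1")                             // ordering column is an assumption
val withPrev = df.withColumn("prev_col2", lag("col2", 1).over(w))

val result = withPrev.join(
  lookupDF,
  withPrev("prev_col2") === lookupDF("prev_value") &&
    withPrev("col2") === lookupDF("curr_value"),
  "left_outer")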
Hi,
Has anybody worked with GraphFrames?
Please let me know, as I need to know the real-world scenarios where it can be
used.
Thanks,
Divya
eUtil.html#fullyDelete(java.io.File)
>
> On Tue, Jul 26, 2016 at 12:09 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Resending to right list
>> ------ Forwarded message --
>> From: "Divya Gehlot" <divya.htco...@gmail.c
Hi,
When I am using the FileUtil.copyMerge function:
val file = "/tmp/primaryTypes.csv"
FileUtil.fullyDelete(new File(file))
val destinationFile= "/tmp/singlePrimaryTypes.csv"
FileUtil.fullyDelete(new File(destinationFile))
val counts = partitions.
reduceByKey {case (x,y) => x +
Hi,
I am getting the below error when I try to save a dataframe using spark-csv:
>
> final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path)
java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at
>
Hi,
Can somebody help me with creating a DataFrame column from a Scala list?
Would really appreciate the help.
Thanks ,
Divya
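A minimal sketch of one way to do it (Spark 1.x, assuming sqlContext.implicits are in scope; names are placeholders):

import sqlContext.implicits._

val values = List("a", "b", "c")
// wrap each element in a Tuple1 so the toDF conversion applies
val listDF = values.map(Tuple1(_)).toDF("my_col")
listDF.show()

If the goal is to attach the list to an existing DataFrame, you would typically join the two on an index column instead.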
Hi,
I need to add 8 hours to the from_unixtime result:
df.withColumn("date_time", from_unixtime(col("unix_timestamp"), fmt))
I am trying a Joda-Time function:
def unixToDateTime (unix_timestamp : String) : DateTime = {
val utcTS = new DateTime(unix_timestamp.toLong * 1000L)+ 8.hours
return utcTS
}
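A hedged alternative that stays inside the DataFrame API: shift the epoch seconds by 8 hours before formatting (fmt is assumed to be the format string from the snippet above):

import org.apache.spark.sql.functions.{col, from_unixtime}

val withDate = df.withColumn(
  "date_time",
  from_unixtime(col("unix_timestamp") + 8 * 3600, fmt))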
I have a dataset of time as shown below :
Time1
07:30:23
07:34:34
07:38:23
07:39:12
07:45:20
I need to find the diff between two consecutive rows.
I googled and found that the *lag* function in *Spark* helps in finding it,
but it's giving me *null* in the result set.
Would really appreciate the help.
Hi,
val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') -
lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table");
Instead of the time difference in seconds I am getting null.
Would really appreciate the help.
Thanks,
Divya
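For reference, a hedged rewrite of that query: lag() needs an OVER clause with an ordering, and the string needs the s-interpolator for $timeFmt to expand (window functions also require a HiveContext on Spark 1.5/1.6); the ordering column is an assumption:

val lags = sqlContext.sql(s"""
  SELECT *,
         unix_timestamp(time1, '$timeFmt')
           - lag(unix_timestamp(time1, '$timeFmt')) OVER (ORDER BY time1) AS time_diff
  FROM df_table""")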
2 AM
> To: Rabin Banerjee <dev.rabin.baner...@gmail.com>
> Cc: Divya Gehlot <divya.htco...@gmail.com>, "user @spark" <
> user@spark.apache.org>
> Subject: Re: write and call UDF in spark dataframe
>
> Hi Divya,
>
> There is already "from_unixtime"
Hi,
I have a dataset of time as shown below :
Time1
07:30:23
07:34:34
07:38:23
07:39:12
07:45:20
I need to find the diff between two consecutive rows.
I googled and found that the *lag* function in *Spark* helps in finding it,
but it's giving me *null* in the result set.
Would really appreciate
Hi,
Could somebody share an example of writing and calling a UDF which converts a Unix
timestamp to a date-time?
Thanks,
Divya
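A hedged sketch of such a UDF (Spark 1.x style; the column name and format are placeholders, and the built-in from_unixtime function covers most of this already):

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.functions.{col, udf}

val unixToDateTime = udf { (ts: Long) =>
  new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts * 1000L))
}

val withDateTime = df.withColumn("date_time", unixToDateTime(col("unix_timestamp")))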
Thanks,
Divya
On 18 July 2016 at 23:06, Jacek Laskowski <ja...@japila.pl> wrote:
> See broadcast variable.
>
> Or (just a thought) do join between DataFrames.
>
> Jacek
>
> On 18 Jul 2016 9:24 a.m., "Divya Gehlot" <divya.htco...@gmail.com> wrote:
&g
Hi,
I have created a map by reading a text file:
val keyValueMap = file_read.map(t => t.getString(0) ->
t.getString(4)).collect().toMap
Now I have another dataframe where I need to dynamically replace all
occurrences of the map's keys with their values.
val df_input = reading the file as dataframe
val df_replacekeys
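The actual definition is cut off above; a hedged sketch of one way to do the replacement, broadcasting the map and applying it through a UDF (the column name is an assumption):

import org.apache.spark.sql.functions.{col, udf}

val broadcastMap = sc.broadcast(keyValueMap)
val replaceKey = udf { (key: String) =>
  broadcastMap.value.getOrElse(key, key)     // keep the original value when there is no match
}

val replaced = df_input.withColumn("key_col", replaceKey(col("key_col")))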
Hi,
I have a huge data set similar to the below:
timestamp,fieldid,point_id
1468564189,89,1
1468564090,76,4
1468304090,89,9
1468304090,54,6
1468304090,54,4
Have configuration file of consecutive points --
1,9
4,6
i.e. 1 and 9 are consecutive points; similarly 4 and 6 are consecutive points.
Now I need
Can you try var df_join = df1.join(df2,df1( "Id") ===df2("Id"),
"fullouter").drop(df1("Id"))
On May 18, 2016 2:16 PM, "ram kumar" wrote:
I tried
scala> var df_join = df1.join(df2, "Id", "fullouter")
:27: error: type mismatch;
found : String("Id")
required:
Hi,
I am using Spark 1.5.2 with Apache Phoenix 4.4
As Spark 1.5.2 doesn't support subqueries in WHERE conditions
(https://issues.apache.org/jira/browse/SPARK-4226),
is there any alternative way to find foreign key constraints?
Would really appreciate the help.
Thanks,
Divya
Hi,
I would like to know the use cases where DataFrames are the best fit and the use
cases where Spark SQL is the best fit, based on one's experience.
Thanks,
Divya
Hi,
I just stumbled upon a data quality check package for Spark:
https://github.com/FRosner/drunken-data-quality
Has anybody used it?
Would really appreciate the feedback.
Thanks,
Divya
, Ted Yu <yuzhih...@gmail.com> wrote:
> I am afraid there is no such API.
>
> When persisting, you can specify StorageLevel :
>
> def persist(newLevel: StorageLevel): this.type = {
>
> Can you tell us your use case ?
>
> Thanks
>
> On Thu, May 5, 20
Hi,
How can I get and set the storage level for DataFrames like for RDDs,
as mentioned in the following book link:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
Thanks,
Divya
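As noted in the reply above, persist() is the way to set it; Spark 1.x has no public getter for a DataFrame's current level (later versions added a storageLevel method on Dataset). A minimal sketch:

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK_SER)   // pick the level explicitly
// ... actions that reuse df ...
df.unpersist()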
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
I am looking for something similar to the above solution.
-- Forwarded message --
From: "Divya Gehlot" <divya.htco...@gmail.com>
Date: May 5, 2016 6:51 PM
Subject:
Hi,
Is there any package or project in Spark/Scala which supports data quality
checks?
For instance, checking null values or foreign key constraints.
Would really appreciate it if somebody who has already done this is happy to share,
or knows of any open-source package.
Thanks,
Divya
Divya
On 4 May 2016 at 21:31, sunday2000 <2314476...@qq.com> wrote:
> Check your javac version, and update it.
>
>
> ------ Original Message ------
> *From:* "Divya Gehlot";<divya.htco...@gmail.com>;
> *Sent:* Wednesday, May 4, 2016, 11:25 AM
>
Hi,
I am also getting a similar error:
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
when I try to build the Phoenix project using Maven.
Maven version: 3.3
Java version: 1.7_67
Phoenix: downloaded latest master from GitHub
If anybody finds the resolution
Hi,
I am interested to know on which parameters we can say Spark DataFrames
are better than SQL queries.
Would be grateful if somebody could explain it to me with use cases.
Thanks,
Divya
more evenly.
>
> 2016-04-25 9:34 GMT+07:00 Divya Gehlot <divya.htco...@gmail.com>:
>
>> Hi,
>>
>> After joining two dataframes, saving dataframe using Spark CSV.
>> But all the result data is being written to only one part file whereas
>> there are 200 p
;a", "b")
>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"),
>> df2("b").as("2-b"))
>> val df4 = df3.join(df2,
Hi,
I am using Spark 1.5.2 and have defined the below UDF:
import org.apache.spark.sql.functions.udf
> val myUdf = (wgts : Int , amnt :Float) => {
> (wgts*amnt)/100.asInstanceOf[Float]
> }
>
val df2 = df1.withColumn("WEIGHTED_AMOUNT",callUDF(udfcalWghts,
FloatType,col("RATE"),col("AMOUNT")))
In my
Yes, you can remove the header by removing the first row.
You can use first() or head() to grab it and then filter it out (sketch below).
Thanks,
Divya
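A minimal sketch of the first()-based approach on an RDD of lines (the path is a placeholder; with spark-csv itself, the header option handles this for you):

val lines = sc.textFile("data.csv")
val header = lines.first()
val dataOnly = lines.filter(_ != header)     // drop every line equal to the header row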
On 27 April 2016 at 13:24, Ashutosh Kumar wrote:
> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it
d this is clear.
>
> Thought?
>
> // maropu
>
>
>
>
>
>
>
> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com>
> wrote:
>
>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>
>> Prasad.
>>
Hi,
I am using Spark 1.5.2.
I have a use case where I need to join the same dataframe twice on two
different columns.
I am getting a "missing columns" error.
For instance,
val df1 = df2.join(df3, "Column1")
The below throws the missing-columns error:
val df4 = df1.join(df3, "Column2")
Is this a bug or valid behaviour?
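A hedged sketch of the usual workaround, aliasing each use of df3 so the two joins stay unambiguous (column names follow the example above):

import org.apache.spark.sql.functions.col

val df1 = df2.as("l").join(df3.as("r1"), col("l.Column1") === col("r1.Column1"))
val df4 = df1.join(df3.as("r2"), col("l.Column2") === col("r2.Column2"))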
Hi,
After joining two dataframes, I am saving the dataframe using Spark CSV.
But all the result data is being written to only one part file, whereas
200 part files are being created; the remaining 199 part files are empty.
What is the cause of the uneven partitioning? How can I evenly distribute the
data?
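A hedged sketch of one way to spread the output, repartitioning before the write (the partition count and path are placeholders):

val balanced = joinedDF.repartition(50)
balanced.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/output/path")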
An easy way of doing it:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
On 22 April 2016 at 11:51, Naveen Kumar Pokala
wrote:
> Hi,
>
>
>
> Do we have any way to perform Row level operations in spark dataframes.
>
>
>
>
>
> For example,
>
>
>
> I have
Hi,
I am using Spark with a Hadoop 2.7 cluster.
I need to print all my print statements and/or any errors to a file, for
instance some info if it passed some level, or some error if something is missing
in my Spark Scala script.
Can somebody help me or point me to a tutorial, blog, or book?
What's the best way to
Hi,
I tried configuring logging to write to a file for the Spark driver and
executors.
I have two separate log4j properties files for the Spark driver and executors
respectively.
It's writing logs for the Spark driver, but for the executor logs I am getting the below
error:
java.io.FileNotFoundException:
Reposting, as I am unable to find the root cause of where things are going
wrong.
Experts, please help.
-- Forwarded message --
From: Divya Gehlot <divya.htco...@gmail.com>
Date: 15 April 2016 at 19:13
Subject: [Help]:Strange Issue :Debug Spark Dataframe code
To: "
Hi,
I am using Spark 1.5.2 with Scala 2.10.
Is there any other option apart from "explain(true)" to debug Spark
DataFrame code?
I am facing a strange issue.
I have a lookup dataframe and am using it to join another dataframe on different
columns.
I am getting an *AnalysisException* in the third join.
When
Hi,
I am using Spark 1.5.2 with Scala 2.10, and my Spark job keeps failing with
exit code 143,
except for one job where I am using unionAll and groupBy operations on multiple
columns.
Please advise me on the options to optimize it.
The one option which I am using now is
--conf
Hi,
I would like to know: does the Spark DataFrame API have a limit on the
number of columns that can be created?
Thanks,
Divya
Hi,
I have a Hortonworks Hadoop cluster with the below configuration:
Spark 1.5.2
HBASE 1.1.x
Phoenix 4.4
I am able to connect to Phoenix through a JDBC connection and able to read
the Phoenix tables.
But while writing the data back to a Phoenix table
I am getting the below error:
Forgot to mention:
I am using the DataFrame API for all the operations instead of SQL.
-- Forwarded message --
From: Divya Gehlot <divya.htco...@gmail.com>
Date: 1 April 2016 at 18:35
Subject: [Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Pheonix
both
To: "user @s
Hi,
I have a 3-node Hortonworks Hadoop cluster on EC2 with
*Hadoop *version 2.7.x
*Spark *version - 1.5.2
*Phoenix *version - 4.4
*Hbase *version 1.1.x
*Cluster Statistics*
Data Node 1
OS: redhat7 (x86_64), Cores (CPU): 2 (2), Disk: 20.69GB/99.99GB (20.69% used),
Memory: 7.39GB
Data
Hi,
The Spark setup is on a Hadoop cluster.
How can I set the Spark timezone to sync with the server timezone?
Any ideas?
Thanks,
Divya
Hi Jacek,
The difference is mentioned in the Spark doc itself:
Note that if you perform a self-join using this function without aliasing
the input [[DataFrame]]s, you will NOT be able to reference any columns after the
join, since there is no way to disambiguate which side of the join you
Hi,
I have a map collection.
I am trying to build a when condition based on the key values,
like df.withColumn("ID", when(<condition with map keys>, <values of map>)).
How can I do that dynamically?
Currently I am iterating over the keysIterator and getting the values:
val keys = myMap.keysIterator.toArray
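A hedged sketch of building the expression dynamically by folding over the map (the map contents and the matched column follow the next message below and are otherwise assumptions):

import org.apache.spark.sql.functions.{col, lit, when}

val myMap = Map(1 -> "1D", 2 -> "2D", 3 -> "3M", 4 -> "4W")

val idColumn = myMap.foldLeft(lit(null).cast("string")) {
  case (acc, (key, value)) => when(col("NoOfDays") === key, lit(value)).otherwise(acc)
}
val withId = df.withColumn("ID", idColumn)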
Hi,
I am using Spark 1.5.2.
My requirement is as below:
df.withColumn("NoOfDays",lit(datediff(df("Start_date"),df("end_date"
Now I have to add one more column where my datediff(Start_date, end_date)
should match with the map keys.
Map looks like MyMap(1->1D,2->2D,3->3M,4->4W)
I want to do
Oh my, I am so silly.
I can declare it as a string and cast it to a date.
My apologies for spamming the mailing list.
Thanks,
Divya
On 21 March 2016 at 14:51, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Hi,
> In Spark 1.5.2
> Do we have any utiility which converts a constant
Hi,
In Spark 1.5.2,
do we have any utility which converts a constant value as shown below,
or can we declare a date variable like val start_date: Date = "2015-03-02"
or val start_date = "2015-03-02".toDate,
like how we convert with toInt or toString?
I searched for it but couldn't find it.
Thanks,
Divya
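A hedged sketch of two options that were available around Spark 1.5.x (the literal date is just the example value from above):

import java.sql.Date
import org.apache.spark.sql.functions.{lit, to_date}

val start_date: Date = Date.valueOf("2015-03-02")   // plain JVM date value
val startDateCol = to_date(lit("2015-03-02"))        // the same constant as a DataFrame Column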
I have a time-stamping table which has data like:
No of Days    ID
1             1D
2             2D
and so on till 30 days.
I have another dataframe with
a start date and end date.
I need to get the difference between these two dates, get the ID from the
time-stamping table, and do a withColumn.
Hi,
I am adding a new column and renaming it at the same time, but the renaming
doesn't have any effect.
dffiltered =
>
Hi,
I am dynamically doing unionAll and adding a new column too:
val dfresult =
> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
> val schemaL = dfresult.schema
> var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL)
> for ((key,values) <- lcrMap)
Hi,
As I can't add columns from another DataFrame,
I am planning to convert my row columns to a map of keys and arrays.
As I am new to Scala and Spark,
I am trying it like below:
// create an empty map
import scala.collection.mutable.{ArrayBuffer => mArrayBuffer}
var map = Map[Int,mArrayBuffer[Any]]()
def
Hi,
Can somebody point out how I can configure custom logs for my Spark (Scala)
scripts,
so that I can see at which level my script failed and why?
Thanks,
Divya