Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1. I can contribute to it as well. On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage wrote: > +1 > > Thanks for proposing > > On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud > wrote: > >> Good idea. Will be useful >> >> +1 >> >> From: ashok34...@yahoo.com.INVALID >>

Re: Online classes for spark topics

2023-03-08 Thread Deepak Sharma
I can prepare some topics and present them as well, if we have a prioritised list of topics already. On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee wrote: > We used to run Spark webinars on the Apache Spark LinkedIn group > but >

Re: Spark Issue with Istio in Distributed Mode

2022-09-12 Thread Deepak Sharma
oy-v3-api-field-config-core-v3-httpprotocoloptions-idle-timeout > > > On Sat, Sep 3, 2022 at 4:23 AM Deepak Sharma > wrote: > >> Thanks for the reply, IIan. >> Can we set this in the spark conf, or does it need to go into the istio / envoy conf? >> >> >>

Re: Spark Issue with Istio in Distributed Mode

2022-09-03 Thread Deepak Sharma
at 12:17 AM Deepak Sharma > wrote: > >> Hi All, >> In one of our clusters, we enabled Istio where spark is running in >> distributed mode. >> Spark works fine when we run it with Istio in standalone mode. >> In spark distributed mode, we are seeing that every 1 hou

Spark Issue with Istio in Distributed Mode

2022-09-02 Thread Deepak Sharma
Hi All, In one of our clusters, we enabled Istio where spark is running in distributed mode. Spark works fine when we run it with Istio in standalone mode. In spark distributed mode, we are seeing that every hour or so the workers are getting disassociated from the master, and then the master is not able

Re: Will it lead to OOM error?

2022-06-22 Thread Deepak Sharma
It will spill to disk if everything can’t be loaded in memory. On Wed, 22 Jun 2022 at 5:58 PM, Sid wrote: > I have a 150TB CSV file. > > I have a total of 100 TB RAM and 100TB disk. So If I do something like this > > spark.read.option("header","true").csv(filepath).show(false) > > Will it
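
A minimal sketch of the spill-friendly route, assuming an explicit persist with a storage level that is allowed to fall back to disk (the path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val df = spark.read.option("header", "true").csv("/data/huge.csv")
    df.persist(StorageLevel.MEMORY_AND_DISK)  // blocks that don't fit in memory spill to disk
    df.show(false)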

Re: spark as data warehouse?

2022-03-25 Thread Deepak Sharma
It can be used as a warehouse, but then you have to keep long-running spark jobs. This is possible using cached data frames or datasets. Thanks Deepak On Sat, 26 Mar 2022 at 5:56 AM, wrote: > In the past time we have been using hive for building the data > warehouse. > Do you think if spark
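
A small sketch of the cached-table idea, assuming one long-running SparkSession and a hypothetical parquet path:

    spark.read.parquet("/warehouse/sales").createOrReplaceTempView("sales")
    spark.catalog.cacheTable("sales")  // keeps the table hot in memory for repeated queries
    spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()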

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Deepak Sharma
coalesce returns a new Dataset, and that will cause the recomputation. Thanks Deepak On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote: > I have some PySpark code like below. Basically, I persist a DataFrame > (which is time-consuming to compute) to disk, call the method > DataFrame.count to trigger
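
A sketch of the reuse pattern under discussion, with a stand-in for the time-consuming DataFrame:

    // stand-in for the expensive DataFrame from the question
    val df = spark.range(1000000).selectExpr("id", "id % 7 AS bucket").persist()
    df.count()                                // materializes the cache
    df.coalesce(1).write.parquet("/tmp/out")  // coalesce plans on top of the cached relation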

Re: Profiling spark application

2022-01-19 Thread Deepak Sharma
You can take a look at the JVM profiler that was open-sourced by Uber: https://github.com/uber-common/jvm-profiler On Thu, Jan 20, 2022 at 11:20 AM Prasad Bhalerao < prasadbhalerao1...@gmail.com> wrote: > Hi, > > It will require code changes and I am looking at some third party code , I > am

Re: Edge AI with Spark

2020-09-24 Thread Deepak Sharma
Near edge would work in this case. On-edge doesn't make much sense, especially for a distributed processing framework such as spark. On Thu, Sep 24, 2020 at 3:12 PM Gourav Sengupta wrote: > hi, > > its better to use lighter frameworks over edge. Some of the edge devices I > work on run at

Write to same hdfs dir from multiple spark jobs

2020-07-29 Thread Deepak Sharma
Hi Is there any design pattern around writing to the same hdfs directory from multiple spark jobs? -- Thanks Deepak www.bigdatabig.com

GroupBy issue while running K-Means - Dataframe

2020-06-16 Thread Deepak Sharma
Hi All, I have a custom implementation of K-Means where it needs the data to be grouped by a key in a dataframe. Now there is a big data skew for some of the keys, where it exceeds the limit: BufferHolder: Cannot grow BufferHolder by size 17112 because the size after growing exceeds size limitation

Re: On spam messages

2020-04-29 Thread Deepak Sharma
Much appreciated Sean. Thanks. On Wed, 29 Apr 2020 at 6:48 PM, Sean Owen wrote: > I am subscribed to this list to watch for a certain person's new > accounts, which are posting obviously off-topic and inappropriate > messages. It goes without saying that this is unacceptable and a CoC >

Unsubscribe

2020-04-29 Thread Deepak Sharma
-- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: OFFICIAL USA REPORT TODAY India Most Dangerous : USA Religious Freedom Report out TODAY

2020-04-29 Thread Deepak Sharma
account is hacked ? > > On Wed, Apr 29, 2020, 11:56 AM Zahid Amin wrote: > >> How can it be rumours ? >> Of course you want to suppress me. >> Suppress USA official Report out TODAY . >> >> > Sent: Wednesday, April 29, 2020 at 8:17 AM >> > Fr

Re: India Most Dangerous : USA Religious Freedom Report

2020-04-29 Thread Deepak Sharma
Can someone block this email? He is spreading rumours and spamming. On Wed, 29 Apr 2020 at 11:46 AM, Zahid Amin wrote: > USA report states that India is now the most dangerous country for Ethnic > Minorities. > > Remember Martin Luther King. > > >

unsubscribe

2019-12-07 Thread Deepak Sharma

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Deepak Sharma
Hi Sachit PGP encryption is not something that is inbuilt with spark. I would suggest writing a shell script that does the pgp encryption and using it in the spark scala program, so that it runs from the driver. Thanks Deepak On Mon, Aug 26, 2019 at 8:10 PM Sachit Murarka wrote: > Hi All, > > I want to
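
A rough sketch of that shell-out approach, assuming a gpg binary and an imported recipient key on the driver host (paths and recipient are hypothetical):

    import scala.sys.process._

    val in  = "/data/raw/part-00000.csv"
    val out = "/data/enc/part-00000.csv.gpg"
    // shell out to gpg from the driver, as suggested above
    val exit = Seq("gpg", "--batch", "--yes", "-r", "ops@example.com", "-o", out, "--encrypt", in).!
    require(exit == 0, s"gpg exited with code $exit")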

Re: A basic question

2019-06-17 Thread Deepak Sharma
You can follow this example: https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html On Mon, Jun 17, 2019 at 12:27 PM Shyam P wrote: > I am developing a spark job using java1.8v. > > Is it possible to write a spark app using spring-boot technology? > Did

Re: Read hdfs files in spark streaming

2019-06-10 Thread Deepak Sharma
Thanks All. I managed to get this working. Marking this thread as closed. On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma wrote: > This is the project requirement , where paths are being streamed in kafka > topic. > Seems it's not possible using spark structured streaming. > > &

Re: Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
, > Vaquar khan > > On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote: > >> I am using spark streaming application to read from kafka. >> The value coming from kafka message is path to hdfs file. >> I am using spark 2.x , spark.read.stream. >> What is the best way

Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
I am using a spark streaming application to read from kafka. The value coming in each kafka message is a path to an hdfs file. I am using spark 2.x, spark.readStream. What is the best way to read this path in spark streaming and then read the json stored at the hdfs path, may be using spark.read.json,

Re: dynamic allocation in spark-shell

2019-05-31 Thread Deepak Sharma
You can start spark-shell with these properties: --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.initialExecutors=2 --conf spark.dynamicAllocation.minExecutors=2 --conf spark.dynamicAllocation.maxExecutors=5 On Fri, May 31, 2019 at 5:30 AM Qian He wrote: > Sometimes
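
For reference, a minimal sketch of the same settings applied programmatically; dynamic allocation on YARN typically also needs the external shuffle service, which is an assumption here:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-allocation-demo")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.initialExecutors", "2")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "5")
      .config("spark.shuffle.service.enabled", "true")  // assumed prerequisite on YARN
      .getOrCreate()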

Re: Getting EOFFileException while reading from sequence file in spark

2019-04-29 Thread Deepak Sharma
This can happen if the file size is 0. On Mon, Apr 29, 2019 at 2:28 PM Prateek Rajput wrote: > Hi guys, > I am getting this strange error again and again while reading from a > sequence file in spark. > User class threw exception: org.apache.spark.SparkException: Job aborted. > at >

Spark streaming filling the disk with logs

2019-02-13 Thread Deepak Sharma
Hi All I am running a spark streaming job with the below configuration: --conf "spark.executor.extraJavaOptions=-Droot.logger=WARN,console" But it’s still filling the disk with info logs. If the logging level is set to WARN at the cluster level, then only the WARN logs are getting written, but then it

Error while upserting ElasticSearch from Spark 2.2

2018-10-08 Thread Deepak Sharma
Hi All, I am facing this weird issue while upserting ElasticSearch using a Spark Data Frame. org.elasticsearch.hadoop.rest.EsHadoopRemoteException: version_conflict_engine_exception: After it fails, if rerun 2-3 times, it finally succeeds. I wanted to check if anyone has faced this issue and

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-05 Thread Deepak Sharma
Try this: import spark.implicits._ and then df.toDF() On Wed, Sep 5, 2018 at 2:31 PM Mich Talebzadeh wrote: > With the following > > case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: > Float) > > var key = line._2.split(',').view(0).toString > var ticker =
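
A compact sketch of the fix, using the case class from the quoted message and made-up sample values:

    case class Columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Float)

    val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
    import spark.implicits._   // brings the toDF syntax into scope

    val df = Seq(Columns("k1", "IBM", "2018-09-05 10:00:00", 143.2f)).toDF()
    df.show()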

java.lang.IndexOutOfBoundsException: len is negative - when data size increases

2018-08-16 Thread Deepak Sharma
Hi All, I am running spark-based ETL on spark 1.6 and facing this weird issue. The same code with the same properties/configuration runs fine in the other environment, e.g. PROD, but never completes in CAT. The only change would be the size of the data it is processing, and that only by 1-2 GB. This is the

Re: Big data visualization

2018-05-27 Thread Deepak Sharma
Yes Amin, Spark is primarily being used for ETL. Once you transform, you can store the data in any nosql DB that supports the use case. The BI dashboard app can then connect to the nosql DB for reports and visualization. HTH Deepak. On Mon, May 28, 2018, 05:47 amin mohebbi

Re: Help Required - Unable to run spark-submit on YARN client mode

2018-05-08 Thread Deepak Sharma
Can you try increasing the number of partitions for the base RDD/dataframe that you are working on? On Tue, May 8, 2018 at 5:05 PM, Debabrata Ghosh wrote: > Hi Everyone, > I have been trying to run spark-shell in YARN client mode, but am getting > lot of ClosedChannelException

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Yes Nicolas. It would be of great help if you can push the code to github and share the URL. Thanks Deepak On Mon, Apr 23, 2018, 23:00 unk1102 wrote: > Hi Nicolas thanks much for guidance it was very useful information if you > can > push that code to github and share url it would

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Is there any open source code base to refer to for this kind of use case? Thanks Deepak On Mon, Apr 23, 2018, 22:13 Nicolas Paris wrote: > Hi > > Problem is number of files on hadoop; > > > I deal with 50M pdf files. What I did is to put them in an avro table on > hdfs, >

Merge query using spark sql

2018-04-02 Thread Deepak Sharma
I am using spark to run merge queries in postgres sql. The way it's being done now: save the data to be merged into postgres as temp tables, then run the merge queries in postgres using a java sql connection and statement. So basically the query runs in postgres. The queries are insert into source
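
A sketch of that staging-then-merge flow; the connection details, table names, and upsert SQL are hypothetical:

    import java.sql.DriverManager
    import org.apache.spark.sql.SaveMode

    val url = "jdbc:postgresql://dbhost:5432/appdb"
    val props = new java.util.Properties()
    props.setProperty("user", "etl")
    props.setProperty("password", "secret")

    // 1) stage the rows to merge as a table in postgres
    val df = spark.range(100).selectExpr("id", "id * 2 AS val")  // stand-in for the real data
    df.write.mode(SaveMode.Overwrite).jdbc(url, "staging_table", props)

    // 2) run the merge inside postgres over a plain jdbc connection
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.createStatement().executeUpdate(
        "INSERT INTO target (id, val) SELECT id, val FROM staging_table " +
        "ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val")
    } finally conn.close()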

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Deepak Sharma
The other approach would be to write to a temp table and then merge the data. But this may be an expensive solution. Thanks Deepak On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy wrote: > Hi, > > I am trying to read data from Hive as DataFrame, then trying to write the > DF into

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Deepak Sharma
I would suggest repartitioning it to a reasonable number of partitions, maybe 500, and saving it to some intermediate working directory. Finally, read all the files from this working dir, then coalesce to 1 and save to the final location. Thanks Deepak On Fri, Mar 9, 2018, 20:12 Vadim Semenov
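
A sketch of that two-step write, with hypothetical paths:

    // step 1: write with a healthy partition count
    df.repartition(500).write.parquet("/tmp/intermediate")

    // step 2: re-read and collapse to a single output file
    spark.read.parquet("/tmp/intermediate")
      .coalesce(1)
      .write.parquet("/data/final")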

Re: HBase connector does not read ZK configuration from Spark session

2018-02-23 Thread Deepak Sharma
Hi Dharmin With the 1st approach, you will have to read the properties file shipped via --files using the below: SparkFiles.get("file.txt") Or else, you can copy the file to hdfs, read it using sc.textFile, and use the property within it. If you add files using --files, it gets copied to the executor's

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
che.hadoop.util.RunJar.run(RunJar.java:221) > at org.apache.hadoop.util.RunJar.main(RunJar.java:136) > Feb 11, 2018 3:14:06 AM WARNING: parquet.hadoop.ParquetRecordReader: Can > not initialize counter due to context is not a instance of > TaskInputOutputContext, but is org.apache.hadoop.mapred

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
There was a typo: Instead of: alter table mine set locations "hdfs://localhost:8020/user/hive/warehouse/mine"; Use: alter table mine set location "hdfs://localhost:8020/user/hive/warehouse/mine"; On Sun, Feb 11, 2018 at 1:38 PM, Deepak Sharma <deepakmc...@g

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
4\",\" >> scale\":2}},{\"name\":\"tiv_2015\",\"type\":\"decimal(10, >> 2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_ >> 2015\",\"scale\":2}},{\"name\":\"eq

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
etadata\":{\"name\":\"hu_ > site_deductible\",\"scale\":0}},{\"name\":\"fl_site_ > deductible\",\"type\":\"integer\",\"nullable\":true,\" > metadata\":{\"name\":\"fl_site_deductible\"

Re: Spark Dataframe and HIVE

2018-02-10 Thread Deepak Sharma
In the hive cli: msck repair table <table_name>; Thanks Deepak On Feb 11, 2018 11:14, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote: > NO, can you please explain the command ? Let me try now. > > Best, > > On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma

Re: Spark Dataframe and HIVE

2018-02-10 Thread Deepak Sharma
I am not sure about the exact issue, but I see you are partitioning while writing from spark. Did you try msck repair on the table before reading it in hive? Thanks Deepak On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" wrote: > All, > > Thanks for the inputs.

CI/CD for spark and scala

2018-01-24 Thread Deepak Sharma
Hi All, I just wanted to check if there are any best practices around using CI/CD for spark / scala projects running on AWS hadoop clusters. If there are any specific tools, please do let me know. -- Thanks Deepak

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Deepak Sharma
I am not sure about java, but in scala it would be something like df.rdd.map{ x => MyClass(x.getString(0), ...)} HTH --Deepak On Dec 19, 2017 09:25, "Sunitha Chennareddy" wrote: Hi All, I am new to Spark, I want to convert DataFrame to List with out using
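
A fuller sketch of that idea; since the original ask was to avoid collect, toLocalIterator streams one partition at a time to the driver. MyClass and the column positions are illustrative:

    case class MyClass(id: String, name: String)

    val it = df.rdd
      .map(r => MyClass(r.getString(0), r.getString(1)))
      .toLocalIterator                 // pulls partitions to the driver lazily
    it.foreach(println)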

Re: Spark based Data Warehouse

2017-11-13 Thread Deepak Sharma
os for your >> end users; but it sounds like you’ll be using it for exploratory analysis. >> Spark is great for this ☺ >> >> -Pat >> >> From: Vadim Semenov <vadim.seme...@datadoghq.com> >> Date: Su

Re: Spark based Data Warehouse

2017-11-11 Thread Deepak Sharma
I am looking for a similar solution, more aligned to the data scientist group. The concern I have is about supporting complex aggregations at runtime. Thanks Deepak On Nov 12, 2017 12:51, "ashish rawat" wrote: > Hello Everyone, > > I was trying to understand if anyone here has

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Deepak Sharma
I guess the issue is that spark.default.parallelism is ignored when you are working with Data frames. It is supposed to work only with raw RDDs. Thanks Deepak On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda < noo...@noorul.com> wrote: > Hi all, > > I have the following spark
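
As an aside not stated in the thread: for DataFrame and SQL shuffles the partition count is governed by spark.sql.shuffle.partitions, a one-line sketch:

    spark.conf.set("spark.sql.shuffle.partitions", "8")  // applies to DataFrame/SQL shuffle stages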

Re: What is the equivalent of forearchRDD in DataFrames?

2017-10-26 Thread Deepak Sharma
df.rdd.foreach Thanks Deepak On Oct 26, 2017 18:07, "Noorul Islam Kamal Malmiyoda" wrote: > Hi all, > > I have a Dataframe with 1000 records. I want to split them into 100 > each and post to rest API. > > If it was RDD, I could use something like this > >
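
A sketch of the partition-wise batching the question calls for (100 records per REST call); postBatch is a hypothetical stand-in for the real client:

    def postBatch(batch: Seq[org.apache.spark.sql.Row]): Unit =
      println(s"POST ${batch.size} records")   // stub for the real REST call

    df.rdd.foreachPartition { rows =>
      rows.grouped(100).foreach(postBatch)     // Iterator.grouped yields batches of up to 100
    }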

Re: Write to HDFS

2017-10-20 Thread Deepak Sharma
Better to use coalesce instead of repartition. On Fri, Oct 20, 2017 at 9:47 PM, Marco Mistroni wrote: > Use counts.repartition(1).save.. > Hth > > > On Oct 20, 2017 3:01 PM, "Uğur Sopaoğlu" wrote: > > Actually, when I run following code, > > val

Re: How can i split dataset to multi dataset

2017-08-06 Thread Deepak Sharma
This can be mapped as below: dataset.map(x => ((x(0), x(1), x(2)), x)) This works with a Dataframe of rows, but I haven't tried it with a dataset. Thanks Deepak On Mon, Aug 7, 2017 at 8:21 AM, Jone Zhang wrote: > val schema = StructType( > Seq( > StructField("app",

Spark ES Connector -- AWS Managed ElasticSearch Services

2017-08-01 Thread Deepak Sharma
I am trying to connect to the AWS managed ES service using the Spark ES Connector, but am not able to. I am passing es.nodes and es.port along with es.nodes.wan.only set to true. But it fails with the below error: 34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x failed to respond); no

Hive Context and SQL Context interoperability

2017-04-13 Thread Deepak Sharma
Hi All, I have registered temp tables using the hive context and the sql context both. Now when I try to join these 2 temp tables, one of the tables is reported as not found. Is there any setting or option so the tables in these 2 different contexts are visible to each other? -- Thanks Deepak

Re: Check if dataframe is empty

2017-03-07 Thread Deepak Sharma
On Tue, Mar 7, 2017 at 2:37 PM, Nick Pentreath wrote: > df.take(1).isEmpty should work My bad. It will return empty array: emptydf.take(1) res0: Array[org.apache.spark.sql.Row] = Array() and applying isEmpty would return boolean emptydf.take(1).isEmpty res2:

Re: Check if dataframe is empty

2017-03-06 Thread Deepak Sharma
If the df is empty, the .take would return java.util.NoSuchElementException. This can be done as below: df.rdd.isEmpty On Tue, Mar 7, 2017 at 9:33 AM, wrote: > Dataframe.take(1) is faster. > > > > *From:* ashaita...@nz.imshealth.com

Re: how to compare two avro format hive tables

2017-01-30 Thread Deepak Sharma
You can use spark-testing-base's RDD comparators. Create 2 different dataframes from these 2 hive tables. Convert them to RDDs and use spark-testing-base's compareRDD. Here is an example of RDD comparison: https://github.com/holdenk/spark-testing-base/wiki/RDDComparisons On Mon, Jan 30, 2017 at

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Deepak Sharma
The better way is to read the data directly into spark using spark sql's read jdbc. Apply the udf's locally. Then save the data frame back to Oracle using the dataframe's write jdbc. Thanks Deepak On Jan 29, 2017 7:15 PM, "Jörn Franke" wrote: > One alternative could be the
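
A sketch of that round trip; the Oracle URL, credentials, table names, and the upper-casing stand-in for the real UDFs are all hypothetical:

    import org.apache.spark.sql.functions._

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new java.util.Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    val src = spark.read.jdbc(url, "source_table", props)      // read from Oracle
    val out = src.withColumn("name", upper(col("name")))       // apply the transformations/UDFs in Spark
    out.write.mode("append").jdbc(url, "target_table", props)  // write back to Oracle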

Examples in graphx

2017-01-29 Thread Deepak Sharma
Hi There, Are there any examples of using GraphX along with any graph DB? I am looking to persist the graph in a graph-based DB and then read it back in spark and process it using graphx. -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-24 Thread Deepak Sharma
Can you try writing the UDF directly in spark and registering it with the spark sql or hive context? Or do you want to reuse the existing UDF jar for hive in spark? Thanks Deepak On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote: > Hi Team, > > I am trying to keep below code in

Re: anyone from bangalore wants to work on spark projects along with me

2017-01-19 Thread Deepak Sharma
Yes. I will be there before 4 PM. What's your contact number? Thanks Deepak On Thu, Jan 19, 2017 at 2:38 PM, Sirisha Cheruvu wrote: > Are we meeting today?! > > On Jan 18, 2017 8:32 AM, "Sirisha Cheruvu" wrote: > >> Hi , >> >> Just thought of keeping my

Re: Spark ANSI SQL Support

2017-01-17 Thread Deepak Sharma
From the spark documentation page: Spark SQL can now run all 99 TPC-DS queries. On Jan 18, 2017 9:39 AM, "Rishabh Bhardwaj" wrote: > Hi All, > > Does Spark 2.0 Sql support full ANSI SQL query standards? > > Thanks, > Rishabh. >

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
Did you try this with spark-shell? Please try this. $spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar On the spark shell: val hc = new org.apache.spark.sql.hive.HiveContext(sc) ; hc.sql("create temporary function nexr_nvl2 as 'com.nexr.platform.hive.udf.GenericUDFNVL2'");

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
On the sqlContext or hiveSqlContext, you can register the function as a udf like below: hiveSqlContext.udf.register("func_name", func(_: String)) Thanks Deepak On Wed, Jan 18, 2017 at 8:45 AM, Sirisha Cheruvu wrote: > Hey > > Can you send me the source code of hive java udf which
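
A short end-to-end sketch of registering and calling such a UDF (HiveContext style to match the thread; the function name and logic are illustrative):

    val nvl2: (String, String, String) => String =
      (s, ifNotNull, ifNull) => if (s != null) ifNotNull else ifNull

    hiveSqlContext.udf.register("my_nvl2", nvl2)
    hiveSqlContext.sql("SELECT my_nvl2(col1, 'has value', 'was null') FROM some_table").show()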

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
On 27 December 2016 at 10:30, Deepak Sharma <deepakmc...@gmail.com> wrote: >> It works for me with spark 1.6 (--jars) >> Please tr

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
Hi Mich You can copy the jar to a shared location and use the --jars command line argument of spark-submit. Whoever needs access to this jar can refer to the shared path and access it using the --jars argument. Thanks Deepak On Tue, Dec 27, 2016 at 3:03 PM, Mich Talebzadeh

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source into a data frame, then iterate over all rows with map and use something like below: df.map(x => x(0).toString().toDouble) Thanks Deepak On Tue, Dec 20, 2016 at 3:05 PM, big data wrote: > our source data are string-based data, like this: > col1
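
An equivalent column-level sketch in the same spirit, assuming col1 really holds numeric strings:

    import org.apache.spark.sql.functions._
    val numeric = df.withColumn("col1", col("col1").cast("double"))  // string -> double for MLlib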

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
On Sun, Dec 18, 2016 at 2:26 AM, vaquar khan wrote: > select * from indexInfo; > Hi Vaquar I do not see a CF with the name indexInfo in any of the cassandra databases. Thanks Deepak -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
There are 8 worker nodes in the cluster . Thanks Deepak On Dec 18, 2016 2:15 AM, "Holden Karau" <hol...@pigscanfly.ca> wrote: > How many workers are in the cluster? > > On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma <deepakmc...@gmail.com> > wrote: > &

foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
Hi All, I am iterating over a data frame's partitions using df.foreachPartition. For each row, I am initializing a DAO to insert the row into cassandra. Each of these iterations takes almost a minute and a half to finish. In my workflow, this is part of an action, and 100 partitions are
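
For context, the usual shape of this pattern builds the DAO once per partition rather than once per row; a hedged sketch, with a stubbed stand-in for the real DAO:

    class CassandraDao {                              // hypothetical stand-in
      def insert(row: org.apache.spark.sql.Row): Unit = ()
      def close(): Unit = ()
    }

    df.foreachPartition { rows =>
      val dao = new CassandraDao()                    // one DAO per partition, not per row
      rows.foreach(dao.insert)
      dao.close()
    }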

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
2016 at 1:49 PM, Deepak Sharma <deepakmc...@gmail.com> wrote: > This is the correct way to do it. The timestamp that you mentioned was not > correct: > > scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd") > ts1: org.apache.spark.sql.Column =

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
|3bc61951-0f49-43b...|1477983725292|2016-11-01| |688acc61-753f-4a3...|1479899459947|2016-11-23| |5ff1eb6c-14ec-471...|1479901374026|2016-11-23| Thanks Deepak On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma <deepakmc...@gmail.com> wrote: >

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is how you can do it in scala: scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd") ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd) scala> val finaldf = df.withColumn("ts1",ts1) finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string, ts1: string] scala>
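
Pulled together as one runnable sketch, dividing by 1000 because from_unixtime expects seconds (as the follow-up in this thread points out):

    import org.apache.spark.sql.functions._
    val finaldf = df.withColumn("ts1", from_unixtime(col("ts") / 1000, "yyyy-MM-dd"))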

Re: Spark 2.0.2 , using DStreams in Spark Streaming . How do I create SQLContext? Please help

2016-11-30 Thread Deepak Sharma
In Spark 2.0, the spark session was introduced, and you can use it to query hive as well. Just make sure you create the spark session with the enableHiveSupport() option. Thanks Deepak On Thu, Dec 1, 2016 at 12:27 PM, shyla deshpande wrote: > I am Spark 2.0.2 , using DStreams
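
A minimal sketch (the table name is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-from-streaming")
      .enableHiveSupport()        // required for querying hive tables
      .getOrCreate()
    spark.sql("SELECT COUNT(*) FROM mydb.events").show()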

Re: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-16 Thread Deepak Sharma
Can you try caching the individual dataframes and then union them? It may save you time. Thanks Deepak On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V wrote: > Hi all, > > I have 4 data frames with three columns, > > client_id,product_id,interest > > I want to combine these 4
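
A one-liner sketch, assuming four DataFrames df1..df4 with identical schemas (unionAll on Spark 1.6; plain union on 2.x):

    val combined = Seq(df1, df2, df3, df4).map(_.cache()).reduce(_ unionAll _)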

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
This is a waste of money, I guess. On Nov 11, 2016 22:41, "Mich Talebzadeh" wrote: > starts at $4,000 per node per year all inclusive. > > With discount it can be halved but we are talking a node itself so if you > have 5 nodes in primary and 5 nodes in DR we are talking

Re: Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Deepak Sharma
Hi Rohit You can use accumulators and increment one on every record processed. At the end, you can get the value of the accumulator on the driver, which will give you the count. HTH Deepak On Nov 5, 2016 20:09, "Rohit Verma" wrote: > I am using spark to read from database and
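
A sketch with the Spark 2.x accumulator API (the per-row processing is elided):

    val processed = spark.sparkContext.longAccumulator("processedRecords")
    df.foreach { row =>
      // ... write the row to hdfs here ...
      processed.add(1)
    }
    println(s"processed ${processed.value} records")   // read on the driver after the action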

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
One of the current best from what I've worked with is >> Citus. >> >> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com> >> wrote: >> > Hi Cody >> > Spark direct stream is just fine for this use case. >> > But why post

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
i Akhtar <ali.rac...@gmail.com> wrote: > > Is there an advantage to that vs directly consuming from Kafka? Nothing > is > > being done to the data except some light ETL and then storing it in > > Cassandra > > > > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma &l

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow? If it's really high, definitely spark will be of great use. Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and

Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDDs first with more information and then map them to some case class, if you are using scala. You can then use the play api's classes (play.api.libs.json.Writes / play.api.libs.json.Json) to convert the mapped case class to json. Thanks Deepak On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog
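
A sketch of that, assuming play-json on the classpath; the input pairs and the enrichment field are illustrative:

    import play.api.libs.json.{Json, Writes}

    case class Event(id: String, value: Double, source: String)
    implicit val eventWrites: Writes[Event] = Json.writes[Event]

    val rdd = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))   // stand-in input
    val jsonRdd = rdd
      .map { case (id, v) => Event(id, v, "feed-a") }       // enrich with extra information
      .map(e => Json.toJson(e).toString)                    // case class -> json string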

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama To me it looks like an issue with the SPN with which you are trying to connect to hive2, i.e. hive@hostname. Are you able to connect to hive from spark-shell? Try getting the ticket using any other user's keytab (not the hadoop services keytab) and then try running the spark submit. Thanks

Re: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Deepak Sharma
I am not sure about EMR, but it seems multi-tenancy is not enabled in your case. Multi-tenancy means the applications have to be submitted to different queues. Thanks Deepak On Wed, Sep 14, 2016 at 11:37 AM, Divya Gehlot wrote: > Hi, > > I am on EMR cluster and My

Re: Ways to check Spark submit running

2016-09-13 Thread Deepak Sharma
Use yarn-client mode and you can see the logs on the console after you submit. On Tue, Sep 13, 2016 at 11:47 AM, Divya Gehlot wrote: > Hi, > > Some how for time being I am unable to view Spark Web UI and Hadoop Web > UI. > Looking for other ways ,I can check my job is

Re: Assign values to existing column in SparkR

2016-09-09 Thread Deepak Sharma
Data frames are immutable in nature, so I don't think you can directly assign or change values in a column. Thanks Deepak On Fri, Sep 9, 2016 at 10:59 PM, xingye wrote: > I have some questions about assign values to a spark dataframe. I want to > assign values to an

Re: Calling udf in Spark

2016-09-08 Thread Deepak Sharma
No, it's not required for UDFs. It's required when you convert from an rdd to a df. Thanks Deepak On 8 Sep 2016 2:25 pm, "Divya Gehlot" wrote: > Hi, > > Is it necessary to import sqlContext.implicits._ whenever define and > call UDF in Spark. > > > Thanks, > Divya > > >

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
Is it possible to execute any query using SQLContext even if the DB is secured using roles or tools such as Sentry? Thanks Deepak On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan wrote: > Hi All, > > In our YARN cluster, we have setup spark 1.6.1 , we plan to give

Re: Spark 2.0 - Join statement compile error

2016-08-23 Thread Deepak Sharma
On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma <deepakmc...@gmail.com> wrote: > val df = sales_demand.join(product_master, sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"), "inner") Ignore

Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Deepak Sharma
Hi Subhajit Try this in your join: val df = sales_demand.join(product_master, sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"), "inner") On Tue, Aug 23, 2016 at 2:30 AM, Subhajit Purkayastha wrote: > All, > > > > I

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread DEEPAK SHARMA
the slides say that the default number of partitions is 2, however it's 1 (looking at the output of toDebugString). Appreciate any help. Thanks Deepak Sharma

Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi If anyone is using or knows about a github repo that can help me get started with image and video processing using spark, please share. The images/videos will be stored in s3, and I am planning to use s3 with Spark. In this case, how will spark achieve distributed processing? Any code base or references is

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Deepak Sharma
Can you please post the code snippet and the error you are getting? -Deepak On 9 Aug 2016 12:18 am, "manish jaiswal" wrote: > Hi, > > I am not able to read data from hive transactional table using sparksql. > (i don't want read via hive jdbc) > > > > Please help. >

Re: Spark join and large temp files

2016-08-08 Thread Deepak Sharma
Register your dataframes as temp tables and then try the join on the temp tables. This should resolve your issue. Thanks Deepak On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab wrote: > Hello, > We have two parquet inputs of the following form: > > a: id:String, Name:String (1.5TB)
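
A sketch of the temp-table route (Spark 1.x style to match the thread; a and b are the two DataFrames from the question, and only a's schema is visible in the quoted message):

    a.registerTempTable("a")
    b.registerTempTable("b")
    val joined = sqlContext.sql("SELECT a.id, a.Name FROM a JOIN b ON a.id = b.id")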

Re: Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
rote: > I found following links are good as I am using same. > > http://spark.apache.org/docs/latest/tuning.html > > https://spark-summit.org/2014/testing-spark-best-practices/ > > Regards, > Vaquar khan > > On 8 Aug 2016 10:11, "Deepak Sharma" <deepakmc.

Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
Hi All, Can anyone please share any documents that may be out there around spark-scala best practices? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Deepak Sharma
Hi Devi Please make sure the jdbc jar is in the spark classpath. With spark-submit, you can use the --jars option to specify the sql server jdbc jar. Thanks Deepak On Mon, Aug 8, 2016 at 1:14 PM, Devi P.V wrote: > Hi all, > > I am trying to write a spark dataframe into MS-Sql
