Re: Is Spark right for my use case?

2016-08-08 Thread Deepak Sharma
Hi Danellis For point 1, Spark Streaming is something to look at. For point 2, you can create a DAO from Cassandra on each stream processing pass. This may be a costly operation, but to do real-time processing of data you have to live with it. Point 3 is covered in point 2 above. Since you are

Long running tasks in stages

2016-08-06 Thread Deepak Sharma
I am doing a join between one dataframe and an empty dataframe. The first dataframe has almost 50k records. This operation never returns and runs indefinitely. Is there any solution to get around this? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
ll config to overcome this. > Tried almost everything i could after searching online. > > Any help from the mailing list would be appreciated. > > On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> I am facing the same issue with spark 1.5

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
I am facing the same issue with Spark 1.5.2. If the file being processed by Spark is 10-12 MB in size, it throws an out-of-memory error. But if the same file is within the 5 MB limit, it runs fine. I am using a Spark configuration with 7GB of memory and 3 cores for executors in the cluster of 8
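
A minimal sketch of one common mitigation for this error, assuming a pre-Java-8 JVM; the 512m value and app name are placeholders, not settings confirmed in the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raising the PermGen ceiling is one common way to tackle this error;
// the 512m value here is an arbitrary example.
val conf = new SparkConf()
  .setAppName("permgen-sketch")
  // Executor JVMs are launched after this conf is read, so this takes effect:
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")

// The driver JVM is already running by this point, so the equivalent driver
// setting would normally go into spark-defaults.conf or be passed via
// spark-submit (e.g. --driver-java-options "-XX:MaxPermSize=512m").
val sc = new SparkContext(conf)
```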

Re: What are using Spark for

2016-08-02 Thread Deepak Sharma
Yes. I am using Spark for ETL, and I am sure there are a lot of other companies who are using Spark for ETL. Thanks Deepak On 2 Aug 2016 11:40 pm, "Rohit L" wrote: > Does anyone use Spark for ETL? > > On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal wrote:

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Deepak Sharma
I am using a DAO in a Spark application to write the final computation to Cassandra and it performs well. What kinds of issues do you foresee using a DAO for HBase? Thanks Deepak On 19 Jul 2016 10:04 pm, "Yu Wei" wrote: > Hi guys, > > > I write spark application and want to store

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-19 Thread Deepak Sharma
In Spark Streaming, you have to decide the duration of the micro batches to run. Once you get the micro batch, transform it as per your logic and then you can use saveAsTextFiles on your final DStream (or saveAsTextFile on each RDD) to write it to HDFS. Thanks Deepak On 20 Jul 2016 9:49 am, wrote:
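
A minimal sketch of that flow, assuming a socket source, a placeholder transformation, and illustrative host, port, batch interval and output path:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-to-hdfs-sketch")
val ssc = new StreamingContext(conf, Seconds(30))             // chosen batch duration

val lines = ssc.socketTextStream("localhost", 9999)           // stand-in source
val transformed = lines.filter(_.nonEmpty).map(_.toUpperCase) // stand-in logic

// Writes one directory per micro batch, e.g. hdfs://.../out-<timestamp>
transformed.saveAsTextFiles("hdfs:///user/example/out")

ssc.start()
ssc.awaitTermination()
```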

Re: RDD for loop vs foreach

2016-07-12 Thread Deepak Sharma
Hi Phil I guess for() is executed on the driver while foreach() will execute it in parallel on the executors. You can try this without collecting the RDD; try both. foreach in this case would print on the executors and you would not see anything on the driver console. Thanks Deepak On Tue, Jul 12, 2016 at 9:28 PM,
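
A small sketch contrasting the two, assuming sc is an existing SparkContext; where the println output ends up is the point being made above:

```scala
val rdd = sc.parallelize(1 to 10)

// foreach runs on the executors: this println output lands in the executor
// logs, not on the driver console.
rdd.foreach(x => println(s"executor-side: $x"))

// A plain for loop needs the data on the driver first (collect) and then
// iterates sequentially there, so the output is visible locally.
for (x <- rdd.collect()) println(s"driver-side: $x")
```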

Re: Is there a way to dynamic load files [ parquet or csv ] in the map function?

2016-07-08 Thread Deepak Sharma
Yes. You can do something like this: .map(x => mapfunction(x)) Thanks Deepak On 9 Jul 2016 9:22 am, "charles li" wrote: > > hi, guys, is there a way to dynamic load files within the map function. > > i.e. > > Can I code as below: > > > thanks a lot. > > > -- >
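
A minimal sketch of the pattern suggested above; mapfunction, its body and the input paths are placeholders. Note that the SparkContext itself cannot be used inside map on the executors, so the per-record function has to rely on plain I/O or libraries available on the workers:

```scala
// Sketch only: mapfunction and its return type are placeholders, not from the thread.
def mapfunction(path: String): Int = {
  // The real per-record logic would open `path` here using plain file I/O or a
  // library available on the workers; this stub just returns the path length.
  path.length
}

val paths = sc.parallelize(Seq("a.parquet", "b.parquet"))   // illustrative inputs
val results = paths.map(x => mapfunction(x))
```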

Re: One map per folder in spark or Hadoop

2016-07-07 Thread Deepak Sharma
You have to distribute the files in some distributed file system like HDFS. Otherwise, copy the files to every executor's local file system and make sure to mention the file scheme in the URI explicitly. Thanks Deepak On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A. wrote:
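
A short sketch of the two options, with illustrative paths; the point is only the explicit hdfs:// vs file:// scheme:

```scala
// 1. Files living on HDFS, readable from every executor:
val fromHdfs = sc.textFile("hdfs:///data/folders/folder1/part-*")

// 2. Files copied to the same local path on every executor, with the file://
//    scheme stated explicitly so each worker reads its local copy:
val fromLocal = sc.textFile("file:///opt/data/folder1/part-*")
```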

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Deepak Sharma
atic > write a size properly for what I already set in Alluxio 512MB per block. > > > On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote: > > Before writing, coalesce your RDD to 1. > It will create only 1 output file. > Multiple part file happens
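
A minimal sketch of the coalesce-before-write advice, assuming df is an existing DataFrame and the Alluxio output path is made up; coalesce(1) funnels everything through one task, so it only suits modest data sizes:

```scala
// df is assumed to be an existing DataFrame; the Alluxio URI is illustrative.
df.coalesce(1)
  .write
  .parquet("alluxio://alluxio-master:19998/output/events")
```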

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-13 Thread Deepak Sharma
Hi Ajay Looking at the Spark code, I can see you used HiveContext. Can you try using SQLContext instead of HiveContext there? Thanks Deepak On Mon, Jun 13, 2016 at 10:15 PM, Ajay Chander wrote: > Hi Mohit, > > Thanks for your time. Please find my response below. > > Did

Re: Spark_Usecase

2016-06-07 Thread Deepak Sharma
I am not sure if Spark inherently provides any support for incremental extracts. But you can maintain a file, e.g. extractRange.conf in HDFS: read the end range from it, and have the Spark job update it with the new end range before it finishes, so the relevant ranges can be used next time. On
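
A rough sketch of that bookkeeping-file idea; the path, the one-value-per-file format and the use of the current timestamp as the new end range are all assumptions:

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val rangePath = new Path("/conf/extractRange.conf")          // illustrative location

// Read the end range recorded by the previous run.
val previousEnd = {
  val in = fs.open(rangePath)
  try scala.io.Source.fromInputStream(in).mkString.trim finally in.close()
}

// ... run the extract for records newer than previousEnd ...

// Overwrite the file with the new end range before the job finishes.
val newEnd = System.currentTimeMillis().toString             // assumed range format
val out = fs.create(rangePath, true)
try out.write(newEnd.getBytes(StandardCharsets.UTF_8)) finally out.close()
```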

Re: Accessing s3a files from Spark

2016-05-31 Thread Deepak Sharma
Hi Mayuresh Instead of s3a, have you tried the https:// URI for the same S3 bucket? HTH Deepak On Tue, May 31, 2016 at 4:41 PM, Mayuresh Kunjir wrote: > > > On Tue, May 31, 2016 at 5:29 AM, Steve Loughran > wrote: > >> which s3 endpoint? >> >> >

Re: Query related to spark cluster

2016-05-30 Thread Deepak Sharma
Hi Saurabh You can have a Hadoop cluster running YARN as the scheduler. Configure Spark to run with the same YARN setup. Then you need R on only 1 node, and connect to the cluster using SparkR. Thanks Deepak On Mon, May 30, 2016 at 12:12 PM, Jörn Franke wrote: > > Well if

How to map values read from text file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file so my function in Scala can return 2 different RDDs, with each RDD of
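
One possible sketch of this, with made-up field indices, delimiter and case-class fields; the input is parsed once and two RDDs are derived from it:

```scala
// Placeholder case classes and field positions; only the shape of the approach matters.
case class Case1(id: String, name: String)
case class Case2(id: String, value: String)

val lines = sc.textFile("hdfs:///data/input.txt")               // illustrative path
val fields = lines.map(_.split(",", -1)).filter(_.length == 16)
fields.cache()                                                  // parsed once, reused twice

val rdd1 = fields.map(f => Case1(f(0), f(1)))                   // made-up indices
val rdd2 = fields.map(f => Case2(f(0), f(5)))
// A single function can hand both back as a pair: (rdd1, rdd2)
```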

How to map values read from text file to 2 different sets of RDDs

2016-05-22 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file so my function in Scala can return 2 different RDDs, with each RDD of

Debug spark core and streaming programs in scala

2016-05-15 Thread Deepak Sharma
Hi I have a Scala program consisting of Spark core and Spark Streaming APIs. Is there any open source tool that I can use to debug the program for performance reasons? My primary interest is to find the blocks of code that would be executed on the driver and what would go to the executors. Is there JMX

Re: Graceful shutdown of spark streaming on yarn

2016-05-13 Thread Deepak Sharma
dead and it shuts down abruptly. >> Could this issue be related to yarn? I see correct behavior locally. I >> did "yarn kill " to kill the job. >> >> >> On Thu, May 12, 2016 at 12:28 PM Deepak Sharma <deepakmc...@gmail.com> >> wrote: >>

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
(Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Yes, it seems to be the case. > In this case executors should have continued logging values till 300, but > they are shutdown as soon as i do "yarn kill .." > > On Thu, May 12, 2016 at 12:11 PM Deepak Sharma

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
er$: VALUE -> 205 > 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206 > > > > > > > On Thu, May 12, 2016 at 11:45 AM Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Hi Rakesh >> Did you tried se

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
Hi Rakesh Did you try setting spark.streaming.stopGracefullyOnShutdown to true for your Spark configuration instance? If not, try this and let us know if it helps. Thanks Deepak On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Issue i
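
A minimal sketch of where that flag goes, with an illustrative app name; with it set, the StreamingContext tries to finish in-flight batches on shutdown rather than stopping immediately:

```scala
import org.apache.spark.SparkConf

// App name is illustrative; only the config key matters here.
val conf = new SparkConf()
  .setAppName("graceful-shutdown-sketch")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
```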

Re: Setting Spark Worker Memory

2016-05-11 Thread Deepak Sharma
Since you are registering workers from the same node, do you have enough cores and RAM (in this case >= 9 cores and >= 24 GB) on this node (11.14.224.24)? Thanks Deepak On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ wrote: > Hi All, > > I need to set same memory and

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
then apply > compression codec on it, save the rdd to another Hadoop cluster? > > Thank you, > Ajay > > On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote: > >> Hi Ajay >> You can look at wholeTextFiles method of rdd[string,string] and

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
Hi Ajay You can look at the wholeTextFiles method, which gives an RDD[(String, String)], and then save each RDD with saveAsTextFile. This will serve the purpose. I don't think anything like distcp exists in Spark by default. Thanks Deepak On 10 May 2016 11:27 pm, "Ajay Chander" wrote: > Hi
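
A rough sketch of that approach; the namenode addresses, paths and choice of Gzip codec are illustrative, and it only suits files small enough to be handled as whole records:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// RDD[(path, content)] from the source cluster, written compressed to the target.
val files = sc.wholeTextFiles("hdfs://source-nn:8020/data/input")

files.map { case (_, content) => content }
  .saveAsTextFile("hdfs://target-nn:8020/data/migrated", classOf[GzipCodec])
```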

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Deepak Sharma
Spark 2.0 is yet to come out for public release. I am waiting to get my hands on it as well. Please do let me know if I can download the source and build Spark 2.0 from GitHub. Thanks Deepak On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote: > Hi All, > > We are evaluating a

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Deepak Sharma
With Structured Streaming, Spark would provide APIs over the Spark SQL engine. It's like once you have the structured stream and a dataframe created out of it, you can do ad-hoc querying on the DF, which means you are actually querying the stream without having to store or transform it first. I have not used

Re: migration from Teradata to Spark SQL

2016-05-03 Thread Deepak Sharma
Hi Tapan I would suggest an architecture where you have separate storage and data serving layers. Spark is still best for batch processing of data. So what I am suggesting here is that you can have your data stored as-is in some HDFS raw layer, run your ELT in Spark on this raw data and

Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Deepak Sharma
Hi all, I am looking for an architecture to ingest 10 million messages in micro batches of seconds. If anyone has worked on a similar kind of architecture, can you please point me to any documentation around the same, like what should be the architecture, which all components/big data

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
as trying to > run big data stuff on windows. Have run in so much of issues that I could > just throw the laptop with windows out. > > Your view - Redhat, Ubuntu or Centos. > Does Redhat give a one year licence on purchase etc? > > Thanks > > On Mon, Apr 18, 2016 at

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
re Galore on Spark. > Since I am starting afresh, what would you advice? > > On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Binary for Spark means ts spark built against hadoop 2.6 >> It will not have any hadoop executables. &

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
binary format or will have to build it? > 3) Is there a basic tutorial for Hadoop on windows for the basic needs of > Spark. > > Thanks in Advance ! > > On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Once you download hadoop

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
Once you download Hadoop and format the namenode, you can use start-dfs.sh to start HDFS. Then use 'jps' to see if the datanode/namenode services are up and running. Thanks Deepak On Mon, Apr 18, 2016 at 5:18 PM, My List wrote: > Hi , > > I am a newbie on Spark.I wanted to

LinkedIn streams in Spark

2016-04-10 Thread Deepak Sharma
Hello All, I am looking for a use case where anyone has used Spark Streaming integration with LinkedIn. -- Thanks Deepak

Re: Detecting application restart when running in supervised cluster mode

2016-04-05 Thread Deepak Sharma
Hi Rafael If you are using YARN as the engine, you can always use the RM UI to see the application progress. Thanks Deepak On Tue, Apr 5, 2016 at 12:18 PM, Rafael Barreto wrote: > Hello, > > I have a driver deployed using `spark-submit` in supervised cluster mode. >

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop clsuter

2016-03-07 Thread Deepak Sharma
There is a Spark action defined for Oozie workflows, though I am not sure whether it supports only Java Spark jobs or Scala jobs as well. https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html Thanks Deepak On Mon, Mar 7, 2016 at 2:44 PM, Divya Gehlot wrote: > Hi, >

Re: Newbie question

2016-01-07 Thread Deepak Sharma
Yes, you can do it unless the method is marked static/final. Most of the methods in SparkContext are marked static, so you definitely can't override those; otherwise, overriding would usually work. Thanks Deepak On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman wrote: >

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Invalid jobj 2. If SparkR was restarted, Spark operations need to be > re-executed. > > > Not sure what is causing this? Any leads or ideas? I am using rstudio. > > > > On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Hi Sandee

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep I am not sure if ORC can be read directly in R, but there can be a workaround: first create a Hive table on top of the ORC files and then access the Hive table in R. Thanks Deepak On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana wrote: > Hello > > I need to read an ORC
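
A sketch of the workaround expressed from the Spark/Hive side: expose the ORC files as an external Hive table, which SparkR (or any Hive client) can then query. The table name, columns and location are assumptions:

```scala
import org.apache.spark.sql.hive.HiveContext

// Table name, columns and location are made up for the sketch.
val hiveContext = new HiveContext(sc)
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS orc_events (id STRING, payload STRING)
  STORED AS ORC
  LOCATION '/data/orc/events'
""")
```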

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Deepak Sharma
I have never tried this, but there are YARN client APIs that you can use in your Spark program to get the application ID. Here is the link to the YarnClient Java doc: http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html getApplications() is the method for your
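
A minimal sketch of that idea, assuming the YARN configuration is on the classpath and that filtering by a hypothetical application name ("MySparkApp") is how you pick out your own job:

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// List all applications known to the RM and keep the ids whose name matches
// the (hypothetical) Spark app name.
val myAppIds = yarnClient.getApplications.asScala
  .filter(_.getName == "MySparkApp")
  .map(_.getApplicationId)

yarnClient.stop()
```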

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Deepak Sharma
An approach I can think of is using the Ambari Metrics Service (AMS). Using these metrics, you can decide whether the cluster is low on resources. If yes, call the Ambari management API to add a node to the cluster. Thanks Deepak On Mon, Dec 14, 2015 at 2:48 PM, cs user

Any role for volunteering

2015-12-04 Thread Deepak Sharma
Hi All Sorry for spamming your inbox. I am really keen to work on a big data project full time (preferably remote from India); if not, I am open to volunteering as well. Please do let me know if there is any such opportunity available. -- Thanks Deepak

Re: Hive on Spark orc file empty

2015-11-16 Thread Deepak Sharma
Sai, I am a bit confused here. How are you using write with results? I am using Spark 1.4.1 and when I use write, it complains about write not being a member of DataFrame: error: value write is not a member of org.apache.spark.sql.DataFrame Thanks Deepak On Mon, Nov 16, 2015 at 4:10 PM, 张炜

Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
Hi All I am confused about RDD persistence in cache. If I cache an RDD, is it going to stay there in memory even after the Spark program that created it completes execution? If not, how can I guarantee that the RDD is persisted in cache even after the program finishes execution? Thanks Deepak

Re: Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
<engr...@gmail.com> wrote: > The cache gets cleared out when the job finishes. I am not aware of a way > to keep the cache around between jobs. You could save it as an object file > to disk and load it as an object file on your next job for speed. > On Thu, Nov 5, 2015 at 6:1

Best practises

2015-10-30 Thread Deepak Sharma
Hi I am looking for any blog / doc on developer best practices when using Spark. I have already looked at the tuning guide on spark.apache.org. Please do let me know if anyone is aware of any such resource. Thanks Deepak
