Re: [Error:] viewing Web UI on EMR cluster

2016-09-13 Thread Natu Lauchande
Hi, I think the Spark UI will be accessible whenever you launch a Spark app in the cluster; it should be the Application Tracker link. Regards, Natu On Tue, Sep 13, 2016 at 9:37 AM, Divya Gehlot wrote: > Hi, > Thank you all.. > Hurray ...I am able to view the hadoop web UI now @ 8088 . even

Question on set membership / diff sync technique in Spark

2016-07-26 Thread Natu Lauchande
Hi, I am working on a data pipeline in a Spark Streaming app that receives data as a CSV regularly. After some enrichment we send the data to another storage layer (ES in this case). Some of the records in the incoming CSV might be repeated. I am trying to devise a strategy based on MD5's of the l
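
A minimal sketch of that strategy, assuming duplicates are exact line repeats within a batch (the DStream name `lines` is illustrative, not from the thread):

    import java.security.MessageDigest

    // Hex-encoded MD5 of one CSV line, used as a dedup key.
    def md5(line: String): String =
      MessageDigest.getInstance("MD5")
        .digest(line.getBytes("UTF-8"))
        .map("%02x".format(_))
        .mkString

    // Per batch: keep a single record for each digest.
    val deduped = lines
      .map(line => (md5(line), line))
      .reduceByKey((first, _) => first) // arbitrary winner among duplicates
      .map { case (_, line) => line }

Note this only deduplicates within a batch; duplicates across batches would need state (e.g. updateStateByKey) or a lookup against the store.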

Can Spark Streaming checkpoint only metadata ?

2016-06-21 Thread Natu Lauchande
Hi, I wonder if it is possible to checkpoint only metadata and not the data in RDDs and DataFrames. Thanks, Natu
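
For reference, the two kinds of checkpointing are enabled in different places; a small sketch per the streaming programming guide (paths and intervals illustrative):

    val ssc = new StreamingContext(conf, Seconds(10))

    // Metadata checkpointing (configuration, DStream graph, incomplete batches)
    // is turned on by setting a checkpoint directory:
    ssc.checkpoint("hdfs:///checkpoints/myapp")

    // Data checkpointing (truncating RDD lineage) is separate; it is forced by
    // stateful transformations, or requested explicitly per stream:
    // stream.checkpoint(Seconds(50))

Stateful transformations (updateStateByKey, windowed reductions with inverse functions) require data checkpointing, so a purely stateless job effectively checkpoints metadata only.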

Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Natu Lauchande
Hi, I am running some Spark loads. I notice that it only uses one of the machines (instead of the 3 available) in the cluster. Is there any parameter that can be set to force it to use the whole cluster? I am using AWS EMR with Yarn. Thanks, Natu
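
On YARN the executor count usually has to be requested explicitly; a sketch of the relevant knobs (values illustrative, to be sized to the instance type):

    val conf = new SparkConf()
      .setAppName("my-load")
      .set("spark.executor.instances", "3") // one per worker node here
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "4g")

The same settings exist as spark-submit flags (--num-executors, --executor-cores, --executor-memory), and EMR can alternatively enable spark.dynamicAllocation.enabled.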

RE: difference between dataframe and dataframe.write

2016-06-16 Thread Natu Lauchande
Hi, Does anyone know which one AWS EMR uses by default? Thanks, Natu On Jun 16, 2016 5:12 PM, "David Newberger" wrote: > DataFrame is a collection of data which is organized into named columns. > > DataFrame.write is an interface for saving the contents of a DataFrame to > external storage.
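
To make the distinction concrete, a small sketch (input path and format illustrative):

    // A DataFrame is the data itself, organized into named columns.
    val df = sqlContext.read.json("people.json")

    // DataFrame.write returns a DataFrameWriter, which only describes how to save:
    df.write
      .mode("overwrite")
      .parquet("hdfs:///out/people") // nothing is written until this call runs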

Re: concat spark dataframes

2016-06-15 Thread Natu Lauchande
Hi, You can select the common columns and use DataFrame.unionAll. Regards, Natu On Wed, Jun 15, 2016 at 8:57 PM, spR wrote: > hi, > > how to concatenate spark dataframes? I have 2 frames with certain columns. > I want to get a dataframe with columns from both the other frames. > > Regards,
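
A sketch of that suggestion (unionAll matches columns by position, so both sides should select the shared columns in the same order):

    import org.apache.spark.sql.functions.col

    val common = df1.columns.intersect(df2.columns).toSeq
    val stacked = df1.select(common.map(col): _*)
      .unionAll(df2.select(common.map(col): _*))

If the goal is columns from both frames side by side (rather than stacked rows), a join on a shared key is the operation to reach for instead.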

Re: Spark Streaming checkpoint and restoring files from S3

2016-06-13 Thread Natu Lauchande
Hi, It seems to me that the checkpoint command is not persisting the SparkContext hadoop configuration correctly. Can this be a possibility? Thanks, Natu On Mon, Jun 13, 2016 at 11:57 AM, Natu Lauchande wrote: > Hi, > > I am testing disaster recovery from Spark and having some is
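
One workaround sketch, assuming the credentials really are lost on recovery: re-apply the Hadoop settings around StreamingContext.getOrCreate, whose factory only runs when no checkpoint exists (paths and key names illustrative):

    val checkpointDir = "s3n://bucketfoo/checkpoints"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... build the DStream graph here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    // Set credentials after recovery too, since they may not survive the checkpoint:
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)

getOrCreate also accepts a Hadoop Configuration argument, which is needed so the checkpoint itself can be read back from S3.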

Spark Streaming checkpoint and restoring files from S3

2016-06-13 Thread Natu Lauchande
Hi, I am testing disaster recovery from Spark and having some issues when trying to restore an input file from S3: 2016-06-13 11:42:52,420 [main] INFO org.apache.spark.streaming.dstream.FileInputDStream$FileInputDStreamCheckpointData - Restoring files for time 146581086 ms - [s3n://bucketfoo

Issues when using the streaming checkpoint

2016-06-09 Thread Natu Lauchande
Hi, I am having the following error when using checkpointing in a Spark Streaming app: java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable I am following the example available in https://github.com/apache/spark/blob/m
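
The usual culprit is a closure in the DStream graph capturing a non-serializable enclosing object; a sketch of the fix (class and field names illustrative):

    import org.apache.spark.streaming.dstream.DStream

    class AppConfig { val prefix: String = ">> " } // not Serializable

    class Pipeline(config: AppConfig) {
      def wire(lines: DStream[String]): DStream[String] = {
        val prefix = config.prefix         // copy the needed field to a local val,
        lines.map(line => prefix + line)   // so the closure captures only a String
      }
    }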

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Natu Lauchande
Hi Mich, I am also interested in the write-up. Regards, Natu On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh wrote: > Hi All, > > > Following the threads in spark forum, I decided to write up on > configuration of Spark including allocation of resources and configuration > of driver, executo

Re: Init/Setup worker

2016-05-10 Thread Natu Lauchande
Hi, Not sure if this might be helpful to you: https://github.com/ondra-m/ruby-spark. Regards, Natu On Tue, May 10, 2016 at 4:37 PM, Lionel PERRIN wrote: > Hello, > > > > I’m looking for a solution to use jruby on top of spark. The only tricky > point is that I need that every worker thread h

Re: DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
. If it’s 2 seconds then an RDD is created > every 2 seconds. > > > > Cheers, > > > > *David* > > > > *From:* Natu Lauchande [mailto:nlaucha...@gmail.com] > *Sent:* Tuesday, April 12, 2016 7:09 AM > *To:* user@spark.apache.org > *Subject:* DStream how

DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi, What is the criterion for the number of RDDs created in each micro-batch iteration? Thanks, Natu
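
For the built-in input streams the rule is one RDD per batch interval; a small sketch (interval illustrative):

    val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 seconds

    // The function below runs once per batch, on exactly one RDD for that batch:
    stream.foreachRDD { rdd =>
      println(s"one RDD this batch, ${rdd.partitions.length} partitions")
    }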

Can I have a hive context and sql context in the same app?

2016-04-12 Thread Natu Lauchande
Hi, Is it possible to have both a SQLContext and a HiveContext in the same application? If yes, would there be any performance penalties in doing so? Regards, Natu
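
For reference, both can be created from the same SparkContext in Spark 1.x; a minimal sketch:

    val sc = new SparkContext(conf)

    val sqlContext  = new org.apache.spark.sql.SQLContext(sc)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

Since HiveContext extends SQLContext, a single HiveContext usually suffices; the main cost of creating it is Hive client/metastore initialization rather than query-time overhead.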

Re: Unable run Spark in YARN mode

2016-04-09 Thread Natu Lauchande
How are you trying to run Spark? Locally? Via spark-submit? On Sat, Apr 9, 2016 at 7:57 AM, maheshmath wrote: > I have set SPARK_LOCAL_IP=127.0.0.1 still getting below error > > 16/04/09 10:36:50 INFO spark.SecurityManager: Changing view acls to: mahesh > 16/04/09 10:36:50 INFO spark.SecurityMana

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
, > Ben > > On Apr 8, 2016, at 9:15 PM, Natu Lauchande wrote: > > Hi Benjamin, > > I have done it . The critical configuration items are the ones below : > > ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", > "org.apache.hadoo

Re: Use only latest values

2016-04-09 Thread Natu Lauchande
I don't see this happening without a store. You can try Parquet on top of HDFS; that will at least avoid the burden of a third-party system. On 09 Apr 2016 9:04 AM, "Daniela S" wrote: > Hi, > > I would like to cache values and to use only the latest "valid" values to > build a sum. > In more detail, I r
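
A rough sketch of that suggestion, assuming readings arrive as (key, (timestamp, value)) pairs (column names and paths illustrative):

    import sqlContext.implicits._

    // Keep only the most recent reading per key ...
    val latest = readings.reduceByKey((a, b) => if (a._1 >= b._1) a else b)

    // ... and persist the compact state as Parquet on HDFS.
    latest.map { case (key, (ts, value)) => (key, ts, value) }
      .toDF("device", "timestamp", "value")
      .write.mode("overwrite").parquet("hdfs:///state/latest")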

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
s.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html > > > On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande > wrote: > >> Hi Benjamin, >> >> I have done it . The critical configuration items are the ones below : >> >>

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Natu Lauchande
Hi Benjamin, I have done it. The critical configuration items are the ones below: ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId) ssc.spar
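
A hedged completion of that snippet (the cut-off third line presumably sets the secret key; the bucket path and the use of textFileStream afterwards are illustrative):

    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
      "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", SecretAccessKey)

    // With the filesystem configured, the bucket can be watched like a directory:
    val lines = ssc.textFileStream("s3n://my-bucket/incoming/")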

Develop locally with Yarn

2016-04-07 Thread Natu Lauchande
Hi, I am working on a Spark Streaming app; when running locally I use "local[*]" as the master of my Spark Streaming Context. I wonder what would be needed to develop locally and run it on Yarn through the IDE. I am using IntelliJ IDEA. Thanks, Natu
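
One sketch of keeping the master switchable so the same code runs in IntelliJ and on the cluster (the system property name is illustrative; a yarn master launched from an IDE also needs HADOOP_CONF_DIR/YARN_CONF_DIR pointing at the cluster configuration):

    val master = sys.props.getOrElse("spark.master.override", "local[*]")
    val conf = new SparkConf().setAppName("my-streaming-app").setMaster(master)
    val ssc = new StreamingContext(conf, Seconds(10))
    // In the IDE: runs with local[*].
    // Against the cluster: pass -Dspark.master.override=yarn-client (Spark 1.x).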

Question around spark on EMR

2016-04-05 Thread Natu Lauchande
Hi, I am setting up a Scala Spark Streaming app in EMR. I wonder if anyone on the list can help me with the following questions: 1. What's the approach that you have been using to submit, in an EMR job step, environment variables that will be needed by the Spark application? 2. Can I have

Question Spark Streaming - S3 textFileStream - How to get the current file name?

2016-04-01 Thread Natu Lauchande
Hi, I am using Spark Streaming with the input strategy of watching for files in S3 directories, using the textFileStream method on the streaming context. The filename contains information relevant to my pipeline manipulation; I wonder if there is a more robust way to get this name other than captur
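
One commonly suggested workaround (a sketch, not from this thread): drop from textFileStream to fileStream and read the path off each input split. Note that FileInputDStream can union several per-file RDDs into one batch, which the fallback branch below does not fully handle:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.NewHadoopRDD

    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("s3n://my-bucket/in")

    val withName = stream.transform { rdd =>
      rdd match {
        case h: NewHadoopRDD[_, _] =>
          h.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
            .mapPartitionsWithInputSplit { (split, iter) =>
              val file = split.asInstanceOf[FileSplit].getPath.toString
              iter.map { case (_, line) => (file, line.toString) }
            }
        case other => // e.g. a UnionRDD when several files arrive in one batch
          other.map { case (_, line) => ("<unknown>", line.toString) }
      }
    }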

Re: Programmatically create RDDs based on input

2015-10-31 Thread Natu Lauchande
Hi Amit, I don't see any default constructor in the JavaRDD docs https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html . Have you tried the following? List<JavaRDD<String>> rdds = new ArrayList<>(); rdds.add(jsc.textFile("/file1.txt")); rdds.add(jsc.textFile("/file2.txt")); ... Natu On S

Re: How to lookup by a key in an RDD

2015-10-31 Thread Natu Lauchande
Hi, The lookup function documented here might help you: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions Natu On Sat, Oct 31, 2015 at 6:04 PM, swetha wrote: > Hi, > > I have a requirement wherein I have to load data from hdfs, build an RDD > and
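
For reference, a tiny sketch of lookup on a pair RDD:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // lookup runs a job and returns every value stored under the key.
    val as: Seq[Int] = pairs.lookup("a") // Seq(1, 3)

Each lookup call is a full job, so for many repeated point queries it is often cheaper to collectAsMap() once, or to use a partitioner so lookup only scans one partition.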

Re: Cache in Spark

2015-10-09 Thread Natu Lauchande
I don't think so. Spark does not keep results in memory unless you tell it to. You have to explicitly call the cache method on your RDD: linesWithSpark.cache() Thanks, Natu On Fri, Oct 9, 2015 at 10:47 AM, vinod kumar wrote: > Hi Guys, > > May I know whether cache is enabled in spark

Re: Networking issues with Spark on EC2

2015-09-25 Thread Natu Lauchande
Hi, Are you using EMR? Natu On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH wrote: > Hi Ankur, > Thanks for the reply. > This is already done. > If I wait for a long amount of time (10 minutes), a few tasks get > successful even on slave nodes. Sometimes, a fraction of the tasks (20%) are > complet

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Natu Lauchande
Are you using Scala in a distributed environment or in standalone mode? Natu On Thu, Dec 11, 2014 at 8:23 PM, ll wrote: > hi.. i'm converting some of my machine learning python code into scala + > spark. i haven't been able to run it on large dataset yet, but on small > datasets (like http:/

Re: Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-25 Thread Natu Lauchande
elaborated in a chapter of an upcoming book that's available > in early release; you can look at the accompanying source code to get > some ideas too: https://github.com/sryza/aas/tree/master/kmeans > > On Mon, Nov 24, 2014 at 10:17 PM, Natu Lauchande > wrote: > > Hi all

Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-24 Thread Natu Lauchande
Hi all, I am getting started with Spark. I would like to use it for a spike on anomaly detection in a massive stream of metrics. Can Spark easily handle this use case? Thanks, Natu
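
For context, the approach in the chapter referenced in the reply clusters the metrics with k-means and scores points by distance to the nearest centroid; a minimal sketch (feature parsing, k, and threshold all illustrative):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("hdfs:///metrics.csv")
      .map(_.split(',').map(_.toDouble))
      .map(values => Vectors.dense(values))
      .cache()

    val model = KMeans.train(data, k = 10, maxIterations = 20)

    // Anomaly score: squared distance to the closest cluster center.
    val scored = data.map { v =>
      (v, Vectors.sqdist(v, model.clusterCenters(model.predict(v))))
    }
    val anomalies = scored.filter { case (_, dist) => dist > 100.0 }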