Re: [Structured Streaming][Kafka] For a Kafka topic with 3 partitions, how does the parallelism work?

2018-04-21 Thread Raghavendra Pandey
Yes, as long as there are 3 cores available on your local machine. On Fri, Apr 20, 2018 at 10:56 AM karthikjay wrote: > I have the following code to read data from Kafka topic using the > structured > streaming. The topic has 3 partitions: > > val spark = SparkSession >
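A minimal sketch of the setup under discussion (broker address and topic name are hypothetical); with master local[3] there are enough cores for all three Kafka partitions to be consumed in parallel:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[3]")   // at least 3 cores, one per Kafka partition
      .appName("KafkaParallelismSketch")
      .getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
      .option("subscribe", "my-topic")                       // hypothetical 3-partition topic
      .load()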

Kafka version support

2017-11-29 Thread Raghavendra Pandey
Just wondering if anyone has tried the Spark structured streaming Kafka connector (2.2) with Kafka 0.11 or Kafka 1.0. Thanks, Raghav

Multiple queries on same stream

2017-08-08 Thread Raghavendra Pandey
I am using structured streaming to evaluate multiple rules on the same running stream. I have two options to do that. One is to use foreach and evaluate all the rules on each row. The other option is to express the rules in the Spark SQL DSL and run multiple queries. I was wondering if option 1 will result in

Re: Spark Streaming: Async action scheduling inside foreachRDD

2017-08-04 Thread Raghavendra Pandey
Did you try SparkContext.addSparkListener? On Aug 3, 2017 1:54 AM, "Andrii Biletskyi" wrote: > Hi all, > > What is the correct way to schedule multiple jobs inside foreachRDD method > and importantly await on result to ensure those jobs have completed >
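A rough sketch of the listener approach (the counter and the SparkContext variable sc are assumed for illustration); SparkContext.addSparkListener registers callbacks such as onJobEnd that fire when a job finishes:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

    val finishedJobs = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        finishedJobs.incrementAndGet()   // invoked once per completed job
      }
    })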

Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-01-25 Thread Raghavendra Pandey
When you start a slave you pass the address of the master as a parameter. That slave will contact the master and register itself. On Jan 25, 2017 4:12 AM, "kant kodali" wrote: > Hi, > > How do I dynamically add nodes to spark standalone cluster and be able to > discover them? Does

Re: Nested ifs in sparksql

2017-01-11 Thread Raghavendra Pandey
ippet ? >> On Tue, Jan 10, 2017 8:15 PM, Georg Heiler georg.kf.hei...@gmail.com >> wrote: >> Maybe you can create a UDF? >> Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote on Tue., 10 >> Jan. 2017 at 20:04
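A hedged sketch of the UDF idea suggested here, with a hypothetical input column and bucketing rules standing in for the real 41-level logic:

    import org.apache.spark.sql.functions.udf

    // collapse the nested if-else branches into one plain Scala function
    val bucket = udf { (value: Int) =>
      if (value < 10) "low"
      else if (value < 100) "medium"
      else "high"
    }
    val result = df.withColumn("bucket", bucket(df("value")))   // df and "value" are assumptions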

Nested ifs in sparksql

2017-01-10 Thread Raghavendra Pandey
I have around 41 levels of nested if-else in Spark SQL. I have programmed it using APIs on DataFrame, but it takes too much time. Is there anything I can do to improve the run time here?

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Raghavendra Pandey
How large is your first text file? The idea is that you read the first text file and, if it is not large, you can collect all the lines on the driver and then read the text files for each line and union all the RDDs. On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" wrote: > Just wonder if this

Re: Passing Custom App Id for consumption in History Server

2016-09-03 Thread Raghavendra Pandey
The default implementation is to add milliseconds. For Mesos it is the framework-id. If you are using Mesos, you can assume that the framework id used to register your app is the same as the app-id. Since, as you said, you have a system application to schedule Spark jobs, you can keep track of the framework-ids submitted

Re: how to pass trustStore path into pyspark ?

2016-09-03 Thread Raghavendra Pandey
Did you try passing them in spark-env.sh? On Sat, Sep 3, 2016 at 2:42 AM, Eric Ho wrote: > I'm trying to pass a trustStore pathname into pyspark. > What env variable and/or config file or script I need to change to do this > ? > I've tried setting JAVA_OPTS env var but to

Re: Importing large file with SparkContext.textFile

2016-09-03 Thread Raghavendra Pandey
If your file format is splittable, say TSV, CSV etc., it will be distributed across all executors. On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar < somasundar.se...@tigeranalytics.com> wrote: > Hi All, > > > > Would like to gain some understanding on the questions listed below, > > > > 1.

Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Raghavendra Pandey
Kapil -- I'm afraid you need to plug in your own SessionCatalog, as the ResolveRelations class depends on that. To keep up with a consistent design you may like to implement ExternalCatalog as well. You can also look at plugging in your own Analyzer class to give you more flexibility. Ultimately that is where

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Raghavendra Pandey
Can you please send it to me as well? Thanks, Raghav On 12 May 2016 20:02, "Tom Ellis" wrote: > I would like to also Mich, please send it through, thanks! > > On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote: > >> Me too, send me the guide. >> >> Sent from

Re: Executor memory requirement for reduceByKey

2016-05-17 Thread Raghavendra Pandey
Even though it does not sound intuitive, reduceByKey expects all the values for a particular key within a partition to be loaded into memory. So once you increase the number of partitions you can run the job.
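A minimal sketch of the suggestion, assuming a hypothetical pair RDD; reduceByKey accepts an explicit partition count, so each partition holds fewer values per key:

    // hypothetical word-count style pair RDD
    val pairs = sc.textFile("hdfs:///input").flatMap(_.split(" ")).map((_, 1))

    // pass a larger partition count so each partition fits in executor memory
    val counts = pairs.reduceByKey(_ + _, 400)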

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
org/docs/latest/running-on-mesos.html> * >> >> On Wed, May 11, 2016 at 10:05 PM, Giri P <gpatc...@gmail.com> wrote: >> >>> I'm not using docker >>> >>> On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey < >>> raghavendra.pan...@

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
By any chance, are you using docker to execute? On 11 May 2016 21:16, "Raghavendra Pandey" <raghavendra.pan...@gmail.com> wrote: > On 11 May 2016 02:13, "gpatcham" <gpatc...@gmail.com> wrote: > > > > > > Hi All, > > > > I'm u

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
On 11 May 2016 02:13, "gpatcham" wrote: > > Hi All, > > I'm using the --jars option in spark-submit to send 3rd party jars. But I don't > see that they are actually passed to the mesos slaves. Getting "no class found" > exceptions. > > This is how I'm using the --jars option > > --jars

Re: Spark 1.6.0: substring on df.select

2016-05-11 Thread Raghavendra Pandey
You can create a column with the count of '/'. Then take the max of it and create that many columns for every row, with null fillers. Raghav On 11 May 2016 20:37, "Bharathi Raja" wrote: Hi, I have a dataframe column col1 with values something like

Re: [Spark 1.4.1] StructField for date column in CSV file while creating custom schema

2015-12-29 Thread Raghavendra Pandey
You can use the date type... On Dec 29, 2015 9:02 AM, "Divya Gehlot" wrote: > Hi, > I am a newbie to Spark, > my apologies for such a naive question. > I am using Spark 1.4.1 and writing code in Scala. I have input data as a > CSV file which I am parsing using the spark-csv package

Re: Save to parquet files failed

2015-10-22 Thread Raghavendra Pandey
Can you increase the number of partitions and try... Also, I don't think you need to cache DFs before saving them... You can do away with that as well... Raghav On Oct 23, 2015 7:45 AM, "Ram VISWANADHA" wrote: > Hi, > I am trying to load a 931MB file into an RDD, then

Re: REST api to avoid spark context creation

2015-10-18 Thread Raghavendra Pandey
You may like to look at Spark Job Server: https://github.com/spark-jobserver/spark-jobserver Raghavendra

Re: repartition vs partitionby

2015-10-17 Thread Raghavendra Pandey
You can use the coalesce function if you want to reduce the number of partitions; it minimizes the data shuffle. -Raghav On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri wrote: > Hi folks > > I need to repartition a large set of data, around 300G, as I see some portions >
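A short sketch of the difference, on a hypothetical RDD:

    val rdd = sc.textFile("hdfs:///big-input")    // hypothetical ~300G input

    val fewer = rdd.coalesce(200)          // shrinks the partition count, avoids a full shuffle
    val reshuffled = rdd.repartition(200)  // full shuffle; can also grow the partition count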

Re: s3a file system and spark deployment mode

2015-10-17 Thread Raghavendra Pandey
You can add classpath info in the Hadoop env file... Add the following line to your $HADOOP_HOME/etc/hadoop/hadoop-env.sh: export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/* Add the following line to $SPARK_HOME/conf/spark-env.sh: export

Re: Complex transformation on a dataframe column

2015-10-17 Thread Raghavendra Pandey
Here is a quick code sample I can come up with : case class Input(ID:String, Name:String, PhoneNumber:String, Address: String) val df = sc.parallelize(Seq(Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"))).toDF() val formatAddress = udf { (s: String) =>
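The sample above is cut off in the archive; a hedged completion, assuming the UDF simply splits the colon-delimited address into its parts:

    import sqlContext.implicits._             // assuming a SQLContext named sqlContext is in scope
    import org.apache.spark.sql.functions.udf

    case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)
    val df = sc.parallelize(Seq(
      Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"))).toDF()

    // hypothetical completion: split "houseNo:StreetNo:City:State:Zip" into its components
    val formatAddress = udf { (s: String) =>
      val Array(houseNo, streetNo, city, state, zip) = s.split(":")
      (houseNo, streetNo, city, state, zip)   // the returned tuple becomes a struct column
    }
    val withAddress = df.withColumn("addressParts", formatAddress(df("Address")))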

Re: s3a file system and spark deployment mode

2015-10-15 Thread Raghavendra Pandey
You can use the Spark 1.5.1 'without Hadoop' build with Hadoop 2.7.1... Hadoop 2.7.1 is more mature for s3a access. You also need to put the Hadoop tools dir on the Hadoop classpath... Raghav On Oct 16, 2015 1:09 AM, "Scott Reynolds" wrote: > We do not use EMR. This is deployed on Amazon VMs

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Raghavendra Pandey
Looks like you are facing an IPv6 issue. Can you try turning the preferIPv4 property on? On Oct 15, 2015 2:10 AM, "Steve Loughran" wrote: > > On 14 Oct 2015, at 20:56, Marco Mistroni wrote: > > > 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to

Re: Spark Master Dying saying TimeoutException

2015-10-14 Thread Raghavendra Pandey
I fixed these timeout errors by retrying... On Oct 15, 2015 3:41 AM, "Kartik Mathur" wrote: > Hi, > > I have some nightly jobs which run every night but die sometimes because > of an unresponsive master; the spark master logs say - > > Not seeing much else there, what could

Re: How to compile Spark with customized Hadoop?

2015-10-10 Thread Raghavendra Pandey
There is a 'Spark without Hadoop' version... You can use that to link with any custom Hadoop version. Raghav On Oct 10, 2015 5:34 PM, "Steve Loughran" wrote: > > During development, I'd recommend giving Hadoop a version ending with > -SNAPSHOT, and building spark with maven,

Re: Spark based Kafka Producer

2015-09-11 Thread Raghavendra Pandey
, Atul Kulkarni <atulskulka...@gmail.com> wrote: > I am submitting the job with yarn-cluster mode. > > spark-submit --master yarn-cluster ... > > On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey < > raghavendra.pan...@gmail.com> wrote: > >> What is the

Re: about mr-style merge sort

2015-09-10 Thread Raghavendra Pandey
In MR jobs, the output is sorted only within a reducer. That can be better emulated by sorting each partition of the RDD rather than doing a total sort of the RDD. In RDD.mapPartitions you can sort the data in one partition and try... On Sep 11, 2015 7:36 AM, "周千昊" wrote: > Hi, all >

Re: Spark based Kafka Producer

2015-09-10 Thread Raghavendra Pandey
What is the value of the spark master conf? By default it is local, which means only one thread can run, and that is why your job is stuck. Specify local[*] to make the thread pool equal to the number of cores... Raghav On Sep 11, 2015 6:06 AM, "Atul Kulkarni" wrote: > Hi Folks,

Re: Parquet partitioning for unique identifier

2015-09-02 Thread Raghavendra Pandey
Did you specify the partitioning column while saving the data? On Sep 3, 2015 5:41 AM, "Kohki Nishio" wrote: > Hello experts, > > I have a huge json file (> 40G) and trying to use Parquet as a file > format. Each entry has a unique identifier but other than that, it doesn't > have

Re: Unable to run Group By on Large File

2015-09-02 Thread Raghavendra Pandey
You can increase the number of partitions and try... On Sep 3, 2015 5:33 AM, "Silvio Fiorito" wrote: > Unfortunately, groupBy is not the most efficient operation. What is it > you’re trying to do? It may be possible with one of the other *byKey > transformations. > >

Re: FlatMap Explanation

2015-09-02 Thread Raghavendra Pandey
flatMap is just like map, but it flattens out the Seq output of the closure... In your case, you call the "to" function, which returns a list... a.to(b) returns List(a, ..., b). So rdd.flatMap(x => x.to(3)) will take each element and return the range up to 3... On Sep 3, 2015 7:36 AM, "Ashish Soni"
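A tiny sketch of the difference described above:

    val rdd = sc.parallelize(Seq(1, 2, 3))

    rdd.map(x => x.to(3)).collect()      // Array(Range(1, 2, 3), Range(2, 3), Range(3))
    rdd.flatMap(x => x.to(3)).collect()  // Array(1, 2, 3, 2, 3, 3) -- the ranges are flattened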

Re: Spark Version upgrade issue: Exception in thread main java.lang.NoSuchMethodError

2015-08-30 Thread Raghavendra Pandey
Looks like your Jackson package and Spark's are at different versions. Raghav On Aug 28, 2015 4:01 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi Team, I upgraded older Spark versions to 1.4.1; after a Maven build I tried to run my simple application but it failed, giving the

Re: Array Out OF Bound Exception

2015-08-29 Thread Raghavendra Pandey
So either you have an empty line at the end, or when you use String.split you don't specify -1 as the second parameter... On Aug 29, 2015 1:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I suspect in the last scenario you are having an empty new line at the last line. If you put a try..catch you'd

Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Raghavendra Pandey
We are using Cassandra for a similar kind of problem and it works well... You need to take care of the race condition between updating the store and looking up the store... On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote: +1 on Jason's suggestion. bq. this large variable is broadcast many

Re: org.apache.spark.shuffle.FetchFailedException

2015-08-25 Thread Raghavendra Pandey
Did you try increasing the number of SQL partitions? On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com wrote: I am running this query on a data size of 4 billion rows and getting org.apache.spark.shuffle.FetchFailedException error. select adid,position,userid,price from ( select

Re: How to set environment of worker applications

2015-08-24 Thread Raghavendra Pandey
...@gmail.com wrote: spark-env.sh works for me in Spark 1.4 but not spark.executor.extraJavaOptions. On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I think the only way to pass on environment variables to worker node is to write it in spark-env.sh file

Re: How to set environment of worker applications

2015-08-23 Thread Raghavendra Pandey
I think the only way to pass environment variables on to worker nodes is to write them in the spark-env.sh file on each worker node. On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com wrote: Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the

Re: How to list all dataframes and RDDs available in current session?

2015-08-21 Thread Raghavendra Pandey
You can get the list of all the persisted RDDs using the Spark context... On Aug 21, 2015 12:06 AM, Rishitesh Mishra rishi80.mis...@gmail.com wrote: I am not sure if you can view all RDDs in a session. Tables are maintained in a catalogue, hence it's easier. However you can see the DAG representation
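A minimal sketch, assuming sc is the active SparkContext:

    // Map[rddId, RDD[_]] of everything currently marked persistent in this context
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
    }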

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Raghavendra Pandey
Did you try with Hadoop version 2.7.1? It is known that s3a, which is available in 2.7, works really well with Parquet. They fixed a lot of issues related to metadata reading there... On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote: @Cheng, Hao : Physical plans show that it

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Raghavendra Pandey
Impact, you can group the data and then sort it by timestamp and take the max to select the oldest value. On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote: I am also looking for a way to achieve the reduceByKey functionality on data frames. In my case I need to select one particular row

Re: How to specify column type when saving DataFrame as parquet file?

2015-08-14 Thread Raghavendra Pandey
I think you can try the DataFrame create API that takes an RDD[Row] and a StructType... On Aug 11, 2015 4:28 PM, Jyun-Fan Tsai jft...@appier.com wrote: Hi all, I'm using Spark 1.4.1. I create a DataFrame from a JSON file. There is a column C whose values are all null in the JSON file. I found that

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Raghavendra Pandey
In Spark 1.4 there is a parameter to control that. Its default value is 10 MB. So you need to cache your DataFrame to hint at its size. On Aug 14, 2015 7:09 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi I am facing a huge performance problem when I am trying to left outer join a very big
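The parameter being referred to is presumably spark.sql.autoBroadcastJoinThreshold (10 MB by default); a sketch of raising it and hinting the size of a hypothetical lookup DataFrame:

    // raise the auto-broadcast threshold to ~100 MB (hypothetical size for the lookup table)
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

    smallLookupDf.cache().count()   // materializing the cache lets Spark estimate its size
    val joined = bigDf.join(smallLookupDf, bigDf("id") === smallLookupDf("id"), "left_outer")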

Re: DataFrame column structure change

2015-08-08 Thread Raghavendra Pandey
You can use the struct function of the org.apache.spark.sql.functions class to combine two columns into a struct column. Something like: val nestedCol = struct(df("d"), df("e")); df.select(df("a"), df("b"), df("c"), nestedCol) On Aug 7, 2015 3:14 PM, Rishabh Bhardwaj rbnex...@gmail.com wrote: I am doing it by creating
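Spelled out with the required import (df and the column names are assumptions):

    import org.apache.spark.sql.functions.struct

    val nestedCol = struct(df("d"), df("e")).as("de")   // nested struct column named "de"
    val result = df.select(df("a"), df("b"), df("c"), nestedCol)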

Spark SQL jobs and their partitions

2015-08-08 Thread Raghavendra Pandey
I have complex transformation requirements that I am implementing using DataFrames. It involves a lot of joins, also with a Cassandra table. I was wondering how I can debug the jobs and stages queued by Spark SQL the way I can for RDDs. In one of the cases, Spark SQL creates more than 17 lakh (1.7 million) tasks for

Create StructType column in data frame

2015-07-27 Thread Raghavendra Pandey
Hello, I would like to add a column of StructType to a DataFrame. What would be the best way to do it? Not sure if it is possible using withColumn. A possible way is to convert the DataFrame into an RDD[Row], add the struct and then convert it back to a DataFrame. But that seems like overkill. Please

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Raghavendra Pandey
If you cache the RDD it will save some operations. But anyway, filter is a lazy operation, and it runs based on what you do later on with rdd1 and rdd2... Raghavendra On Jul 16, 2015 1:33 PM, Bin Wang wbi...@gmail.com wrote: If I write code like this: val rdd = input.map(_.value) val f1 =
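A small sketch of the caching point, reusing the names from the question (input, f1, f2 are assumed):

    val rdd = input.map(_.value).cache()   // cache so the map runs once, not once per filter

    val rdd1 = rdd.filter(f1)
    val rdd2 = rdd.filter(f2)

    rdd1.count()   // first action materializes the cache
    rdd2.count()   // reuses the cached data instead of recomputing the map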

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Raghavendra Pandey
if I would use both rdd1 and rdd2 later? Raghavendra Pandey raghavendra.pan...@gmail.com wrote on Thu, Jul 16, 2015 at 4:08 PM: If you cache the RDD it will save some operations. But anyway, filter is a lazy operation, and it runs based on what you do later on with rdd1 and rdd2... Raghavendra On Jul 16, 2015

Re: Filter on Grouped Data

2015-07-03 Thread Raghavendra Pandey
Why don't you apply the filter first and then group the data and run aggregations? On Jul 3, 2015 1:29 PM, Megha Sridhar- Cynepia megha.sridh...@cynepia.com wrote: Hi, I have a Spark DataFrame object, which when trimmed, looks like, FromTo Subject

Re: Streaming: updating broadcast variables

2015-07-03 Thread Raghavendra Pandey
You cannot update a broadcast variable... It won't get reflected on the workers. On Jul 3, 2015 12:18 PM, James Cole ja...@binarism.net wrote: Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to

Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
It is the basic design of Spark that it runs all actions in different stages... Not sure you can achieve what you are looking for. On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote: Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like

Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I don't think you can expect any ordering guarantee except for the records within one partition. On Jul 4, 2015 7:43 AM, khaledh khal...@gmail.com wrote: I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of

Re: Spark SQL and Streaming - How to execute JDBC Query only once

2015-07-02 Thread Raghavendra Pandey
This will not work, i.e. using a DataFrame inside a map function... Although you can try to create the DF separately and cache it... Then you can join your event stream with this DF. On Jul 2, 2015 6:11 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi All, I have a stream of events coming in and I want

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-02 Thread Raghavendra Pandey
will be the best way to do it, Ashish On Wed, Jul 1, 2015 at 10:45 PM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: You cannot refer to one RDD inside another RDD's map function... The RDD object is not serializable. Whatever objects you use inside the map function should be serializable

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Raghavendra Pandey
So do you want to change the behavior of the persist API, or write the RDD to disk? On Jul 1, 2015 9:13 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I think I want to use persist then and write my intermediate RDDs to disk+mem. On Wed, Jul 1, 2015 at 8:28 AM, Raghavendra Pandey raghavendra.pan

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Raghavendra Pandey
By any chance, are you using a time field in your DF? Time fields are known to be notorious in RDD conversion. On Jul 1, 2015 6:13 PM, Pooja Jain pooja.ja...@gmail.com wrote: Join is happening successfully as I am able to do count() after the join. Error is coming only while trying to write in

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Raghavendra Pandey
: I originally assumed that persisting is similar to writing. But it's not. Hence I want to change the behavior of intermediate persists. On Wed, Jul 1, 2015 at 8:46 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: So do you want to change the behavior of the persist API or write the RDD

Re: Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Raghavendra Pandey
I am using the Spark driver as a REST service. I used spray.io to make my app a REST server. I think this is a good design for applications that you want to keep in long-running mode... On Jul 1, 2015 6:28 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: You can try using Spark Jobserver

Re: BroadCast Multiple DataFrame ( JDBC Tables )

2015-07-01 Thread Raghavendra Pandey
I am not sure if you can broadcast a data frame without collecting it on the driver... On Jul 1, 2015 11:45 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi, I need to load 10 tables in memory and have them available to all the workers. Please let me know what is the best way to do broadcast

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Raghavendra Pandey
You cannot refer to one RDD inside another RDD's map function... The RDD object is not serializable. Whatever objects you use inside the map function should be serializable, as they get transferred to executor nodes. On Jul 2, 2015 6:13 AM, Ashish Soni asoni.le...@gmail.com wrote: Hi All, I am not sure

Re: Using 'fair' scheduler mode

2015-04-01 Thread Raghavendra Pandey
I am facing the same issue. FAIR and FIFO are behaving in the same way. On Wed, Apr 1, 2015 at 1:49 AM, asadrao as...@microsoft.com wrote: Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the first query is a very expensive query (ex: ‘select *’ on a really big data set)

Re: Executor lost with too many temp files

2015-02-25 Thread Raghavendra Pandey
Can you try increasing the ulimit -n on your machine. On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote: Hi Sameer, I’m still using Spark 1.1.1, I think the default is hash shuffle. No external shuffle service. We are processing gzipped JSON files, the partitions are

Re: Shuffle read/write issue in spark 1.2

2015-02-05 Thread Raghavendra Pandey
Even I observed the same issue. On Fri, Feb 6, 2015 at 12:19 AM, Praveen Garg praveen.g...@guavus.com wrote: Hi, While moving from spark 1.1 to spark 1.2, we are facing an issue where Shuffle read/write has been increased significantly. We also tried running the job by rolling back to

Re: Cheapest way to materialize an RDD?

2015-02-02 Thread Raghavendra Pandey
You can also do something like rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => { while (iter.hasNext) iter.next() }) On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote: Yeah, from an unscientific test, it looks like the time to cache the blocks still dominates. Saving

Re: Is there a way to delete hdfs file/directory using spark API?

2015-01-21 Thread Raghavendra Pandey
You can use the Hadoop client API to remove files: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#delete(org.apache.hadoop.fs.Path, boolean). I don't think Spark has any wrapper on the Hadoop filesystem APIs. On Thu, Jan 22, 2015 at 12:15 PM, LinQili lin_q...@outlook.com
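A minimal sketch of using the Hadoop client API from a Spark job (the path is hypothetical):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(new Path("/tmp/old-output"), true)   // second argument true => recursive delete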

Re: How to get the master URL at runtime inside driver program?

2015-01-19 Thread Raghavendra Pandey
If you pass the Spark master URL to spark-submit, you don't need to pass the same to the SparkConf object. You can create SparkConf without this property, or for that matter without any other property that you pass in spark-submit. On Sun, Jan 18, 2015 at 7:38 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi,
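A short sketch: with --master supplied to spark-submit, the SparkConf needs no setMaster, and the resolved value can still be read from the context at runtime:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyApp")   // no setMaster; spark-submit --master supplies it
    val sc = new SparkContext(conf)

    println(sc.master)                        // the master URL the driver actually connected to
    println(sc.getConf.get("spark.master"))   // the same value via the conf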

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-18 Thread Raghavendra Pandey
If you are running Spark in local mode, executor parameters are not used, as there is no separate executor. You should try to set the corresponding driver parameter to effect it. On Mon, Jan 19, 2015, 00:21 Sean Owen so...@cloudera.com wrote: OK. Are you sure the executor has the memory you think? -Xmx24g in

Re: Save RDD with partition information

2015-01-13 Thread Raghavendra Pandey
I believe the default hash partitioner logic in Spark will send all the same keys to the same machine. On Wed, Jan 14, 2015, 03:03 Puneet Kapoor puneet.cse.i...@gmail.com wrote: Hi, I have a usecase where in I have an hourly spark job which creates hourly RDDs, which are partitioned by keys. At
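A hedged sketch of making that explicit, so equal keys land in the same partition index every hour (rawPairs is a hypothetical RDD[(K, V)]):

    import org.apache.spark.HashPartitioner

    // using the same partitioner (same partition count) each hour keeps a key on the same partition index
    val partitioner = new HashPartitioner(100)
    val hourlyRdd = rawPairs.partitionBy(partitioner)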

Re: Web Service + Spark

2015-01-11 Thread Raghavendra Pandey
You can take a look at http://zeppelin.incubator.apache.org. It is a notebook and graphical visual designer. On Sun, Jan 11, 2015, 01:45 Cui Lin cui@hds.com wrote: Thanks, Gaurav and Corey, Probably I didn’t make myself clear. I am looking for the best Spark practice similar to Shiny for R,

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-11 Thread Raghavendra Pandey
the saveAsParquetFile method to allows me to persist the avro schema inside parquet. Is this possible at all? Best Regards, Jerry On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I cam across this http://zenfractal.com/2013/08/21/a-powerful-big

Re: Cleaning up spark.local.dir automatically

2015-01-09 Thread Raghavendra Pandey
You may like to look at the spark.cleaner.ttl configuration, which is infinite by default. Spark has that configuration to delete temp files from time to time. On Fri Jan 09 2015 at 8:34:10 PM michael.engl...@nomura.com wrote: Hi, Is there a way of automatically cleaning up the spark.local.dir after

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. You can take a look. On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I have the similar kind of requirement where I want to push avro data into parquet. But it seems you have

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I have a similar kind of requirement where I want to push Avro data into Parquet. But it seems you have to do it on your own. There is the parquet-mr project that uses Hadoop to do so. I am trying to write a Spark job to do a similar kind of thing. On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Raghavendra Pandey
You can also use the join function of RDD. This is actually a kind of append function that adds up all the RDDs and creates one uber RDD. On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote: Thank you for the response, sure will try that out. Currently I changed my code such that the first

Re: Spark app performance

2015-01-01 Thread Raghavendra Pandey
of Complex objects and hence it slows down the performance. Here's a link http://spark.apache.org/docs/latest/tuning.html to performance tuning if you haven't seen it already. Thanks Best Regards On Wed, Dec 31, 2014 at 8:45 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I

Re: FlatMapValues

2014-12-31 Thread Raghavendra Pandey
Why don't you push \n instead of \t in your first transformation [ (fields(0),(fields(1)+\t+fields(3)+\t+fields(5)+\t+fields(7)+\t +fields(9)))] and then do saveAsTextFile? -Raghavendra On Wed Dec 31 2014 at 1:42:55 PM Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys My

Spark app performance

2014-12-30 Thread Raghavendra Pandey
I have a Spark app that involves a series of mapPartitions operations and then a keyBy operation. I have measured the time inside the mapPartitions function block. These blocks take trivial time. Still, the application takes way too much time, and even the Spark UI shows that much time. So I was wondering where

Re: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Raghavendra Pandey
It seems there is hadoop 1 somewhere in the path. On Fri, Dec 19, 2014, 21:24 Sean Owen so...@cloudera.com wrote: Yes, but your error indicates that your application is actually using Hadoop 1.x of some kind. Check your dependencies, especially hadoop-client. On Fri, Dec 19, 2014 at 2:11

Re: Spark SQL: How to get the hierarchical element with SQL?

2014-12-08 Thread Raghavendra Pandey
Yeah, the dot notation works. It works even for arrays. But I am not sure if it can handle complex hierarchies. On Mon Dec 08 2014 at 11:55:19 AM Cheng Lian lian.cs@gmail.com wrote: You may access it via something like SELECT filterIp.element FROM tb, just like Hive. Or if you’re using

Re: Locking for shared RDDs

2014-12-08 Thread Raghavendra Pandey
You don't need to worry about locks as such, since one thread/worker is responsible exclusively for one partition of the RDD. You can use the Accumulator variables that Spark provides to get the state updates. On Mon Dec 08 2014 at 8:14:28 PM aditya.athalye adbrihadarany...@gmail.com wrote: I am