The parameter spark.yarn.executor.memoryOverhead

2017-10-30 Thread Ashok Kumar
Hi Gurus, The parameter spark.yarn.executor.memoryOverhead is explained as below: spark.yarn.executor.memoryOverhead executorMemory * 0.10, with minimum of 384 The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads,
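
For reference, a minimal sketch of how the quoted default resolves, with illustrative (not recommended) values:

```scala
import org.apache.spark.SparkConf

// With an 8 GB executor the quoted default is max(8192 * 0.10, 384) = 819 MB.
// To override it explicitly (value in MB; illustrative only):
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "1024")
```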

how does spark handle compressed files

2017-07-19 Thread Ashok Kumar
Hi, How does spark handle compressed files? Are they optimizable in terms of using multiple RDD partitions against the file, or does one need to uncompress them beforehand, say for bz2-type files? thanks
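
For what it's worth, the answer hinges on whether the codec is splittable: bzip2 is, so Spark can read one .bz2 file in parallel without uncompressing it first, while gzip is not and yields a single partition per file. A sketch with hypothetical paths:

```scala
// bzip2 is splittable, so the requested minimum partition count can be honoured.
val bz = sc.textFile("hdfs:///data/input.bz2", 8)
println(bz.partitions.length) // > 1 for a large file

// gzip is not splittable: one partition per file; repartition after reading.
val gz = sc.textFile("hdfs:///data/input.gz").repartition(8)
```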

RDD and DataFrame persistent memory usage

2017-06-25 Thread Ashok Kumar
Gurus, I understand that when we create an RDD in Spark it is immutable. So I have a few points please: - When an RDD is created, is that just a pointer? For most Spark operations it is lazy, not consumed until a collection operation is done that affects the RDD? - When a DF is created from RDD does that
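
A small sketch of the laziness being asked about, assuming a hypothetical input file:

```scala
val rdd = sc.textFile("hdfs:///data/input.txt") // just a lineage pointer; nothing read yet
val upper = rdd.map(_.toUpperCase)              // still lazy: only the plan grows
upper.cache()                                   // marks it for caching; still nothing runs
val n = upper.count()                           // action: the file is read (and cached) now
```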

Re: how many topics spark streaming can handle

2017-06-19 Thread Ashok Kumar
clusters and the type of processing you are trying to do. On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi Gurus, Within one Spark streaming process how many topics can be handled? I have not tried more than one topic. Thanks

how many topics spark streaming can handle

2017-06-19 Thread Ashok Kumar
Hi Gurus, Within one Spark streaming process how many topics can be handled? I have not tried more than one topic. Thanks
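
For reference, the Kafka direct stream API of that era takes a set of topic names, so a single streaming context can consume several topics at once. A sketch against the Kafka 0.8 direct API, with hypothetical broker and topic names and an already-built StreamingContext ssc:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("prices", "trades", "orders") // several topics, one DStream
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
```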

Re: Edge Node in Spark

2017-06-06 Thread Ashok Kumar
working with straight Spark or referring to GraphX? Thank You, Irving Duran On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi, I am a bit confused between Edge node, Edge server and gateway node in Spark. Do these mean the same thing? How does one set up

Edge Node in Spark

2017-06-05 Thread Ashok Kumar
Hi, I am a bit confused between Edge node, Edge server and gateway node in Spark. Do these mean the same thing? How does one set up an Edge node to be used in Spark? Is this different from Edge node for Hadoop please? Thanks

Re: High Availability/DR options for Spark applications

2017-02-05 Thread Ashok Kumar
e-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Feb 5, 2017 at 10:11 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: > Hello, > > What are the practiced High Availability/DR operations for Spark cluster at > the moment. I am specially interested if YARN i

High Availability/DR options for Spark applications

2017-02-05 Thread Ashok Kumar
Hello, What are the practiced High Availability/DR operations for a Spark cluster at the moment? I am especially interested if YARN is used as the resource manager. Thanks

Re: Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Ashok Kumar
You are very kind Sir On Sunday, 30 October 2016, 16:42, Devopam Mittra wrote: +1 Thanks and regards Devopam On 30 Oct 2016 9:37 pm, "Mich Talebzadeh" wrote: Enjoy the festive season. Regards, Dr Mich Talebzadeh LinkedIn  

Re: Design considerations for batch and speed layers

2016-09-30 Thread Ashok Kumar
Can one design a fast pipeline with Kafka, Spark streaming and Hbase  or something similar? On Friday, 30 September 2016, 17:17, Mich Talebzadeh wrote: I have designed this prototype for a risk business. Here I would like to discuss issues with batch

Spark Interview questions

2016-09-14 Thread Ashok Kumar
Hi, As a learner I would appreciate it if you could forward me typical Spark interview questions for Spark/Scala junior roles. I will be very obliged

Re: dstream.foreachRDD iteration

2016-09-07 Thread Ashok Kumar
On 7 September 2016 at 11:39, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi, this is a bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean  DStream.for

dstream.foreachRDD iteration

2016-09-07 Thread Ashok Kumar
Hi, this is a bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get the individual lines in the RDD. Thanks
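
One level of looping is enough: foreachRDD hands over each batch's RDD, and ordinary RDD operations then reach the individual lines. A minimal sketch:

```scala
dstream.foreachRDD { rdd =>
  // this block runs on the driver once per batch interval
  rdd.foreach { line =>
    // this closure runs on the executors, once per record
    println(line)
  }
}
```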

Re: Getting figures from spark streaming

2016-09-07 Thread Ashok Kumar
Any help on this warmly appreciated. On Tuesday, 6 September 2016, 21:31, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote: Hello Gurus, I am creating some figures and feed them into Kafka and then spark streaming. It works OK but I have the following issue. For now as a test I

Getting figures from spark streaming

2016-09-06 Thread Ashok Kumar
Hello Gurus, I am creating some figures and feeding them into Kafka and then spark streaming. It works OK but I have the following issue. For now as a test I sent 5 prices in each batch interval. In the loop code this is what is happening: dstream.foreachRDD { rdd => val x = rdd.count i

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
term in the array, convert it to your desired data type and then use filter. On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar <ashok34...@yahoo.com> wrote: Hi, I want to filter them for values. This is what is in the array: 74,20160905-133143,98.11218069128827594148 I want to filter anything > 50

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
After the first map, you will get an RDD of arrays. What is your expected outcome of the 2nd map? On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Ar

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote: Hi, I have a text file as below that I read in 74,20160905-

Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Hi, I have a text file as below that I read in
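
Pulling the thread's suggestions together: split each line into an array, convert the field of interest to a numeric type, then filter. A sketch assuming the first comma-separated field is the one being compared:

```scala
val textFile = sc.textFile("input.txt")          // hypothetical path
val rows = textFile.map(_.split(","))            // RDD[Array[String]]
val filtered = rows.filter(r => r(0).toInt > 50) // convert first, then compare
filtered.collect().foreach(r => println(r.mkString(",")))
```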

Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ashok Kumar
Hi, What are the practical differences between the new Dataset in Spark 2 and the existing DataFrame? Has Dataset replaced DataFrame, and what advantages does it have if I use a Dataset instead of a DataFrame? Thanks
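
In Spark 2 a DataFrame is kept as an alias for Dataset[Row], so neither replaces the other; a Dataset adds a compile-time-typed view on top. A small sketch, assuming the Spark 2 spark session and a hypothetical Trade schema:

```scala
import spark.implicits._

case class Trade(id: Long, amount: Double)

val df = spark.read.json("trades.json") // DataFrame = Dataset[Row]; hypothetical input
val ds = df.as[Trade]                   // typed Dataset: fields checked at compile time
ds.filter(_.amount > 100.0).show()      // lambda over Trade objects, not untyped Rows
```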

Design patterns involving Spark

2016-08-28 Thread Ashok Kumar
Hi, There are design patterns that use Spark extensively. I am new to this area so I would appreciate it if someone explains where Spark fits in, especially within faster or streaming use cases. What are the best practices involving Spark? Is it always best to deploy it as the processing engine,  For

Spark standalone or Yarn for resourcing

2016-08-17 Thread Ashok Kumar
Hi, for small to medium size clusters I think Spark Standalone mode is a good choice. We are contemplating moving to Yarn as our cluster grows. What are the pros and cons of using either, please? Which one offers the best Thanking you

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi Gurus, I have a few large tables in rdbms (ours is Oracle). We want to access these ta

parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi Gurus, I have a few large tables in an rdbms (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
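
The usual way to get multiple connections is Spark's partitioned JDBC read: supply a numeric partition column plus bounds and Spark issues one query per partition in parallel. A sketch with hypothetical Oracle connection details:

```scala
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // hypothetical URL
  .option("dbtable", "SCOTT.BIG_TABLE")                  // hypothetical table
  .option("partitionColumn", "ID")                       // numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "10")                         // ten parallel connections
  .load()
```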

num-executors, executor-memory and executor-cores parameters

2016-08-04 Thread Ashok Kumar
Hi, I would like to know the exact definition of these three parameters: num-executors, executor-memory and executor-cores, for local, standalone and yarn modes. I have looked at the on-line doc but am not convinced I understand them correctly. Thanking you

Windows operation orderBy desc

2016-08-01 Thread Ashok Kumar
Hi, in the following Window spec I want the orderBy column to be displayed in descending order please val W = Window.partitionBy("col1").orderBy("col2") If I do val W = Window.partitionBy("col1").orderBy("col2".desc) It throws the error <console>:26: error: value desc is not a member of String How can I
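
The error is telling: desc is a method on Column, not on String, so wrap the column name first. A minimal fix:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// col("col2") (or $"col2" in the shell) yields a Column, which does have .desc
val W = Window.partitionBy("col1").orderBy(col("col2").desc)
```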

The main difference use case between orderBY and sort

2016-07-29 Thread Ashok Kumar
Hi, In Spark programing I can use df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").show(5) or df.filter(col("transactiontype") ===

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Ashok Kumar
Thanks Mich looking forward to it :) On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh wrote: Hi all, This will be in London tomorrow Wednesday 20th July starting at 18:00 hour for refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf Station,

Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Any expert advice warmly acknowledged.. thanking yo On Monday, 11 July 2016, 17:24, Ashok Kumar <ashok34...@yahoo.com> wrote: Hi Gurus, Advice appreciated from Hive gurus. My colleague has been using Cassandra. However, he says it is too slow and not user friendly/MongoDB as

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich, Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark": have you made any more interesting findings that you would like to bring up? If Hive is offering both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why TEZ + LLAP

Re: Spark as sql engine on S3

2016-07-08 Thread Ashok Kumar
advantage is that using Spark SQL will be much faster? regards On Friday, 8 July 2016, 6:30, ayan guha <guha.a...@gmail.com> wrote: Yes, it can.  On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar <ashok34...@yahoo.com> wrote: thanks so basically Spark Thrift Server runs on a port mu

Re: Spark as sql engine on S3

2016-07-07 Thread Ashok Kumar
You can connect to it from any jdbc tool like squirrel On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hello gurus, We are storing data externally on Amazon S3 What is the optimum or best way to use Spark as a SQL engine to access data on S3? Any info/write up wil

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-07 Thread Ashok Kumar
Thanks. Will this presentation be recorded as well? Regards On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh wrote: Dear forum members I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London 

Spark as sql engine on S3

2016-07-07 Thread Ashok Kumar
Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write up will be greatly appreciated. Regards

ORC or parquet with Spark

2016-07-04 Thread Ashok Kumar
With Spark caching, which file format is best to use, parquet or ORC? Obviously ORC can be used with Hive. My question is whether Spark can use the various file, stripe and row-group statistics stored in an ORC file? Otherwise to me both parquet and ORC are simply files kept on HDFS. They do not offer any

latest version of Spark to work OK as Hive engine

2016-07-02 Thread Ashok Kumar
Hi, Looking at this presentation, Hive on Spark is Blazing Fast .. Which latest version of Spark can run as an engine for Hive please? Thanks P.S. I am aware of Hive on TEZ but that is not what I am interested in here please. Warmest regards

Re: Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Thank you all sirs Appreciated Mich your clarification. On Sunday, 19 June 2016, 19:31, Mich Talebzadeh wrote: Thanks Jonathan for your points I am aware of the fact yarn-client and yarn-cluster are both depreciated (still work in 1.6.1), hence the new

Re: Running Spark in local mode

2016-06-19 Thread Ashok Kumar
single JVM that has a master and one executor with `k` threads. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94 // maropu On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi,

Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executor and resources? Do I need to start worker threads, and how many apps can I run safely without
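
In local mode there is no separate worker to start: the driver and a single executor share one JVM, and local[k] just sets the number of task threads. A minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One JVM holding the driver and one executor with 4 task threads.
val conf = new SparkConf().setMaster("local[4]").setAppName("local-test")
val sc = new SparkContext(conf)
// local[*] would use one thread per available core instead of a fixed k.
```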

Re: Running Spark in Standalone or local modes

2016-06-12 Thread Ashok Kumar
just Yarn mode or Mesos mode, which means spark uses Yarn or Mesos as the cluster manager. Local mode is actually a standalone mode where everything runs on the single local machine instead of remote clusters. That is my understanding. On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar <ashok34...@ya

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Ashok Kumar
11:38 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi, What is the difference between running Spark in Local mode or standalone mode? Are they the same. If they are not which is best suited for non prod work. I am also aware that one can run Spark in Yarn mode as well. Thanks

Running Spark in Standalone or local modes

2016-06-11 Thread Ashok Kumar
Hi, What is the difference between running Spark in Local mode or standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in Yarn mode as well. Thanks

Fw: Basic question on using one's own classes in the Scala app

2016-06-06 Thread Ashok Kumar
Can anyone help me with this please? On Sunday, 5 June 2016, 11:06, Ashok Kumar <ashok34...@yahoo.com> wrote: Hi all, Appreciate any advice on this. It is about scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own c

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
cala/TwitterAnalyzer/build.sbt#L18-19) [warn] +- scala:scala_2.10:1.0 sbt.ResolveException: unresolved dependency: com.databricks#apps.twitter_classifier;1.0.0: not found Any ideas? regards, On Sunday, 5 June 2016, 22:22, Jacek Laskowski <ja...@japila.pl> wrote: On Sun,

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
For #1, please find examples on the net, e.g. http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html For #2, import . getCheckpointDirectory Cheers On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar <ashok34...@yahoo.com> wrote: Thank you sir. At compile time can I do something similar to libr

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
PSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi all, Appreciate any advice on this. It is about scala I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes a

Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Hi all, Appreciate any advice on this. It is about scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps. class getCheckpointDirectory {  def

processing twitter data

2016-05-31 Thread Ashok Kumar
hi all, I know very little about the subject. We would like to get streaming data from twitter and facebook. So a few questions please: - what format is the data from twitter? is it JSON format - can I use spark and spark streaming for analyzing the data - can data be fed in/streamed via

Does Spark support updates or deletes on underlying Hive tables

2016-05-30 Thread Ashok Kumar
Hi, I can do inserts from Spark on Hive tables. How about updates or deletes? They failed when I tried. Thanking

Using Java in Spark shell

2016-05-24 Thread Ashok Kumar
Hello, A newbie question. Is it possible to use java code directly in spark shell without using maven to build a jar file? How can I switch from scala to java in spark shell? Thanks

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work?  Thanking you On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote: Have a look at this thread Dr Mich

Monitoring Spark application progress

2016-05-16 Thread Ashok Kumar
Hi, I would like to know the approach and tools please to get the full performance picture for a Spark app running through Spark-shell and Spark-submit - Through the Spark GUI at 4040? - Through OS utilities top, SAR - Through Java tools like jbuilder etc - Through integrating Spark with

Spark handling spill overs

2016-05-12 Thread Ashok Kumar
Hi, How can one avoid having Spark spill over after filling the node's memory? Thanks

Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Ashok Kumar
Hi Dr Mich, I will be very keen to have a look at it and review if possible. Please forward me a copy Thanking you warmly On Thursday, 12 May 2016, 11:08, Mich Talebzadeh wrote: Hi Al,, Following the threads in spark forum, I decided to write up on

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
, 10:49, Saisai Shao <sai.sai.s...@gmail.com> wrote: Please see the inline comments. On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote: Thank you. So if I create spark streaming then - The streams will always need to be cached? It cannot be stored in

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
cache the data in memory, from my understanding you don't need to call cache() again. On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote: hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do w

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
hi, so if I have 10gb of streaming data coming in, does it require 10gb of memory in each node? also in that case why do we need to use dstream.cache() thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on you to write the Spark application,

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Ashok Kumar
Thanks Michael. As I gathered, for now it is a feature. On Monday, 25 April 2016, 18:36, Michael Armbrust wrote: When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method.  Spark doesn't have access to this scope, so

Invoking SparkR from Spark shell

2016-04-20 Thread Ashok Kumar
Hi, I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with Spark. Is there a shell similar to spark-shell that supports R besides Scala please? Thanks

Re: Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
y mean replacing the whole of Hadoop?   David   From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] Sent: Thursday, April 14, 2016 2:13 PM To: User Subject: Spark replacing Hadoop   Hi,   I hear that some saying that Hadoop is getting old and out of date and will be replaced by Spark!

Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
Hi, I hear that some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best

Spark GUI, Workers and Executors

2016-04-09 Thread Ashok Kumar
On Spark GUI I can see the list of Workers. I always understood that workers are used by executors. What is the relationship between workers and executors please. Is it one to one? Thanks

Copying all Hive tables from Prod to UAT

2016-04-08 Thread Ashok Kumar
Hi, Does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to copy the table data to external files, then move the external files to a local target directory and populate the tables in the target Hive with the data. Is there an easier way of doing

difference between simple streaming and windows streaming in spark

2016-04-07 Thread Ashok Kumar
Does simple streaming mean continuous streaming, and windowed streaming a streaming time window? val ssc = new StreamingContext(sparkConf, Seconds(10)) thanks
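
Roughly, yes: the batch interval on the StreamingContext drives plain continuous micro-batches, while window() layers a longer sliding view on top of them. A sketch, assuming a lines DStream has already been built:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(10)) // simple streaming: an RDD every 10s
// windowed streaming: each computation sees the last 60s of data, recomputed every 20s
val windowed = lines.window(Seconds(60), Seconds(20))  // both must be multiples of the batch interval
```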

Working out SQRT on a list

2016-04-02 Thread Ashok Kumar
Hi, I would like a simple sqrt operation on a list but I don't get the result scala> val l = List(1,5,786,25) l: List[Int] = List(1, 5, 786, 25) scala> l.map(x => x * x) res42: List[Int] = List(1, 25, 617796, 625) scala> l.map(x => x * x).sqrt <console>:28: error: value sqrt is not a member of List[Int]
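
sqrt is not a List method; it lives in scala.math and has to be mapped over the elements. A minimal fix:

```scala
val l = List(1, 5, 786, 25)
val roots = l.map(x => math.sqrt(x * x)) // apply sqrt per element, not to the List
// roots: List[Double] = List(1.0, 5.0, 786.0, 25.0)
```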

Spark process creating and writing to a Hive ORC table

2016-03-31 Thread Ashok Kumar
Hello, How feasible is it to use Spark to extract csv files and create and write the content to an ORC table in a Hive database? Is a Parquet file the best (optimum) format to write to HDFS from a Spark app? Thanks

Re: Spark and N-tier architecture

2016-03-29 Thread Ashok Kumar
2016 at 22:07, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Experts, One of the terms used and I hear is N-tier architecture within Big Data, used for availability, performance etc. I also hear that Spark by means of its query engine and in-memory caching fits into the middle tier (application

Spark and N-tier architecture

2016-03-29 Thread Ashok Kumar
Experts, One of the terms used and I hear is N-tier architecture within Big Data, used for availability, performance etc. I also hear that Spark by means of its query engine and in-memory caching fits into the middle tier (application layer), with HDFS and Hive maybe providing the data tier.  Can

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Ashok Kumar
Hello Mich, If you can accommodate it, can you please share your approach to steps 1-3 above. Best regards On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote: Pretty simple, as usual it is a combination of ETL and ELT. Basically csv files are loaded into staging

Re: Finding out the time a table was created

2016-03-25 Thread Ashok Kumar
On 25 March 2016 at 22:12, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Experts, I would like to know when a table was created in Hive database using Spark shell? Thanks

Finding out the time a table was created

2016-03-25 Thread Ashok Kumar
Experts, I would like to know when a table was created in Hive database using Spark shell? Thanks

Re: calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
column expressions. The reason is that for maps, we have to actually materialize an object to pass to your function. However, if you stick to column expressions we can actually work directly on serialized data. On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar <ashok34...@yahoo.com> w

Re: calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
.(firstcolumn) in above when mapping if possible so columns will have labels On Thursday, 24 March 2016, 0:18, Michael Armbrust <mich...@databricks.com> wrote: You probably need to use `backticks` to escape `_1` since I don't think that its a valid SQL identifier. On Wed, Mar

calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
Gurus, If I register a temporary table as below r.toDF res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] r.toDF.registerTempTable("items") sql("select * from items") res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4:
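
Following Michael's suggestion in the replies, backticks make _1 a legal identifier; naming the columns when converting avoids the escaping altogether. A sketch with hypothetical column names:

```scala
sql("select `_1`, `_3` from items").show() // backticks escape the underscore names

// Or name the columns up front so no escaping is needed:
val df = r.toDF("item", "category", "price", "qty", "weight")
df.registerTempTable("labelled_items")
sql("select item, price from labelled_items").show()
```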

reading csv file, operation on column or columns

2016-03-20 Thread Ashok Kumar
Gurus, I would like to read a csv file into a DataFrame, but be able to rename a column, change a column type from String to Integer, or drop a column from further analysis before saving the data as a parquet file. Thanks
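
A sketch of that pipeline with the spark-csv package of the time, using hypothetical file and column names:

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("data.csv")

val cleaned = df
  .withColumnRenamed("C0", "trade_id")                // rename a column
  .withColumn("amount", df("amount").cast("integer")) // String -> Integer
  .drop("unused_col")                                 // exclude from further analysis

cleaned.write.parquet("data.parquet")
```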

Setting up spark to run on two nodes

2016-03-19 Thread Ashok Kumar
Experts. Please your valued advice. I have spark 1.5.2 set up as standalone for now and I have started the master as below: start-master.sh I have also modified the conf/slaves file to have # A Spark Worker will be started on each of the machines listed below. localhost workerhost On the localhost

shuffle in spark

2016-03-14 Thread Ashok Kumar
Experts, please, I need to understand how shuffling works in Spark and which parameters influence it. I am sorry but my knowledge of shuffling is very limited. I need a practical use case if you can. regards

Spark configuration with 5 nodes

2016-03-10 Thread Ashok Kumar
Hi, We intend to use 5 servers which will be utilized for building a Bigdata Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others). All server configurations are 512GB RAM, 30TB storage and 16 cores, Ubuntu Linux servers. Hadoop will be

HBASE

2016-03-09 Thread Ashok Kumar
Hi Gurus, I am relatively new to Big Data and know something about Spark and Hive. I was wondering whether I need to pick up skills on HBase as well. I am not sure how it works but know that it is a kind of columnar NoSQL database. I know it is good to know something new in the Big Data space. Just wondering if

Re: Converting array to DF

2016-03-01 Thread Ashok Kumar
On Tuesday, 1 March 2016, 20:52, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote: For Array, you need to call `toSeq` at first. Scala can convert Array to ArrayOps automatically. However, it's not a `Seq` and you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok

Converting array to DF

2016-03-01 Thread Ashok Kumar
Hi, I have this val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to a DF but I get this: weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value
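
As the reply explains, Array is not a Seq, so the toDF implicit does not apply until toSeq is called (with the shell's import sqlContext.implicits._ in scope). A minimal fix:

```scala
import sqlContext.implicits._

val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6))
val df = weights.toSeq.toDF("weights", "value") // toSeq makes the toDF implicit applicable
df.show()
```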

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread Ashok Kumar
Thank you all for valuable advice. Much appreciated Best On Sunday, 28 February 2016, 21:48, Ashok Kumar <ashok34...@yahoo.com> wrote:   Hi Gurus, Appreciate if you recommend me a good book on Spark or documentation for beginner to moderate knowledge I very much like to skill

Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ashok Kumar
Hi Gurus, I would appreciate it if you recommend me a good book on Spark or documentation for beginner to moderate knowledge. I very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on the net. However, some of them are not clear, at least to me. Warmest

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
no particular reason. just wanted to know if there was another way as well. thanks On Saturday, 27 February 2016, 22:12, Yin Yang <yy201...@gmail.com> wrote: Is there particular reason you cannot use temporary table ? Thanks On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar &l

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
;a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test order by b")df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show++|struct(id, b, a)|+--------+|       [2,foo,a]||      

Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
Hello, I like to be able to solve this using arrays. I have two dimensional array of (String,Int) with 5  entries say arr("A",20), arr("B",13), arr("C", 18), arr("D",10), arr("E",19) I like to write a small code to order these in the order of highest Int column so I will have arr("A",20),
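
Without a temporary table this is plain Scala: sortBy on the second tuple element, negated for descending order. A minimal sketch:

```scala
val arr = Array(("A", 20), ("B", 13), ("C", 18), ("D", 10), ("E", 19))
val byIntDesc = arr.sortBy(-_._2) // descending on the Int element
// byIntDesc: Array[(String, Int)] = Array((A,20), (E,19), (C,18), (B,13), (D,10))
```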

Clarification on RDD

2016-02-26 Thread Ashok Kumar
Hi, Spark doco says Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs example: val textFile = sc.textFile("README.md") my question is

d.filter("id in max(id)")

2016-02-25 Thread Ashok Kumar
Hi, How can I make that work? val d = HiveContext.table("table") select * from table where ID = MAX(ID) from table Thanks

select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Ashok Kumar
Hi, What is the equivalent of this in Spark please select * from mytable where column1 in (select max(column1) from mytable) Thanks
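
Spark SQL of that era did not accept IN subqueries in WHERE, but the same result follows from computing the max first and filtering on it. A sketch with a hypothetical table:

```scala
import org.apache.spark.sql.functions.max

val d = sqlContext.table("mytable")
val maxVal = d.agg(max("column1")).first().get(0) // evaluate the "subquery" first
val rows = d.filter(d("column1") === maxVal)      // select * ... where column1 = max(column1)
```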

Filter on a column having multiple values

2016-02-24 Thread Ashok Kumar
Hi, I would like to do the following: select count(*) from where column1 in (1,5) I define scala> var t = HiveContext.table("table") This works: t.filter($"column1" === 1) How can I expand this to have column1 for both 1 and 5 please? thanks
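
The Column API has isin for exactly this; a minimal extension of the working single-value filter:

```scala
t.filter($"column1".isin(1, 5)).count() // matches column1 = 1 or 5
// equivalent to: t.filter($"column1" === 1 || $"column1" === 5).count()
```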

Re: Execution plan in spark

2016-02-24 Thread Ashok Kumar
path '/tmp/partitioned'    )""")     table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards

Execution plan in spark

2016-02-24 Thread Ashok Kumar
Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards
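
Yes: explain is available on any DataFrame, and the verbose flag also prints the logical plans. A minimal sketch with a hypothetical table and column:

```scala
val df = sqlContext.table("mytable")
df.filter(df("id") > 10).explain(true) // parsed, analyzed and optimized logical plans + physical plan
df.filter(df("id") > 10).explain()     // physical plan only
```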

Re: install databricks csv package for spark

2016-02-19 Thread Ashok Kumar
--jars syntax. You might find http://spark.apache.org/docs/latest/submitting-applications.html useful. On Fri, Feb 19, 2016 at 7:26 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: Hi, I downloaded the zipped csv libraries from databricks/spark-csv

install databricks csv package for spark

2016-02-19 Thread Ashok Kumar
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (spark-csv: CSV data source for Spark SQL and DataFrames, on github.com). Now I have a directory created called

Re: How to get the code for class in spark

2016-02-19 Thread Ashok Kumar
Hi, class body thanks On Friday, 19 February 2016, 11:23, Ted Yu <yuzhih...@gmail.com> wrote: Can you clarify your question ? Did you mean the body of your class ? On Feb 19, 2016, at 4:43 AM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote: Hi, If I define a class in Sca

How to get the code for class in spark

2016-02-19 Thread Ashok Kumar
Hi, If I define a class in Scala like case class(col1: String, col2:Int,...) and it is created how would I be able to see its description anytime Thanks

Use case for RDD and Data Frame

2016-02-16 Thread Ashok Kumar
Gurus, What are the main differences between a Resilient Distributed Dataset (RDD) and a Data Frame (DF)? Where can one use an RDD without transforming it to a DF? Regards and obliged

How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ashok Kumar
Gurus, I am trying to run some examples given under directory examples spark/examples/src/main/scala/org/apache/spark/examples/ I am trying to run HdfsTest.scala However, when I run HdfsTest.scala  against spark shell it comes back with error Spark context available as sc. SQL context available
