The parameter spark.yarn.executor.memoryOverhead
Hi Gurus, The parameter spark.yarn.executor.memoryOverhead is documented as follows: default executorMemory * 0.10, with a minimum of 384. It is the amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc., and it tends to grow with the executor size (typically 6-10%). So does that mean that for a 10GB executor this should ideally be set to ~10% = 1GB? What would happen if we set it higher, say to 30% ~ 3GB? What is this memory exactly used for (as opposed to the memory allocated to the executor itself)? Thanking you
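For reference, a minimal sketch of setting the overhead explicitly (the figures are only illustrative; spark.yarn.executor.memoryOverhead is the property name used on YARN in the Spark releases this thread refers to):

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical sizing: a 10 GB executor heap with ~10% (1024 MB) off-heap overhead
val conf = new SparkConf()
  .setAppName("OverheadExample")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "1024")   // value is in MB

val sc = new SparkContext(conf)

The same value can be passed on the command line with --conf spark.yarn.executor.memoryOverhead=1024 at spark-submit time.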
how does spark handle compressed files
Hi, How does Spark handle compressed files? Can they be split and processed in parallel when building an RDD against the file, or does one need to uncompress them beforehand, say for bz2-type files? thanks
RDD and DataFrame persistent memory usage
Gurus, I understand that when we create an RDD in Spark it is immutable. So I have a few points please: - When an RDD is created, is that just a pointer? Most Spark operations are lazy, so nothing is consumed until an action (a collect-type operation) is run against the RDD? - When a DF is created from an RDD, does that result in additional memory for the DF? Again, does an action affect both the RDD and the DF built from that RDD? - There are some references saying that as you build operations and create new DFs, one consumes more and more memory without releasing it back? - What will happen if I do df.unpersist? My understanding is that it moves the DF out of memory (cache). Will that reduce memory overhead? - Is it a good idea to unpersist to reduce memory overhead? Thanking you
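For illustration, a minimal sketch of caching and unpersisting (assuming the spark-shell's sc and sqlContext; the data is made up). Worth noting: unpersist() removes the cached blocks altogether rather than moving them to disk; it is persist(StorageLevel.MEMORY_AND_DISK) that allows spilling to disk when memory runs short.

import sqlContext.implicits._
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).map(i => (i, s"name_$i"))   // lazy: nothing computed yet
val df  = rdd.toDF("id", "name")                                   // still lazy, no extra copy yet

df.persist(StorageLevel.MEMORY_ONLY)   // only marks the DF for caching
df.count()                             // the first action materialises and caches it
df.unpersist()                         // drops the cached blocks and frees that memory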
Re: how many topics spark streaming can handle
thank you, in the following example:

val topics = "test1,test2,test3"
val brokers = "localhost:9092"
val topicsSet = topics.split(",").toSet
val sparkConf = new SparkConf().setAppName("KafkaDroneCalc").setMaster("local") // spark://localhost:7077
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(30))
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

is it possible to have three topics, or many more? On Monday, 19 June 2017, 20:10, Michael Armbrust wrote: I don't think that there is really a Spark-specific limit here. It would be a function of the size of your Spark/Kafka clusters and the type of processing you are trying to do. On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar wrote: Hi Gurus, Within one Spark streaming process how many topics can be handled? I have not tried more than one topic. Thanks
how many topics spark streaming can handle
Hi Gurus, Within one Spark streaming process how many topics can be handled? I have not tried more than one topic. Thanks
Re: Edge Node in Spark
Just straight Spark please. Also, if I run a Spark job using Python or Scala on YARN, where are the log files kept on the edge node? Are these under the logs directory for YARN? thanks On Tuesday, 6 June 2017, 14:11, Irving Duran wrote: Ashok, are you working with straight Spark or referring to GraphX? Thank You, Irving Duran On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar wrote: Hi, I am a bit confused between edge node, edge server and gateway node in Spark. Do these mean the same thing? How does one set up an edge node to be used in Spark? Is this different from an edge node for Hadoop please? Thanks
Edge Node in Spark
Hi, I am a bit confused between edge node, edge server and gateway node in Spark. Do these mean the same thing? How does one set up an edge node to be used in Spark? Is this different from an edge node for Hadoop please? Thanks
Re: High Availability/DR options for Spark applications
Hi, High Availability means that the system, including Spark, will carry on with minimal disruption in case of an active component failure. DR or disaster recovery means total fail-over to another location with its own nodes, HDFS and Spark cluster. Thanks On Sunday, 5 February 2017, 20:15, Jacek Laskowski wrote: Hi, I'm not very familiar with "High Availability/DR operations". Could you explain what it is? My very limited understanding of the phrase allows me to think that with YARN and cluster deploy mode you have failure recovery for free, so when your driver dies YARN will attempt to resurrect it a few times. The other "components", i.e. map shuffle stages, partitions/tasks, are handled by Spark itself. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Feb 5, 2017 at 10:11 AM, Ashok Kumar wrote: > Hello, > > What are the practised High Availability/DR operations for a Spark cluster at > the moment? I am especially interested if YARN is used as the resource > manager. > > Thanks
High Availability/DR options for Spark applications
Hello, What are the commonly practised High Availability/DR operations for a Spark cluster at the moment? I am especially interested in the case where YARN is used as the resource manager. Thanks
Re: Happy Diwali to those forum members who celebrate this great festival
You are very kind Sir On Sunday, 30 October 2016, 16:42, Devopam Mittra wrote: +1 Thanks and regards Devopam On 30 Oct 2016 9:37 pm, "Mich Talebzadeh" wrote: Enjoy the festive season. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Pros and cons of using different persistence layers for Spark
What are the pros and cons of using different persistence layers for Spark, such as S3, Cassandra, and HDFS? Thanks
Re: Design considerations for batch and speed layers
Can one design a fast pipeline with Kafka, Spark Streaming and HBase or something similar? On Friday, 30 September 2016, 17:17, Mich Talebzadeh wrote: I have designed this prototype for a risk business. Here I would like to discuss issues with the batch layer. Apologies about being long winded. Business objective: Reduce risk in the credit business while making better credit and trading decisions. Specifically, to identify risk trends within certain years of trading data. For example, measure the risk exposure in a given portfolio by industry, region, credit rating and other parameters. At the macroscopic level, analyze data across market sectors, over a given time horizon, to assess risk changes. Deliverable: Enable real-time and batch analysis of risk data. Batch technology stack used: Kafka -> ZooKeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query tool, Zeppelin. Test volumes for POC: 1 message queue (csv format), 100 stock prices streaming in every 2 seconds, 180K prices per hour, 4 million+ per day. - Prices to Kafka -> ZooKeeper -> Flume -> HDFS - HDFS daily partition for that day's data - Hive external table looking at the HDFS partitioned location - Hive managed table populated every 15 minutes via cron from the Hive external table (table type ORC, partitioned by date). This is purely a Hive job. The Hive table is populated using insert/overwrite for that day to avoid boundary values/missing data etc. - Typical batch ingestion time (Hive table populated from HDFS files) ~ 2 minutes - Data in the Hive table has 15 minutes latency - Zeppelin to be used as the UI with Spark. Zeppelin will use Spark SQL (on the Spark Thrift Server) and the Spark shell. Within the Spark shell, users can access batch tables in Hive, or they have a choice of accessing the raw data on HDFS files, which gives them real-time access (not to be confused with the speed layer). Using a typical query with Spark, to see the last 15 minutes of real-time data (T-15 to now) takes 1 minute. Running the same query (my typical query, not a user query) on the Hive tables, this time using Spark, takes 6 seconds. However, there are some design concerns: - Zeppelin starts slowing down by the end of the day. Sometimes it throws a broken pipe message. I resolve this by restarting the Zeppelin daemon. Potential show stopper. - As the volume of data increases throughout the day, performance becomes an issue. - Every 15 minutes when the cron starts, Hive insert/overwrites can potentially get in conflict with users throwing queries from Zeppelin/Spark. I am sure that with exclusive writes, Hive will block all users from accessing these tables (at partition level) until the insert overwrite is done. This can be improved by better partitioning of the Hive tables, or by relaxing the ingestion interval to half an hour or one hour at the cost of more lag. I tried Parquet tables in Hive but saw really no difference in performance. I have thought of replacing Hive with HBase etc., but that brings new complications in as well without necessarily solving the issue. - I am not convinced this design can scale up easily with 5 times more volume of data. - We will also get real-time data from RDBMS tables (Oracle, Sybase, MSSQL) using replication technologies such as SAP Replication Server. These currently deliver changed log data to Hive tables, so there is some compatibility issue here. So I am sure some members can add useful ideas :) Thanks Mich LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
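To make the batch-layer access path concrete, a minimal sketch of the kind of Spark query described above (the database, table and column names are invented; a Spark 1.6-style HiveContext is assumed):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc from spark-shell / Zeppelin

// hypothetical ORC-backed Hive table, partitioned by trade_date
val todaysRisk = hiveContext.sql(
  """SELECT ticker, AVG(price) AS avg_price
    |FROM marketdata.prices
    |WHERE trade_date = '2016-09-30'
    |GROUP BY ticker""".stripMargin)

todaysRisk.show(10)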
Spark Interview questions
Hi, As a learner, I would appreciate it if you could forward me typical Spark interview questions for junior Spark/Scala roles. I will be very obliged
Re: dstream.foreachRDD iteration
I have checked that doc sir. My understanding is that every batch interval of data always generates one RDD, so why do we need to use foreachRDD when there is only one? Sorry for this question but it is a bit confusing to me. Thanks On Wednesday, 7 September 2016, 18:05, Mich Talebzadeh wrote: Hi, What is so confusing about RDD? Have you checked this doc? http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 7 September 2016 at 11:39, Ashok Kumar wrote: Hi, A bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get the individual lines in the RDD. Thanks
dstream.foreachRDD iteration
Hi, A bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get the individual lines in the RDD. Thanks
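For what it is worth, a minimal sketch of the usual pattern (lines is an assumed DStream[String]; foreachRDD runs once per batch interval, and the inner loop is over the records inside that batch's single RDD):

lines.foreachRDD { rdd =>
  // one RDD per batch interval
  rdd.foreach { line =>
    // one record (line) at a time within that RDD;
    // note this runs on the executors, so println goes to the executor logs
    println(line)
  }
}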
Re: Getting figures from spark streaming
Any help on this warmly appreciated. On Tuesday, 6 September 2016, 21:31, Ashok Kumar wrote: Hello Gurus, I am creating some figures, feeding them into Kafka and then Spark Streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval. In the loop code this is what is happening:

dstream.foreachRDD { rdd =>
  val x = rdd.count
  i += 1
  println(s"> rdd loop i is ${i}, number of lines is ${x} <==")
  if (x > 0) {
    println(s"processing ${x} records=")
    var words1 = rdd.map(_._2).map(_.split(',').view(0)).map(_.toInt).collect.apply(0)
    println(words1)
    var words2 = rdd.map(_._2).map(_.split(',').view(1)).map(_.toString).collect.apply(0)
    println(words2)
    var price = rdd.map(_._2).map(_.split(',').view(2)).map(_.toFloat).collect.apply(0)
    println(price)
    rdd.collect.foreach(println)
  }
}

My tuple looks like this: // (null, "ID TIMESTAMP PRICE") // (null, "40,20160426-080924, 67.55738301621814598514") And this is the sample output from the run:

processing 5 records=
3
20160906-212509
80.224686
(null,3,20160906-212509,80.22468448052631637099)
(null,1,20160906-212509,60.40695324215582386153)
(null,4,20160906-212509,61.95159400693415572125)
(null,2,20160906-212509,93.05912099305473237788)
(null,5,20160906-212509,81.08637370113427387121)

Now it does process the first values 3, 20160906-212509, 80.224686 for record (null,3,20160906-212509,80.22468448052631637099) but ignores the rest of the 4 records. How can I make it go through all records here? I want the third column from all records! Greetings
Getting figures from spark streaming
Hello Gurus, I am creating some figures, feeding them into Kafka and then Spark Streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval. In the loop code this is what is happening:

dstream.foreachRDD { rdd =>
  val x = rdd.count
  i += 1
  println(s"> rdd loop i is ${i}, number of lines is ${x} <==")
  if (x > 0) {
    println(s"processing ${x} records=")
    var words1 = rdd.map(_._2).map(_.split(',').view(0)).map(_.toInt).collect.apply(0)
    println(words1)
    var words2 = rdd.map(_._2).map(_.split(',').view(1)).map(_.toString).collect.apply(0)
    println(words2)
    var price = rdd.map(_._2).map(_.split(',').view(2)).map(_.toFloat).collect.apply(0)
    println(price)
    rdd.collect.foreach(println)
  }
}

My tuple looks like this: // (null, "ID TIMESTAMP PRICE") // (null, "40,20160426-080924, 67.55738301621814598514") And this is the sample output from the run:

processing 5 records=
3
20160906-212509
80.224686
(null,3,20160906-212509,80.22468448052631637099)
(null,1,20160906-212509,60.40695324215582386153)
(null,4,20160906-212509,61.95159400693415572125)
(null,2,20160906-212509,93.05912099305473237788)
(null,5,20160906-212509,81.08637370113427387121)

Now it does process the first values 3, 20160906-212509, 80.224686 for record (null,3,20160906-212509,80.22468448052631637099) but ignores the rest of the 4 records. How can I make it go through all records here? I want the third column from all records! Greetings
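A possible fix, as a rough sketch: .collect.apply(0) only ever takes the first element of the collected array, which is why the other four records are dropped. Mapping over the whole RDD keeps every record (the same dstream of (key, "id,timestamp,price") tuples is assumed):

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // keep the third column (price) of every record in this batch
    val prices = rdd.map(_._2)                         // the CSV payload
                    .map(_.split(',')(2).trim.toFloat) // third field as a number
    prices.collect().foreach(println)                  // all records, not just the first
  }
}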
Re: Splitting columns from a text file
Thanks everyone. I am not skilled like you gentlemen. This is what I did: 1) Read the text file val textFile = sc.textFile("/tmp/myfile.txt") 2) That produces an RDD of String. 3) Create a DF after splitting the file into an Array val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt, x(1).toString, x(2).toDouble)).toDF 4) Create a class for column headers case class Columns(col1: Int, col2: String, col3: Double) 5) Assign the column headers val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble)) 6) Only interested in column 3 > 50 h.filter(col("Col3") > 50.0) 7) Now I just want col3 only h.filter(col("Col3") > 50.0).select("col3").show(5)

+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows

Does that make sense? Are there shorter ways, gurus? Can I just do all this on the RDD without a DF? Thanking you On Monday, 5 September 2016, 15:19, ayan guha wrote: Then, you need to refer to the third term in the array, convert it to your desired data type and then use filter. On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar wrote: Hi, I want to filter them for values. This is what is in the array: 74,20160905-133143,98.11218069128827594148 I want to filter anything > 50.0 in the third column Thanks On Monday, 5 September 2016, 15:07, ayan guha wrote: Hi, x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome of the 2nd map? On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote: Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I work on individual columns? I understand they are strings scala> textFile.map(x => x.split(",")).map(x => (x.getString(0)) | ) :27: error: value getString is not a member of Array[String] textFile.map(x => x.split(",")).map(x => (x.getString(0)) regards On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking -- Best Regards, Ayan Guha -- Best Regards, Ayan Guha
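On the last question, a rough sketch of the same thing on the RDD alone, without a DataFrame (same /tmp/myfile.txt layout assumed):

val textFile = sc.textFile("/tmp/myfile.txt")

// split each line, keep rows whose third column is > 50.0, then project that column
val col3 = textFile.map(_.split(","))
                   .filter(_(2).toDouble > 50.0)
                   .map(_(2).toDouble)

col3.take(5).foreach(println)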
Re: Splitting columns from a text file
Hi, I want to filter them for values. This is what is in the array: 74,20160905-133143,98.11218069128827594148 I want to filter anything > 50.0 in the third column Thanks On Monday, 5 September 2016, 15:07, ayan guha wrote: Hi, x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome of the 2nd map? On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote: Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I work on individual columns? I understand they are strings scala> textFile.map(x => x.split(",")).map(x => (x.getString(0)) | ) :27: error: value getString is not a member of Array[String] textFile.map(x => x.split(",")).map(x => (x.getString(0)) regards On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking -- Best Regards, Ayan Guha
Re: Splitting columns from a text file
Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I work on individual columns? I understand they are strings scala> textFile.map(x => x.split(",")).map(x => (x.getString(0)) | ) :27: error: value getString is not a member of Array[String] textFile.map(x => x.split(",")).map(x => (x.getString(0)) regards On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking
Splitting columns from a text file
Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking
Difference between Data set and Data Frame in Spark 2
Hi, What are the practical differences between the new Dataset in Spark 2 and the existing DataFrame? Has Dataset replaced DataFrame, and what advantages does it have if I use Dataset instead of DataFrame? Thanks
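For illustration, a minimal sketch of the distinction (assuming a Spark 2.x session such as the spark-shell's spark; the Person case class is made up). A DataFrame is an untyped Dataset[Row], while a Dataset[Person] keeps the element type and lets you use ordinary lambdas that are checked at compile time:

case class Person(name: String, age: Int)

import spark.implicits._

val df = Seq(Person("Alice", 30), Person("Bob", 25)).toDF()   // DataFrame = Dataset[Row]
val ds = df.as[Person]                                        // typed Dataset[Person]

df.filter($"age" > 26).show()   // untyped column expression, checked only at runtime
ds.filter(_.age > 26).show()    // typed lambda, checked at compile time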
Design patterns involving Spark
Hi, There are design patterns that use Spark extensively. I am new to this area, so I would appreciate it if someone could explain where Spark fits in, especially within the fast or streaming use case. What are the best practices involving Spark? Is it always best to deploy it as the processing engine? For example, when we have a pattern Input Data -> Data in Motion -> Processing -> Storage, where does Spark best fit in? Thanking you
Spark standalone or Yarn for resourcing
Hi, For small to medium size clusters I think Spark standalone mode is a good choice. We are contemplating moving to YARN as our cluster grows. What are the pros and cons of using either, please? Which one offers the best resource management? Thanking you
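For reference, the choice mostly shows up in the master URL handed to the application; a minimal sketch (host names are placeholders, and on older releases "yarn-client"/"yarn-cluster" take the place of "yarn" plus a deploy mode):

import org.apache.spark.SparkConf

// Standalone: the cluster manager shipped with Spark; the master address is given explicitly.
val standaloneConf = new SparkConf()
  .setAppName("MyJob")
  .setMaster("spark://master-host:7077")

// YARN: the ResourceManager address is picked up from the Hadoop configuration,
// so only "yarn" is specified.
val yarnConf = new SparkConf()
  .setAppName("MyJob")
  .setMaster("yarn")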
Re: parallel processing with JDBC
Thank you very much sir. I forgot to mention that two of these Oracle tables are range partitioned. In that case what would be the optimum number of partitions, if you can share? Warmest On Sunday, 14 August 2016, 21:37, Mich Talebzadeh wrote: If you have primary keys on these tables then you can parallelise the process of reading the data. You have to be careful not to set the number of partitions too high. Certainly there is a balance between the number of partitions supplied to JDBC and the load on the network and the source DB. Assuming that your underlying table has primary key ID, then this will create 20 parallel processes to the Oracle DB:

val d = HiveContext.read.format("jdbc").options(
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT , , FROM )",
      "partitionColumn" -> "ID",
      "lowerBound" -> "1",
      "upperBound" -> "maxID",
      "numPartitions" -> "20",
      "user" -> _username,
      "password" -> _password)).load

assuming your upper bound on ID is maxID. This will open multiple connections to the RDBMS, each getting a subset of the data that you want. You need to test it to ensure that you get the optimum numPartitions and you don't overload any component. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 14 August 2016 at 21:15, Ashok Kumar wrote: Hi, There are 4 tables ranging from 10 million to 100 million rows but they all have primary keys. The network is fine but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only). We have a limited time window of a few hours to get the required data out. Thanks On Sunday, 14 August 2016, 21:07, Mich Talebzadeh wrote: How big are your tables and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the issues? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 14 August 2016 at 20:50, Ashok Kumar wrote: Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
Re: parallel processing with JDBC
Hi, There are 4 tables ranging from 10 million to 100 million rows but they all have primary keys. The network is fine but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only). We have a limited time window of a few hours to get the required data out. Thanks On Sunday, 14 August 2016, 21:07, Mich Talebzadeh wrote: How big are your tables and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the issues? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 14 August 2016 at 20:50, Ashok Kumar wrote: Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
parallel processing with JDBC
Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
num-executors, executor-memory and executor-cores parameters
Hi, I would like to know the exact definition of these three parameters: num-executors, executor-memory and executor-cores, for local, standalone and YARN modes. I have looked at the online doc but I am not convinced I understand them correctly. Thanking you
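For context, those spark-submit flags map onto ordinary configuration properties, so a minimal sketch of setting the same values programmatically would be as below (the numbers are only illustrative; num-executors / spark.executor.instances only applies on YARN, not in local or standalone mode):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ResourceExample")
  .set("spark.executor.instances", "4")   // --num-executors: number of executors (YARN)
  .set("spark.executor.memory", "4g")     // --executor-memory: heap size per executor JVM
  .set("spark.executor.cores", "2")       // --executor-cores: concurrent task slots per executor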
Windows operation orderBy desc
Hi, In the following Window spec I want the orderBy column to be sorted in descending order please: val W = Window.partitionBy("col1").orderBy("col2") If I do val W = Window.partitionBy("col1").orderBy("col2".desc) it throws an error: console>:26: error: value desc is not a member of String How can I achieve that? Thanking you
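In case it helps, a small sketch of the usual way around this: .desc is defined on a Column, not on a String, so wrap the column name in col() (or use the desc() function):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc}

val W  = Window.partitionBy("col1").orderBy(col("col2").desc)
// equivalently
val W2 = Window.partitionBy("col1").orderBy(desc("col2"))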
The main difference use case between orderBY and sort
Hi, In Spark programming I can use

df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").show(5)

or

df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).sort("transactiondate").show(5)

I get the same results, and I can use both as well:

df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").sort("transactiondate").show(5)

but the last one takes more time. What is the use case for each of these please? Does it make sense to use both? Thanks
Re: Presentation in London: Running Spark on Hive or Hive on Spark
Thanks Mich, looking forward to it :) On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh wrote: Hi all, This will be in London tomorrow, Wednesday 20th July, starting at 18:00 for refreshments and kick-off at 18:30, 5 minutes' walk from Canary Wharf Station, Jubilee Line. If you wish you can register and get more info here. It will be in La Tasca, West India Docks Road, E14, and especially if you like Spanish food :) Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 15 July 2016 at 11:06, Joaquin Alzola wrote: It is on the 20th (Wednesday) next week. From: Marco Mistroni [mailto:mmistr...@gmail.com] Sent: 15 July 2016 11:04 To: Mich Talebzadeh Cc: user @spark ; user Subject: Re: Presentation in London: Running Spark on Hive or Hive on Spark Dr Mich, do you have any slides or videos available for the presentation you did at Canary Wharf? Kindest regards, Marco On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details: Organized by Hortonworks. Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM. Place: London. Location: One Canada Square, Canary Wharf, London E14 5AB. Nearest Underground: Canary Wharf (map). If you are interested please register here. Looking forward to seeing those who can make it to have an interesting discussion and leverage your experience. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Re: Fast database with writes per second and horizontal scaling
Anyone in Spark as well? My colleague has been using Cassandra; however, he says it is too slow and not user friendly. MongoDB as a document database is pretty neat but not fast enough. My main concern is fast writes per second and good scaling. Hive on Spark or Tez? How about HBase, or anything else? Any expert advice warmly acknowledged. thanking you On Monday, 11 July 2016, 17:24, Ashok Kumar wrote: Hi Gurus, Advice appreciated from Hive gurus. My colleague has been using Cassandra; however, he says it is too slow and not user friendly. MongoDB as a document database is pretty neat but not fast enough. My main concern is fast writes per second and good scaling. Hive on Spark or Tez? How about HBase, or anything else? Any expert advice warmly acknowledged. thanking you
Re: Using Spark on Hive with Hive also using Spark as its execution engine
Hi Mich, Your recent presentation in London was on this topic, "Running Spark on Hive or Hive on Spark". Have you made any more interesting findings that you would like to bring up? If Hive is offering both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why TEZ + LLAP is going to be a better choice from what you mentioned? thanking you On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh wrote: A couple of points if I may, and kindly bear with my remarks, whilst it will be very interesting to try TEZ with LLAP. As I read, LLAP is described as follows: "Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online). LLAP is an optional daemon process running on multiple nodes, that provides the following: - Caching and data reuse across queries with compressed columnar data in-memory (off-heap) - Multi-threaded execution including reads with predicate pushdown and hash joins - High throughput IO using Async IO Elevator with dedicated thread and core per disk - Granular column level security across applications" OK, so we have added an in-memory capability to TEZ by way of LLAP; in other words, what Spark does already, and BTW it does not require a daemon running on any host. Don't get me wrong, it is interesting, but this sounds to me (without testing it myself) like adding a caching capability to TEZ to bring it on par with Spark. Remember: Spark -> DAG + in-memory caching; TEZ = MR on DAG; TEZ + LLAP => DAG + in-memory caching. OK, it is another way of getting the same result. However, my concerns: - Spark has a wide user base. I judge this from the Spark user group traffic - The TEZ user group has no traffic I am afraid - LLAP I don't know Sounds like Hortonworks promote TEZ and Cloudera does not want to know anything about Hive, and they promote Impala, but that sounds like a sinking ship these days. Having said that, I will try TEZ + LLAP :) No pun intended Regards Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 31 May 2016 at 08:19, Jörn Franke wrote: Thanks, very interesting explanation. Looking forward to testing it. > On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote: > > >> That being said all systems are evolving. Hive supports tez+llap which >> is basically the in-memory support. > > There is a big difference between where LLAP & SparkSQL, which has to do > with access pattern needs. > > The first one is related to the lifetime of the cache - the Spark RDD > cache is per-user-session which allows for further operation in that > session to be optimized. > > LLAP is designed to be hammered by multiple user sessions running > different queries, designed to automate the cache eviction & selection > process. There's no user visible explicit .cache() to remember - it's > automatic and concurrent. > > My team works with both engines, trying to improve it for ORC, but the > goals of both are different. > > I will probably have to write a proper academic paper & get it > edited/reviewed instead of sending my ramblings to the user lists like this. > Still, this needs an example to talk about.
> > To give a qualified example, let's leave the world of single use clusters > and take the use-case detailed here > > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/ > > > There are two distinct problems there - one is that a single day sees up to > 100k independent user sessions running queries and that most queries cover > the last hour (& possibly join/compare against a similar hour aggregate > from the past). > > The problem with having independent 100k user-sessions from different > connections was that the SparkSQL layer drops the RDD lineage & cache > whenever a user ends a session. > > The scale problem in general for Impala was that even though the data size > was in multiple terabytes, the actual hot data was approx <20Gb, which > resides on <10 machines with locality. > > The same problem applies when you apply RDD caching with something > un-replicated like Tachyon/Alluxio, since the same RDD will be exceedingly > popular, such that the machines which hold those blocks run extra hot. > > A cache model per-user session is entirely wasteful and a common cache + > MPP model effectively overloads 2-3% of the cluster, while leaving the other > machines idle. > > LLAP was designed specifically to prevent that hotspotting, while > maintaining the common cache model - within a few minutes after an hour > ticks over, the whole cluster develops temporal popularity for the hot > data and nearly every rack has at least one cached copy of the same data > for availability/performance. > > Since data streams tend to be extr
Re: Spark as sql engine on S3
Hi, As I said we have been using Hive as our SQL engine for the datasets, but we are storing the data externally in Amazon S3. Now you suggested the Spark Thrift Server. I started the Spark Thrift Server on port 10001 and I have used beeline, which accesses the thrift server:

Connecting to jdbc:hive2://<host>:10001
Connected to: Spark SQL (version 1.6.1)
Driver: Spark Project Core (version 1.6.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.6.1 by Apache Hive

Now can I just access my external tables on S3 as I do on Hive, with beeline connected to the Hive thrift server? And the advantage is that using Spark SQL will be much faster? regards On Friday, 8 July 2016, 6:30, ayan guha wrote: Yes, it can. On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar wrote: thanks, so basically the Spark Thrift Server runs on a port, much like the Hive server that beeline connects to over JDBC? Can the Spark Thrift Server access Hive tables? regards On Friday, 8 July 2016, 5:27, ayan guha wrote: Spark Thrift Server works as a JDBC server; you can connect to it from any JDBC tool like SQuirreL. On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar wrote: Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write-up will be greatly appreciated. Regards -- Best Regards, Ayan Guha -- Best Regards, Ayan Guha
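As a rough sketch of the sort of thing being discussed (the bucket, path and table names are invented, and this assumes the s3a connector and a Spark 1.6-style HiveContext sharing the same metastore as the Spark Thrift Server):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// External table over CSV data held in S3; once registered in the shared metastore,
// beeline sessions connected to the Spark Thrift Server can query it too.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS trades (id INT, ts STRING, price DOUBLE)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION 's3a://my-bucket/trades/'""".stripMargin)

hiveContext.sql("SELECT COUNT(*) FROM trades").show()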
Re: Spark as sql engine on S3
thanks, so basically the Spark Thrift Server runs on a port, much like the Hive server that beeline connects to over JDBC? Can the Spark Thrift Server access Hive tables? regards On Friday, 8 July 2016, 5:27, ayan guha wrote: Spark Thrift Server works as a JDBC server; you can connect to it from any JDBC tool like SQuirreL. On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar wrote: Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write-up will be greatly appreciated. Regards -- Best Regards, Ayan Guha
Re: Presentation in London: Running Spark on Hive or Hive on Spark
Thanks. Will this presentation be recorded as well? Regards On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details: Organized by Hortonworks. Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM. Place: London. Location: One Canada Square, Canary Wharf, London E14 5AB. Nearest Underground: Canary Wharf (map). If you are interested please register here. Looking forward to seeing those who can make it to have an interesting discussion and leverage your experience. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Spark as sql engine on S3
Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write-up will be greatly appreciated. Regards
ORC or parquet with Spark
With Spark caching, which file format is better to use, Parquet or ORC? Obviously ORC can be used with Hive. My question is whether Spark can use the file, stripe and row-group statistics stored in an ORC file? Otherwise, to me both Parquet and ORC are simply files kept on HDFS; they do not offer any caching to be faster. So if Spark ignores the underlying stats for ORC files, does it matter which file format to use with Spark? Thanks
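For reference, a minimal sketch of writing and reading both formats from Spark (a Spark 1.6-style sqlContext with Hive support is assumed for the ORC source, and df and the paths are made up); whether the embedded statistics are actually used for skipping is exactly the open question above and depends on the data source and filter-pushdown settings:

// write the same DataFrame in both formats
df.write.format("parquet").save("hdfs:///tmp/trades_parquet")
df.write.format("orc").save("hdfs:///tmp/trades_orc")

// read back with a filter; pushdown into the file/stripe statistics is governed by
// spark.sql.parquet.filterPushdown and spark.sql.orc.filterPushdown respectively
val p = sqlContext.read.format("parquet").load("hdfs:///tmp/trades_parquet").filter("price > 50")
val o = sqlContext.read.format("orc").load("hdfs:///tmp/trades_orc").filter("price > 50")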
latest version of Spark to work OK as Hive engine
Hi, Looking at this presentation, "Hive on Spark is Blazing Fast"... Which is the latest version of Spark that can run as an engine for Hive, please? Thanks P.S. I am aware of Hive on TEZ, but that is not what I am interested in here. Warmest regards
JDBC load into tempTable
Hi, I have a SQL Server table with 500,000,000 rows with a primary key (unique clustered index) on the ID column. If I load it through JDBC into a DataFrame and register it via registerTempTable, will the data be in ID order in the temp table? Thanks
Re: Running Spark in local mode
Thank you all sirs. Appreciated, Mich, your clarification. On Sunday, 19 June 2016, 19:31, Mich Talebzadeh wrote: Thanks Jonathan for your points. I am aware of the fact that yarn-client and yarn-cluster are both deprecated (they still work in 1.6.1), hence the new nomenclature. Bear in mind this is what I stated in my notes: "YARN Cluster Mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is invoked with --master yarn and --deploy-mode cluster - YARN Client Mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. - Unlike Spark standalone mode, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn. This is invoked with --deploy-mode client" These are exactly from the Spark documentation, and I quote: "There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. Unlike Spark standalone and Mesos modes, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn." Cheers Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 19 June 2016 at 19:09, Jonathan Kelly wrote: Mich, what Jacek is saying is not that you implied that YARN relies on two masters. He's just clarifying that yarn-client and yarn-cluster modes are really both using the same (type of) master (simply "yarn"). In fact, if you specify "--master yarn-client" or "--master yarn-cluster", spark-submit will translate that into using a master URL of "yarn" and a deploy-mode of "client" or "cluster". And thanks, Jacek, for the tips on the "less-common master URLs". I had no idea that was an option! ~ Jonathan On Sun, Jun 19, 2016 at 4:13 AM Mich Talebzadeh wrote: Good points, but I am an experimentalist. In local mode I have this: with --master local, this will start with one thread, equivalent to --master local[1]. You can also start with more than one thread by specifying the number of threads k in --master local[k]. You can also start using all available threads with --master local[*], which in my case would be local[12].
The important thing about local mode is that the number of JVMs spawned is controlled by you, and you can start as many spark-submit jobs as you wish within the constraints of what you have:

${SPARK_HOME}/bin/spark-submit \
  --packages com.databricks:spark-csv_2.11:1.3.0 \
  --driver-memory 2G \
  --num-executors 1 \
  --executor-memory 2G \
  --master local \
  --executor-cores 2 \
  --conf "spark.scheduler.mode=FIFO" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
  --class "${FILE_NAME}" \
  --conf "spark.ui.port=4040" \
  ${JAR_FILE} \
  >> ${LOG_FILE}

Now that does work fine, although some of those parameters are implicit (for example scheduler.mode = FIFO or FAIR), and I can start different Spark jobs in local mode. Great for testing. With regard to your comments on standalone: "Spark Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster." s/simple/built-in. What is stated as "included" implies that, i.e. it comes as part of running Spark in standalone mode. On your other points on YARN cluster mode and YARN client mode: I'd say there's only one YARN master, i.e. --master yarn. You could however say where the driver runs, be it on your local machine where you executed spark-submit or on one node in a YARN cluster. Yes, that is I believe what the text implied. I would be very surprised if YARN as a resource manager relies on two masters :) HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 19 June 2016 at 11:46, Jacek Laskowski wrote: On Sun, Jun 19, 2016 at 12:30 PM, Mich Talebzadeh wrote: > Spark Local - Spark runs on the local host. This is the simplest set up and > best suited for learners who want to understand different concepts of Spark > and those performing unit testing. There are a
Re: Running Spark in local mode
thank you. What are the main differences between local mode and standalone mode? I understand local mode does not support a cluster. Is that the only difference? On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro wrote: Hi, In local mode, Spark runs in a single JVM that has a master and one executor with `k` threads. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94 // maropu On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar wrote: Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executor and resources? Do I need to start worker threads, and how many apps can I run safely without exceeding the allocated memory, etc.? Thanking you -- --- Takeshi Yamamuro
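To make the local[k] notation concrete, a minimal sketch (the thread count is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// driver, master and the single executor all run inside this one JVM
val conf = new SparkConf()
  .setAppName("LocalModeExample")
  .setMaster("local[4]")     // 4 worker threads; local[*] uses all available cores

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())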
Running Spark in local mode
Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executor and resources? Do I need to start worker threads, and how many apps can I run safely without exceeding the allocated memory, etc.? Thanking you
Re: Running Spark in Standalone or local modes
Thanks Mich. Great explanation On Saturday, 11 June 2016, 22:35, Mich Talebzadeh wrote: Hi Gavin, I believe in standalone mode a simple cluster manager is included with Spark that makes it easy to set up a cluster. It does not rely on YARN or Mesos. In summary, this is from my notes: - Spark Local - Spark runs on the localhost. This is the simplest set up and best suited for learners who want to understand different concepts of Spark and those performing unit testing. - Spark Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster. - YARN Cluster Mode - the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. - Mesos - I have not used it so cannot comment. - YARN Client Mode - the driver runs in the client process, and the application master is only used for requesting resources from YARN. Unlike Local or Spark standalone modes, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 11 June 2016 at 22:26, Gavin Yue wrote: Standalone mode is as opposed to YARN mode or Mesos mode, in which Spark uses YARN or Mesos for cluster management. Local mode is actually a standalone mode in which everything runs on a single local machine instead of remote clusters. That is my understanding. On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar wrote: Thank you, grateful. I know I can start spark-shell by launching the shell itself: spark-shell Now I know that in standalone mode I can also connect to a master: spark-shell --master spark://:7077 My point is, what are the differences between these two start-up modes for spark-shell? If I start spark-shell and connect to the master, what performance gain will I get, if any, or does it not matter? Is it the same as for spark-submit? regards On Saturday, 11 June 2016, 19:39, Mohammad Tariq wrote: Hi Ashok, In local mode all the processes run inside a single JVM, whereas in standalone mode we have separate master and worker processes running in their own JVMs. To quickly test your code from within your IDE you could probably use the local mode. However, to get a real feel of how Spark operates I would suggest you have a standalone setup as well. It's just a matter of launching a standalone cluster either manually (by starting a master and workers by hand), or by using the launch scripts provided with the Spark package. You can find more on this here. HTH Tariq, Mohammad | about.me/mti On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar wrote: Hi, What is the difference between running Spark in local mode or standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks
Re: Running Spark in Standalone or local modes
Thank you, grateful. I know I can start spark-shell by launching the shell itself: spark-shell Now I know that in standalone mode I can also connect to a master: spark-shell --master spark://:7077 My point is, what are the differences between these two start-up modes for spark-shell? If I start spark-shell and connect to the master, what performance gain will I get, if any, or does it not matter? Is it the same as for spark-submit? regards On Saturday, 11 June 2016, 19:39, Mohammad Tariq wrote: Hi Ashok, In local mode all the processes run inside a single JVM, whereas in standalone mode we have separate master and worker processes running in their own JVMs. To quickly test your code from within your IDE you could probably use the local mode. However, to get a real feel of how Spark operates I would suggest you have a standalone setup as well. It's just a matter of launching a standalone cluster either manually (by starting a master and workers by hand), or by using the launch scripts provided with the Spark package. You can find more on this here. HTH Tariq, Mohammad | about.me/mti On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar wrote: Hi, What is the difference between running Spark in local mode or standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks
Running Spark in Standalone or local modes
Hi, What is the difference between running Spark in local mode and standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks
Fw: Basic question on using one's own classes in the Scala app
Anyone can help me with this please? On Sunday, 5 June 2016, 11:06, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps:

class getCheckpointDirectory {
  def CheckpointDirectory(ProgramName: String): String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/" + ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir:

val ProgramName = this.getClass.getSimpleName.trim
val getCheckpointDirectory = new getCheckpointDirectory
val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error, as expected:

not found: type getCheckpointDirectory
[error] val getCheckpointDirectory = new getCheckpointDirectory
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following, as I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Or add the jar file to $CLASSPATH? Any advice will be appreciated. Thanks
Re: Basic question on using one's own classes in the Scala app
Thank you. I added this as a dependency: libraryDependencies += "com.databricks" % "apps.twitter_classifier" % "1.0.0" I chose the number at the end arbitrarily; is that correct? Also in my TwitterAnalyzer.scala I added this line: import com.databricks.apps.twitter_classifier._ Now I am getting this error:

[info] Resolving com.databricks#apps.twitter_classifier;1.0.0 ...
[warn] module not found: com.databricks#apps.twitter_classifier;1.0.0
[warn] local: tried
[warn] /home/hduser/.ivy2/local/com.databricks/apps.twitter_classifier/1.0.0/ivys/ivy.xml
[warn] public: tried
[warn] https://repo1.maven.org/maven2/com/databricks/apps.twitter_classifier/1.0.0/apps.twitter_classifier-1.0.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: com.databricks#apps.twitter_classifier;1.0.0: not found
[warn] ::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.databricks:apps.twitter_classifier:1.0.0 (/home/hduser/scala/TwitterAnalyzer/build.sbt#L18-19)
[warn] +- scala:scala_2.10:1.0
sbt.ResolveException: unresolved dependency: com.databricks#apps.twitter_classifier;1.0.0: not found

Any ideas? regards, On Sunday, 5 June 2016, 22:22, Jacek Laskowski wrote: On Sun, Jun 5, 2016 at 9:01 PM, Ashok Kumar wrote: > Now I have added this > > libraryDependencies += "com.databricks" % "apps.twitter_classifier" > > However, I am getting an error > > > error: No implicit for Append.Value[Seq[sbt.ModuleID], > sbt.impl.GroupArtifactID] found, > so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID] > libraryDependencies += "com.databricks" % "apps.twitter_classifier" > ^ > [error] Type error in expression > Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? Missing version element, e.g. libraryDependencies += "com.databricks" % "apps.twitter_classifier" % "VERSION_HERE" Jacek
Re: Basic question on using one's own classes in the Scala app
Hello, For 1, I read the doc as libraryDependencies += groupID % artifactID % revision

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep CheckpointDirectory
com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
getCheckpointDirectory.class

Now I have added this: libraryDependencies += "com.databricks" % "apps.twitter_classifier" However, I am getting an error:

error: No implicit for Append.Value[Seq[sbt.ModuleID], sbt.impl.GroupArtifactID] found,
so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID]
libraryDependencies += "com.databricks" % "apps.twitter_classifier"
^
[error] Type error in expression
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Any ideas very appreciated. Thanking you On Sunday, 5 June 2016, 17:39, Ted Yu wrote: For #1, please find examples on the net, e.g. http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html For #2, import com.databricks.apps.twitter_classifier.getCheckpointDirectory Cheers On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar wrote: Thank you sir. At compile time can I do something similar to libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" ? I have these: name := "scala" version := "1.0" scalaVersion := "2.10.4" And if I look at the jar file I have:

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep Check
1180 Sun Jun 05 10:14:36 BST 2016 com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class
1216 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class
615 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class

Two questions please: what do I need to put in the libraryDependencies line, and what do I need to add to the top of the Scala app, like import java.io.File, import org.apache.log4j.Logger, import org.apache.log4j.Level, import ...? Thanks On Sunday, 5 June 2016, 15:21, Ted Yu wrote: At compilation time, you need to declare the dependence on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps:

class getCheckpointDirectory {
  def CheckpointDirectory(ProgramName: String): String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/" + ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it.
It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar. Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir:

val ProgramName = this.getClass.getSimpleName.trim
val getCheckpointDirectory = new getCheckpointDirectory
val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error, as expected:

not found: type getCheckpointDirectory
[error] val getCheckpointDirectory = new getCheckpointDirectory
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated. Thanks
Re: Basic question on using one's own classes in the Scala app
Thank you sir. At compile time can I do something similar to libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" I have these name := "scala" version := "1.0" scalaVersion := "2.10.4" And if I look at jar file i have jar tvf utilities-assembly-0.1-SNAPSHOT.jar|grep Check 1180 Sun Jun 05 10:14:36 BST 2016 com/databricks/apps/twitter_classifier/getCheckpointDirectory.class 1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class 1216 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class 615 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class two questions please What do I need to put in libraryDependencies line and what do I need to add to the top of scala app like import java.io.Fileimport org.apache.log4j.Loggerimport org.apache.log4j.Levelimport ? Thanks On Sunday, 5 June 2016, 15:21, Ted Yu wrote: At compilation time, you need to declare the dependence on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about scala I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand and make references to these classes and methods in my other apps class getCheckpointDirectory { def CheckpointDirectory (ProgramName: String) : String = { var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName return hdfsDir }}I have used sbt to create a jar file for it. It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar Now I want to make a call to that method CheckpointDirectory in my app code myapp.dcala to return the value for hdfsDir val ProgramName = this.getClass.getSimpleName.trim val getCheckpointDirectory = new getCheckpointDirectory val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName) However, I am getting a compilation error as expected not found: type getCheckpointDirectory[error] val getCheckpointDirectory = new getCheckpointDirectory[error] ^[error] one error found[error] (compile:compileIncremental) Compilation failed So a basic question, in order for compilation to work do I need to create a package for my jar file or add dependency like the following I do in sbt libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" Any advise will be appreciated. Thanks
Basic question on using one's own classes in the Scala app
Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps:

class getCheckpointDirectory {
  def CheckpointDirectory (ProgramName: String) : String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar. Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir:

val ProgramName = this.getClass.getSimpleName.trim
val getCheckpointDirectory = new getCheckpointDirectory
val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error, as expected:

not found: type getCheckpointDirectory
[error] val getCheckpointDirectory = new getCheckpointDirectory
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated. Thanks
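A minimal sketch of what the two files can look like once the helper class carries an explicit package; the package name below is the one visible in the jar listing earlier in the thread, and MyApp is just an illustrative object name:

// Utilities.scala - helper class under an explicit package
package com.databricks.apps.twitter_classifier

class getCheckpointDirectory {
  // builds the HDFS checkpoint directory name for a given program
  def CheckpointDirectory(ProgramName: String): String =
    "hdfs://host:9000/user/user/checkpoint/" + ProgramName
}

// myapp.scala - import the class from that package and call it
import com.databricks.apps.twitter_classifier.getCheckpointDirectory

object MyApp {
  def main(args: Array[String]): Unit = {
    val programName = this.getClass.getSimpleName.trim
    val hdfsDir = new getCheckpointDirectory().CheckpointDirectory(programName)
    println(hdfsDir)
  }
}

At run time the assembly jar still has to be passed with --jars utilities-assembly-0.1-SNAPSHOT.jar, exactly as Ted describes above.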
processing twitter data
Hi all, I know very little about the subject. We would like to get streaming data from Twitter and Facebook, so a few questions please:
- What format is the data from Twitter? Is it JSON format?
- Can I use Spark and Spark Streaming for analysing the data?
- Can the data be fed in/streamed via Kafka from Twitter?
- What would be the optimum batch interval, window length and window sliding interval?
- What is the best method of storing this data in a database? Can I use Hive tables for it, and which format is most suitable please?
Thanking you
Does Spark support updates or deletes on underlying Hive tables
Hi, I can do inserts from Spark into Hive tables. How about updates or deletes? They fail when I try them. Thanking you
Using Java in Spark shell
Hello, A newbie question. Is it possible to use Java code directly in the Spark shell without using Maven to build a jar file? How can I switch from Scala to Java in the Spark shell? Thanks
Re: Using Spark on Hive with Hive also using Spark as its execution engine
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work? Thanking you

On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote:
Have a look at this thread
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 23 May 2016 at 09:10, Mich Talebzadeh wrote:
Hi Timur and everyone. I will answer your first question as it is very relevant.

1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well, so we may have 3 Spark installations simultaneously.

There are two distinct points here.

Using Spark as a query engine. That is BAU and most forum members use it every day. You run Spark with either Standalone, Yarn or Mesos as the cluster manager. You start the master that does the management of resources and you start slaves to create workers. You deploy Spark either by spark-shell, spark-sql or submit jobs through spark-submit etc. You may or may not use Hive as your database. You may use HBase via Phoenix etc. If you choose to use Hive as your database, on every host of the cluster including your master host you ensure that the Hive APIs are installed (meaning Hive is installed). In $SPARK_HOME/conf you create a soft link to the Hive configuration:

cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May 3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

Now in hive-site.xml you can define all the parameters needed for Spark connectivity. Remember we are making Hive use the Spark 1.3.1 engine. WE ARE NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or workers for Spark 1.3.1! It is just an execution engine like mr etc.

Let us look at how we do that in hive-site.xml, noting the settings for hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2 below. That tells Hive to use Spark 1.3.1 as the execution engine. You just install Spark 1.3.1 on the host (just the binary download); it is /usr/lib/spark-1.3.1-bin-hadoop2.6. In hive-site.xml you set the properties:

hive.execution.engine = spark
  Expects one of [mr, tez, spark]. Chooses the execution engine. Options are: mr (MapReduce, default), tez, spark. While MR remains the default engine for historical reasons, it is itself a historical engine and is deprecated in the Hive 2 line. It may be removed without further warning.

spark.home = /usr/lib/spark-1.3.1-bin-hadoop2
  something

hive.merge.sparkfiles = false
  Merge small files at the end of a Spark DAG transformation.

hive.spark.client.future.timeout = 60s
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified. Timeout for requests from the Hive client to the remote Spark driver.

hive.spark.job.monitor.timeout = 60s
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified. Timeout for the job monitor to get the Spark job state.

hive.spark.client.connect.timeout = 1000ms
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified. Timeout for the remote Spark driver in connecting back to the Hive client.

hive.spark.client.server.connect.timeout = 9ms
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
  Timeout for the handshake between the Hive client and the remote Spark driver. Checked by both processes.

hive.spark.client.secret.bits = 256
  Number of bits of randomness in the generated secret for communication between the Hive client and the remote Spark driver. Rounded down to the nearest multiple of 8.

hive.spark.client.rpc.threads = 8
  Maximum number of threads for the remote Spark driver's RPC event loop.

And other settings as well. That was the Hive stuff for your Spark BAU, so there are two distinct things.

Now going to Hive itself, you will need to add the correct assembly jar file for Hadoop. These are called spark-assembly-x.y.z-hadoop2.4.0.jar, where x.y.z in this case is 1.3.1. The assembly file is spark-assembly-1.3.1-hadoop2.4.0.jar, so you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib:

ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar

And you need to compile Spark from source excluding the Hadoop dependencies:

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So Hive uses the Spark engine by default. If you want to use mr
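For reference, the property list above is flattened hive-site.xml content; rewrapped in the standard Hive XML syntax, the two settings that actually switch the engine look roughly like this (values copied from Mich's description, nothing else assumed):

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.home</name>
  <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
</property>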
Monitoring Spark application progress
Hi, I would like to know the approaches and tools, please, for monitoring the full performance of a Spark app run through spark-shell or spark-submit:
- Through the Spark GUI at port 4040?
- Through OS utilities such as top and SAR
- Through Java tools like jbuilder etc
- Through integrating Spark with monitoring tools
Thanks
Spark handling spill overs
Hi, How can one avoid Spark spilling to disk once it has filled the node's memory? Thanks
Re: My notes on Spark Performance & Tuning Guide
Hi Dr Mich, I will be very keen to have a look at it and review it if possible. Please forward me a copy. Thanking you warmly

On Thursday, 12 May 2016, 11:08, Mich Talebzadeh wrote:
Hi All, Following the threads in the Spark forum, I decided to write up on the configuration of Spark, including the allocation of resources, the configuration of the driver, executors and threads, the execution of Spark apps and general troubleshooting, taking into account the allocation of resources for Spark applications and the OS tools at our disposal. Since the most widespread configuration, as I notice, is "Spark Standalone Mode", I have decided to write these notes starting with Standalone and later on moving to Yarn:
- Standalone - a simple cluster manager included with Spark that makes it easy to set up a cluster.
- YARN - the resource manager in Hadoop 2.
I would appreciate it if anyone interested in reading and commenting would get in touch with me directly on mich.talebza...@gmail.com so I can send the write-up for their review and comments. Just to be clear, this is not meant to be any commercial proposition or anything like that. As I seem to get involved with members troubleshooting issues and threads on this topic, I thought it worthwhile writing a note about it to summarise the findings for the benefit of the community. Regards.
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Re: Re: How big the spark stream window could be ?
Great. So in the simplest of forms, let us assume that I have a standalone host that runs Spark and receives topics from a source, say Kafka. So basically I have one executor and one cache on the node, and if my streaming data is too much, I anticipate there will be no execution as I don't have the memory. On the other hand, if I have enough memory allocated then the application will start. Every batch interval, data is refreshed from the topic on Kafka, so depending on the size of my windowLength that data will persist in memory for the windowLength duration and I will be able to analyze the aggregated data through the slidingInterval. However, if my windowLength is too big, as in this case of 24 hours, the host may not have enough memory to hold that amount of data for 24 hours? So the only option would be to reduce the size of the windowLength or reduce the volume of the topic?

On Monday, 9 May 2016, 10:49, Saisai Shao wrote:
Please see the inline comments.

On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar wrote:
Thank you. So if I create Spark streaming then:
- The streams will always need to be cached? They cannot be stored in persistent storage
You don't need to cache the stream explicitly if you don't have a specific requirement; Spark will do it for you depending on the streaming source (Kafka or socket).
- The stream data cached will be distributed among all nodes of Spark among executors
- As I understand, each Spark worker node has one executor that includes a cache, so the streaming data is distributed among these worker node caches. For example, if I have 4 worker nodes each cache will have a quarter of the data (this assumes that the cache size among worker nodes is the same).
Ideally it will be distributed evenly across the executors; this is also a target for tuning. Normally it depends on several conditions like receiver distribution and partition distribution.
The issue arises if the amount of streaming data does not fit into these 4 caches? Will the job crash?

On Monday, 9 May 2016, 10:16, Saisai Shao wrote:
No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have). For WindowedDStream, it will obviously cache the data in memory; from my understanding you don't need to call cache() again.

On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar wrote:
Hi, so if I have 10 GB of streaming data coming in, does it require 10 GB of memory in each node? Also, in that case, why do we need to use dstream.cache()? Thanks

On Monday, 9 May 2016, 9:58, Saisai Shao wrote:
It is up to how you write the Spark application; normally if the data is already on persistent storage, there's no need to put it into memory. The reason why Spark Streaming data has to be stored in memory is that a streaming source is not a persistent source, so you need to have a place to store the data.

On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote:
Thanks. What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB, do I also need 100 GB of RAM to do the one-time batch calculation?

At 2016-05-09 15:14:47, "Saisai Shao" wrote:
For window related operators, Spark Streaming will cache the data into memory within this window. In your case your window size is up to 24 hours, which means data has to be in the executor's memory for more than 1 day; this may introduce several problems when memory is not enough.
On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote:
OK, terms for Spark Streaming.

"Batch interval" is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 second intervals.

A window operator is defined by two parameters:
- WindowDuration / WindowsLength - the length of the window
- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward

OK, so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params:

// window length - the duration of the window; must be a multiple of the batch interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = x = m * n
// sliding interval - the interval at which the window operation is performed, in other words data is collected within this "previous interval"
val slidingInterval = y, where x/y = even number

Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval".

If you want to collect 1 hour of data then windowLength = 12 * 5 * 60 seconds
If you want to collect 24 hours of data then windowLength = 24 * 12 * 5 * 60

You sliding wind
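To make the arithmetic above concrete, a minimal sketch of a windowed count over a DStream, using the figures from this thread (batch interval 5 minutes, window 1 hour, slide 5 minutes); the socket source on localhost:9999 is only a placeholder for whatever the real source (e.g. Kafka) is:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("WindowExample")
val ssc = new StreamingContext(sparkConf, Seconds(5 * 60))   // batch interval = 5 minutes

val lines = ssc.socketTextStream("localhost", 9999)          // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // combine counts entering the window
    Seconds(12 * 5 * 60),          // window length    = 1 hour
    Seconds(5 * 60))               // sliding interval = 5 minutes

counts.print()
ssc.start()
ssc.awaitTermination()

The 24-hour report is the same code with Seconds(24 * 12 * 5 * 60) as the window length, which is exactly the case where, as Saisai notes, a whole day of data has to stay in executor memory.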
Re: Re: How big the spark stream window could be ?
Thank you. So If I create spark streaming then - The streams will always need to be cached? It cannot be stored in persistent storage - The stream data cached will be distributed among all nodes of Spark among executors - As I understand each Spark worker node has one executor that includes cache. So the streaming data is distributed among these work node caches. For example if I have 4 worker nodes each cache will have a quarter of data (this assumes that cache size among worker nodes is the same.) The issue raises if the amount of streaming data does not fit into these 4 caches? Will the job crash? On Monday, 9 May 2016, 10:16, Saisai Shao wrote: No, each executor only stores part of data in memory (it depends on how the partition are distributed and how many receivers you have). For WindowedDStream, it will obviously cache the data in memory, from my understanding you don't need to call cache() again. On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar wrote: hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do we need using dstream.cache() thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on you to write the Spark application, normally if data is already on the persistent storage, there's no need to be put into memory. The reason why Spark Streaming has to be stored in memory is that streaming source is not persistent source, so you need to have a place to store the data. On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote: Thanks.What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB. Do I also need a 100GB RAM to do the one time batch calculation ? At 2016-05-09 15:14:47, "Saisai Shao" wrote: For window related operators, Spark Streaming will cache the data into memory within this window, in your case your window size is up to 24 hours, which means data has to be in Executor's memory for more than 1 day, this may introduce several problems when memory is not enough. On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote: ok terms for Spark Streaming "Batch interval" is the basic interval at which the system with receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 seconds intervals.A window operator is defined by two parameters -- WindowDuration / WindowsLength - the length of the window- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. 
In other words, that is where the aggregates and summaries come from for your report. What is your data source here? HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 9 May 2016 at 04:19, kramer2...@126.com wrote:
We have some stream data that needs to be calculated and are considering using Spark Streaming to do it. We need to generate three kinds of reports. The reports are based on:
1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data
The frequency of the reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5 minute interval and two windows of 1 hour and 1 day. But I am worried that a window of one day or one hour is too big. I do not have much experience with Spark Streaming, so what is the window length in your environment? Any official docs talking about this?
Re: Re: How big the spark stream window could be ?
hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do we need using dstream.cache() thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on you to write the Spark application, normally if data is already on the persistent storage, there's no need to be put into memory. The reason why Spark Streaming has to be stored in memory is that streaming source is not persistent source, so you need to have a place to store the data. On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote: Thanks.What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB. Do I also need a 100GB RAM to do the one time batch calculation ? At 2016-05-09 15:14:47, "Saisai Shao" wrote: For window related operators, Spark Streaming will cache the data into memory within this window, in your case your window size is up to 24 hours, which means data has to be in Executor's memory for more than 1 day, this may introduce several problems when memory is not enough. On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote: ok terms for Spark Streaming "Batch interval" is the basic interval at which the system with receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 seconds intervals.A window operator is defined by two parameters -- WindowDuration / WindowsLength - the length of the window- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. In other words that where the aggregates and summaries come for your report. What is your data source here? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 9 May 2016 at 04:19, kramer2...@126.com wrote: We have some stream data need to be calculated and considering use spark stream to do it. We need to generate three kinds of reports. The reports are based on 1. The last 5 minutes data 2. The last 1 hour data 3. The last 24 hour data The frequency of reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to set up a spark stream with 5 minutes interval and two window which are 1 hour and 1 day. But I am worrying that if the window is too big for one day and one hour. I do not have much experience on spark stream, so what is the window length in your environment? Any official docs talking about this? 
Re: Defining case class within main method throws "No TypeTag available for Accounts"
Thanks Michael; as I gathered, for now it is a feature.

On Monday, 25 April 2016, 18:36, Michael Armbrust wrote:
When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method. Spark doesn't have access to this scope, so this makes it hard (impossible?) for us to construct new instances of that class. So, define the classes that you plan to use with Spark at the top level.

On Mon, Apr 25, 2016 at 9:36 AM, Mich Talebzadeh wrote:
Hi, I notice that building with sbt, if I define my case class outside of the main method like below, it works:

case class Accounts( TransactionDate: String, TransactionType: String, Description: String, Value: Double, Balance: Double, AccountName: String, AccountNumber : String)

object Import_nw_10124772 {
  def main(args: Array[String]) {
    val conf = new SparkConf().
      setAppName("Import_nw_10124772").
      setMaster("local[12]").
      set("spark.driver.allowMultipleContexts", "true").
      set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    // Create sqlContext based on HiveContext
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._
    val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    println ("\nStarted at")
    sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') ").collect.foreach(println)
    //
    // Get a DF first based on Databricks CSV libraries, ignore column heading because of column called "Type"
    //
    val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
    //df.printSchema
    //
    val a = df.filter(col("Date") > "").map(p => Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))

However, if I put that case class within the main method, it throws "No TypeTag available for Accounts". Apparently when the case class is defined inside the method where it is being used, it is not fully defined at that point. Is this a bug within Spark? Thanks
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Invoking SparkR from Spark shell
Hi, I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with Spark. Is there a shell similar to spark-shell that supports R besides Scala please? Thanks
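A quick sketch of the usual entry point, assuming a standard Spark 1.6.1 binary install with R available on the machine (the data-frame line is only an illustration using R's built-in faithful dataset):

$SPARK_HOME/bin/sparkR        # launches an R shell with SparkR loaded; sc and sqlContext are pre-created

> df <- createDataFrame(sqlContext, faithful)
> head(df)

SparkR scripts can also be run non-interactively, e.g. $SPARK_HOME/bin/spark-submit my_script.R, where my_script.R is whatever your R file is called.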
Re: Spark replacing Hadoop
Hello, Well, it sounds like Andy is implying that Spark can replace Hadoop, whereas Mich still believes that HDFS is a keeper? Thanks

On Thursday, 14 April 2016, 20:40, David Newberger wrote:
Can we assume your question is "Will Spark replace Hadoop MapReduce?" or do you literally mean replacing the whole of Hadoop?
David

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Thursday, April 14, 2016 2:13 PM
To: User
Subject: Spark replacing Hadoop

Hi, I hear some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best
Spark replacing Hadoop
Hi, I hear some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best
Spark GUI, Workers and Executors
On the Spark GUI I can see the list of Workers. I always understood that executors run on workers. What is the relationship between workers and executors please? Is it one to one? Thanks
Copying all Hive tables from Prod to UAT
Hi, Does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to copy the table data to external files, then move the external files to a local target directory and populate the tables in the target Hive with that data. Is there an easier way of doing so? Thanks
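One possible shortcut, sketched here on the assumption that the two clusters can exchange files over HDFS (e.g. via distcp) and that the database/table names below are placeholders, is Hive's built-in EXPORT/IMPORT, which carries both the data and the table metadata:

-- on Production
EXPORT TABLE mydb.mytable TO '/tmp/exports/mytable';
-- copy /tmp/exports/mytable across clusters, e.g. with hadoop distcp
-- on UAT
IMPORT TABLE mydb.mytable FROM '/tmp/exports/mytable';

Whether this fits depends on the table formats and Hive versions on the two clusters, so treat it as one option to evaluate rather than a recommendation.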
difference between simple streaming and windows streaming in spark
Does simple streaming mean continuous processing of each batch as it arrives, and windowed streaming mean processing over a time window?

val ssc = new StreamingContext(sparkConf, Seconds(10))

Thanks
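A small sketch of the distinction, reusing the 10-second context above (the socket source is just a placeholder): without a window each operation sees only the latest 10-second batch, while window() lets an operation see the last several batches at once.

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source

// "simple" streaming: counts one 10-second batch at a time
lines.count().print()

// windowed streaming: every 10 seconds, counts the last 60 seconds of data
lines.window(Seconds(60), Seconds(10)).count().print()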
Working out SQRT on a list
Hi, I would like a simple sqrt operation on a list but I don't get the result:

scala> val l = List(1,5,786,25)
l: List[Int] = List(1, 5, 786, 25)

scala> l.map(x => x * x)
res42: List[Int] = List(1, 25, 617796, 625)

scala> l.map(x => x * x).sqrt
:28: error: value sqrt is not a member of List[Int]
       l.map(x => x * x).sqrt

Any ideas? Thanks
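The error is because sqrt is not a method on List; it lives in scala.math and has to be applied element by element inside the map. A minimal sketch:

val l = List(1, 5, 786, 25)

// square each element, then take the square root of each squared value
val roots: List[Double] = l.map(x => math.sqrt((x * x).toDouble))
// roots: List[Double] = List(1.0, 5.0, 786.0, 25.0)

// or, if the intent was simply the square root of the original numbers:
val simpleRoots = l.map(x => math.sqrt(x.toDouble))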
Spark process creating and writing to a Hive ORC table
Hello, How feasible is it to use Spark to extract csv files and create and write the content to an ORC table in a Hive database? Also, is Parquet the best (optimum) format to write to HDFS from a Spark app? Thanks
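For the first part, a minimal sketch of the csv-to-ORC path on Spark 1.x with the databricks csv reader discussed elsewhere in these threads (the HDFS path, database and table names are placeholders):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://host:9000/data/stg/accounts/mycsvdir")   // placeholder input path

// write the DataFrame out as an ORC-backed Hive table
df.write.format("orc").saveAsTable("mydb.my_orc_table")  // placeholder db.table

Whether ORC or Parquet is "best" mostly depends on what reads the data afterwards; both are columnar and compressed, and Hive-heavy shops often lean towards ORC while Spark-only pipelines often default to Parquet.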
Re: Spark and N-tier architecture
Thank you both. So am I correct that Spark fits within the application tier in an N-tier architecture?

On Tuesday, 29 March 2016, 23:50, Alexander Pivovarov wrote:
Spark is a distributed data processing engine plus a distributed in-memory / disk data cache. spark-jobserver provides a REST API to your Spark applications. It allows you to submit jobs to Spark and get results in sync or async mode. It also can create a long-running Spark context to cache RDDs in memory under some name (namedRDD) and then use it to serve requests from multiple users. Because the RDD is in memory, the response should be super fast (seconds). https://github.com/spark-jobserver/spark-jobserver

On Tue, Mar 29, 2016 at 2:50 PM, Mich Talebzadeh wrote:
Interesting question. The most widely used application of N-tier is the traditional three-tier architecture that has been the backbone of client-server architecture, having a presentation layer, an application layer and a data layer. This is primarily for performance, scalability and maintenance. The most profound change that the Big Data space has introduced to N-tier architecture is the concept of horizontal scaling, as opposed to the previous tiers that relied on vertical scaling. HDFS is an example of horizontal scaling at the data tier by adding more JBODs to storage. Similarly, adding more nodes to a Spark cluster should result in better performance. Bear in mind that these tiers are at logical levels, which means that there may or may not be as many physical layers; for example, multiple virtual servers can be hosted on the same physical server. With regard to Spark, it is effectively a powerful query tool that sits between the presentation layer (say Tableau) and HDFS or Hive, as you alluded. In that sense you can think of Spark as part of the application layer that communicates with the backend via a number of protocols including standard JDBC. There is a rather blurred vision here of whether Spark is a database or a query tool. IMO it is a query tool, in the sense that Spark by itself does not have its own storage concept or metastore; thus it relies on others to provide that service. HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 29 March 2016 at 22:07, Ashok Kumar wrote:
Experts, One of the terms I hear used within Big Data is N-tier architecture, used for availability, performance etc. I also hear that Spark, by means of its query engine and in-memory caching, fits into the middle tier (application layer), with HDFS and Hive perhaps providing the data tier. Can someone elaborate on the role of Spark here? For example, a Scala program that we write uses JDBC to talk to databases, so in that sense is Spark a middle tier application? I hope that someone can clarify this, and if so what would be the best practice in using Spark as a middle tier within Big Data. Thanks
Spark and N-tier architecture
Experts, One of the terms I hear used within Big Data is N-tier architecture, used for availability, performance etc. I also hear that Spark, by means of its query engine and in-memory caching, fits into the middle tier (application layer), with HDFS and Hive perhaps providing the data tier. Can someone elaborate on the role of Spark here? For example, a Scala program that we write uses JDBC to talk to databases, so in that sense is Spark a middle tier application? I hope that someone can clarify this, and if so what would be the best practice in using Spark as a middle tier within Big Data. Thanks
Re: Databricks fails to read the csv file with blank line at the file header
Thanks a ton sir. Very helpful.

On Monday, 28 March 2016, 22:36, Mich Talebzadeh wrote:
Pretty straightforward:

#!/bin/ksh
DIR="hdfs://:9000/data/stg/accounts/nw/x"
#
## Remove the blank header line from the spreadsheets and compress them
#
echo `date` " ""=== Started Removing blank header line and Compressing all csv FILEs"
for FILE in `ls *.csv`
do
  sed '1d' ${FILE} > ${FILE}.tmp
  mv -f ${FILE}.tmp ${FILE}
  /usr/bin/bzip2 ${FILE}
done
#
## Clear out hdfs staging directory
#
echo `date` " ""=== Started deleting old files from hdfs staging directory ${DIR}"
hdfs dfs -rm -r ${DIR}/*.bz2
echo `date` " ""=== Started Putting bz2 fileS to hdfs staging directory ${DIR}"
for FILE in `ls *.bz2`
do
  hdfs dfs -copyFromLocal ${FILE} ${DIR}
done
echo `date` " ""=== Checking that all files are moved to hdfs staging directory"
hdfs dfs -ls ${DIR}
exit 0

HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 28 March 2016 at 22:24, Ashok Kumar wrote:
Hello Mich, If you can accommodate it, can you please share your approach to steps 1-3 above. Best regards

On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote:
Pretty simple; as usual it is a combination of ETL and ELT. Basically csv files are loaded into a staging directory on the host and compressed before pushing into hdfs:
- ETL --> Get rid of the blank header line on the csv files
- ETL --> Compress the csv files
- ETL --> Put the compressed CSV files into the hdfs staging directory
- ELT --> Use databricks to load the csv files
- ELT --> Spark FP to process the csv data
- ELT --> register it as a temporary table
- ELT --> Create an ORC table in a named database in compressed zlib2 format in the Hive database
- ELT --> Insert/select from the temporary table to the Hive table
So the data is stored in an ORC table and one can do whatever analysis is needed using Spark, Hive etc.
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 27 March 2016 at 03:05, Koert Kuipers wrote:
To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals with this...

On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote:
Hi, I have a standard csv file (saved as csv in HDFS) that has a blank first line at the header, as follows:

[blank line]
Date, Type, Description, Value, Balance, Account Name, Account Number
[blank line]
22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663",

When I read this file using the following standard

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")

it crashes:

java.util.NoSuchElementException at java.util.ArrayList$Itr.next(ArrayList.java:794)

If I go and manually delete the first blank line it works OK:

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string]

I can easily write a shell script to get rid of the blank line. I was wondering if databricks has a flag to get rid of the first blank line in the csv file format?

P.S. If the file is stored as a DOS text file, this problem goes away.
Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Re: Databricks fails to read the csv file with blank line at the file header
Hello Mich If you accommodate can you please share your approach to steps 1-3 above. Best regards On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote: Pretty simple as usual it is a combination of ETL and ELT. Basically csv files are loaded into staging directory on host, compressed before pushing into hdfs - ETL --> Get rid of the header blank line on the csv files - ETL --> Compress the csv files - ETL --> Put the compressed CVF files into hdfs staging directory - ELT --> Use databricks to load the csv files - ELT --> Spark FP to prcess the csv data - ELT --> register it as a temporary table - ELT --> Create an ORC table in a named database in compressed zlib2 format in Hive database - ELT --> Insert/select from temporary table to Hive table So the data is stored in an ORC table and one can do whatever analysis using Spark, Hive etc Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 27 March 2016 at 03:05, Koert Kuipers wrote: To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals this...On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote: Hi, I have a standard csv file (saved as csv in HDFS) that has first line of blank at the headeras follows [blank line] Date, Type, Description, Value, Balance, Account Name, Account Number[blank line]22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663", When I read this file using the following standard val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/") it crashes. java.util.NoSuchElementException at java.util.ArrayList$Itr.next(ArrayList.java:794) If I go and manually delete the first blank line it works OK val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/") df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string] I can easily write a shell script to get rid of blank line. I was wondering if databricks does have a flag to get rid of the first blank line in csv file format? P.S. If the file is stored as DOS text file, this problem goes away. Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Re: Finding out the time a table was created
1.5.2 Ted. Those two lines I don't know where they come. It finds and gets the table info OK HTH On Friday, 25 March 2016, 22:32, Ted Yu wrote: Which release of Spark do you use, Mich ? In master branch, the message is more accurate (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala): override def getMessage: String = s"Table $table not found in database $db" On Fri, Mar 25, 2016 at 3:21 PM, Mich Talebzadeh wrote: You can use DESCRIBE FORMATTED . to get that info. This is based on the same command in Hive however, it throws two erroneous error lines as shown below (don't see them in Hive DESCRIBE ...) Example scala> sql("describe formatted test.t14").collect.foreach(println) 16/03/25 22:32:38 ERROR Hive: Table test not found: test.test table not found 16/03/25 22:32:38 ERROR Hive: Table test not found: test.test table not found [# col_name data_type comment ] [ ] [invoicenumber int ] [paymentdate date ] [net decimal(20,2) ] [vat decimal(20,2) ] [total decimal(20,2) ] [ ] [# Detailed Table Information ] [Database: test ] [Owner: hduser ] [CreateTime: Fri Mar 25 22:13:44 GMT 2016 ] [LastAccessTime: UNKNOWN ] [Protect Mode: None ] [Retention: 0 ] [Location: hdfs://rhes564:9000/user/hive/warehouse/test.db/t14 ] [Table Type: MANAGED_TABLE ] [Table Parameters: ] [ COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}] [ comment from csv file from excel sheet] [ numFiles 2 ] [ orc.compress ZLIB ] [ totalSize 1090 ] [ transient_lastDdlTime 1458944025 ] [ ] [# Storage Information ] [SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde ] [InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat ] [OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat ] [Compressed: No ] [Num Buckets: -1 ] [Bucket Columns: [] ] [Sort Columns: [] ] [Storage Desc Params: ] [ serialization.format 1 ] HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 25 March 2016 at 22:12, Ashok Kumar wrote: Experts, I would like to know when a table was created in Hive database using Spark shell? Thanks
Finding out the time a table was created
Experts, I would like to know when a table was created in Hive database using Spark shell? Thanks
Re: calling individual columns from spark temporary table
Thank you again. For

val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).)

can you give an example of a column expression please, like df.filter(col("paid") > "").col("firstcolumn").getString?

On Thursday, 24 March 2016, 0:45, Michael Armbrust wrote:
You can only use as on a Column expression, not inside of a lambda function. The reason is the lambda function is compiled into opaque bytecode that Spark SQL is not able to see; we just blindly execute it. However, there are a couple of ways to name the columns that come out of a map: either use a case class instead of a tuple, or use .toDF("name1", "name2") after the map. From a performance perspective, it is even better if you can avoid maps and stick to Column expressions. The reason is that for maps we have to actually materialize an object to pass to your function, whereas if you stick to Column expressions we can work directly on serialized data.

On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar wrote:
Thank you sir

sql("select `_1` as firstcolumn from items")

Is there any way one can keep the csv column names using databricks when mapping

val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).)

Can I call, for example, x.getString(0).as(firstcolumn) in the above when mapping, if possible, so the columns will have labels?

On Thursday, 24 March 2016, 0:18, Michael Armbrust wrote:
You probably need to use `backticks` to escape `_1` since I don't think that it is a valid SQL identifier.

On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote:
Gurus, If I register a temporary table as below

r.toDF
res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

r.toDF.registerTempTable("items")

sql("select * from items")
res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

Is there any way I can do a select on the first column only?

sql("select _1 from items") throws an error

Thanking you
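Pulling Michael's two suggestions together, a minimal sketch of naming the columns that come out of a map (the column names and the Payments case class are illustrative, not from the original csv):

// assumes a DataFrame df with a string column "paid", as in the thread
import org.apache.spark.sql.functions.col
import sqlContext.implicits._   // brings in toDF for the mapped RDD

// Option 1: name the tuple columns right after the map
val named = df.filter(col("paid") > "")
  .map(x => (x.getString(0), x.getString(1)))
  .toDF("firstcolumn", "secondcolumn")

// Option 2: map into a case class defined at the top level; its field names become the column names
case class Payments(firstcolumn: String, secondcolumn: String)
val typed = df.filter(col("paid") > "")
  .map(x => Payments(x.getString(0), x.getString(1)))
  .toDF()

named.registerTempTable("items")
sqlContext.sql("select firstcolumn from items").show()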
Re: calling individual columns from spark temporary table
thank you sir sql("select `_1` as firstcolumn from items") is there anyway one can keep the csv column names using databricks when mapping val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).) can I call example x.getString(0).as.(firstcolumn) in above when mapping if possible so columns will have labels On Thursday, 24 March 2016, 0:18, Michael Armbrust wrote: You probably need to use `backticks` to escape `_1` since I don't think that its a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote: Gurus, If I register a temporary table as below r.toDFres58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] r.toDF.registerTempTable("items") sql("select * from items")res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] Is there anyway I can do a select on the first column only sql("select _1 from items" throws error Thanking you
calling individual columns from spark temporary table
Gurus, If I register a temporary table as below

r.toDF
res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

r.toDF.registerTempTable("items")

sql("select * from items")
res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

Is there any way I can do a select on the first column only?

sql("select _1 from items") throws an error

Thanking you
reading csv file, operation on column or columns
Gurus, I would like to read a csv file into a DataFrame, but be able to rename a column, change a column type from String to Integer, or drop a column from further analysis, before saving the data as a parquet file. How can I do this? Thanks
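A minimal sketch of that pipeline on Spark 1.x with the databricks csv package (the file path, column names and output path are all placeholders):

import org.apache.spark.sql.functions.col

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs://host:9000/data/stg/mycsvdir")           // placeholder input

val cleaned = df
  .withColumnRenamed("old_name", "new_name")             // rename a column
  .withColumn("amount", col("amount").cast("int"))       // change a column type from String to Integer
  .drop("unwanted_column")                                // drop a column from further analysis

cleaned.write.parquet("hdfs://host:9000/data/out/myparquet")   // placeholder output

withColumnRenamed, withColumn plus cast, and drop all exist on DataFrame in Spark 1.5+; col comes from org.apache.spark.sql.functions.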
Setting up spark to run on two nodes
Experts, please, your valued advice. I have Spark 1.5.2 set up as standalone for now and I have started the master as below:

start-master.sh

I have also modified the conf/slaves file to have:

# A Spark Worker will be started on each of the machines listed below.
localhost
workerhost

On the localhost I start a slave as follows:

start-slave.sh spark://localhost:7077

Questions. If I want a worker process to be started not only on localhost but also on workerhost:
1) Do I just need to run start-slave.sh on localhost and it will start the worker process on the other node -> workerhost?
2) Or do I have to run start-slave.sh spark://workerhost:7077 locally on workerhost as well?
3) On the GUI at http://localhost:4040/environment/ I do not see any reference to a worker process running on workerhost.

Appreciate any help on how to go about starting the master on localhost and starting two workers, one on localhost and the other on workerhost. Thanking you
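A short sketch of the usual standalone recipe for this layout (hostnames as in the question; assumes passwordless ssh from localhost to workerhost and the same Spark install path on both):

# $SPARK_HOME/conf/slaves on the master host - one worker host per line
localhost
workerhost

# start the master plus one worker on every host listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh

# alternatively, start a worker by hand on each host:
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077

Registered workers show up on the master web UI at http://localhost:8080, not on the per-application UI at port 4040, which is likely why nothing appears under /environment.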
shuffle in spark
Experts, please, I need to understand how shuffling works in Spark and which parameters influence it. I am sorry, but my knowledge of shuffling is very limited. A practical use case would help if you can provide one. Regards
Spark configuration with 5 nodes
Hi, We intend to use 5 servers which will be utilized for building a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others). All servers have the same configuration: 512 GB RAM, 30 TB storage and 16 cores, running Ubuntu Linux. Hadoop will be installed on all the servers/nodes. Server 1 will be used for the NameNode plus a DataNode as well. Server 2 will be used for the standby NameNode and a DataNode. The rest of the servers will be used as DataNodes. Now we would like to install Spark on each server to create a Spark cluster. Is that a good thing to do, or should we buy additional hardware for Spark (minding cost here), or do we simply require additional memory to accommodate Spark as well please? In that case, how much memory for each Spark node would you recommend? Thanks all
HBASE
Hi Gurus, I am relatively new to Big Data and know something about Spark and Hive. I was wondering whether I need to pick up skills on HBase as well. I am not sure how it works, but I know that it is a kind of columnar NoSQL database. I know it is good to learn something new in the Big Data space; I am just wondering if I am better off spending my effort on something else please. Appreciate any advice. Regards
Re: Converting array to DF
Thanks great val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toSeq.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) On Tuesday, 1 March 2016, 20:52, Shixiong(Ryan) Zhu wrote: For Array, you need to all `toSeq` at first. Scala can convert Array to ArrayOps automatically. However, it's not a `Seq` and you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok Kumar wrote: Thank you sir This works OKimport sqlContext.implicits._ val weights = Seq(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) Please why Array did not work? On Tuesday, 1 March 2016, 8:51, Jeff Zhang wrote: Change Array to Seq and import sqlContext.implicits._ On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote: Hi, I have this val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to DF but I get thisor weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value toDF is not a member of Array[(String, Int)] weights.toDF("weights","value") I want to label columns and print out the contents in value order please I don't know why I am getting this error Thanks -- Best Regards Jeff Zhang
Re: Converting array to DF
Thank you sir This works OKimport sqlContext.implicits._ val weights = Seq(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) Please why Array did not work? On Tuesday, 1 March 2016, 8:51, Jeff Zhang wrote: Change Array to Seq and import sqlContext.implicits._ On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote: Hi, I have this val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to DF but I get thisor weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value toDF is not a member of Array[(String, Int)] weights.toDF("weights","value") I want to label columns and print out the contents in value order please I don't know why I am getting this error Thanks -- Best Regards Jeff Zhang
Converting array to DF
Hi, I have this:

val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6))
weights.toDF("weights","value")

I want to convert the Array to a DF but I get this:

weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6))
:33: error: value toDF is not a member of Array[(String, Int)]
       weights.toDF("weights","value")

I want to label the columns and print out the contents in value order please. I don't know why I am getting this error. Thanks
Re: Recommendation for a good book on Spark, beginner to moderate knowledge
Thank you all for valuable advice. Much appreciated Best On Sunday, 28 February 2016, 21:48, Ashok Kumar wrote: Hi Gurus, Appreciate if you recommend me a good book on Spark or documentation for beginner to moderate knowledge I very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on net. However, some of them not clear at least to me. Warmest regards
Recommendation for a good book on Spark, beginner to moderate knowledge
Hi Gurus, Appreciate if you recommend me a good book on Spark or documentation for beginner to moderate knowledge I very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on net. However, some of them not clear at least to me. Warmest regards
Re: Ordering two dimensional arrays of (String, Int) in the order of second element
no particular reason. just wanted to know if there was another way as well. thanks On Saturday, 27 February 2016, 22:12, Yin Yang wrote: Is there particular reason you cannot use temporary table ? Thanks On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar wrote: Thank you sir. Can one do this sorting without using temporary table if possible? Best On Saturday, 27 February 2016, 18:50, Yin Yang wrote: scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test order by b")df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show++|struct(id, b, a)|+----+| [2,foo,a]|| [1,test,b]|++ On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar wrote: Hello, I like to be able to solve this using arrays. I have two dimensional array of (String,Int) with 5 entries say arr("A",20), arr("B",13), arr("C", 18), arr("D",10), arr("E",19) I like to write a small code to order these in the order of highest Int column so I will have arr("A",20), arr("E",19), arr("C",18) What is the best way of doing this using arrays only? Thanks
Re: Ordering two dimensional arrays of (String, Int) in the order of second element
Thank you sir. Can one do this sorting without using a temporary table, if possible? Best

On Saturday, 27 February 2016, 18:50, Yin Yang wrote:
scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test")

scala> val df = sql("SELECT struct(id, b, a) from test order by b")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct]

scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|       [2,foo,a]|
|      [1,test,b]|
+----------------+

On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar wrote:
Hello, I would like to be able to solve this using arrays. I have a two dimensional array of (String, Int) with 5 entries, say arr("A",20), arr("B",13), arr("C",18), arr("D",10), arr("E",19). I would like to write a small piece of code to order these by the highest Int column, so I will have arr("A",20), arr("E",19), arr("C",18). What is the best way of doing this using arrays only? Thanks
Ordering two dimensional arrays of (String, Int) in the order of second element
Hello, I would like to be able to solve this using arrays. I have a two dimensional array of (String, Int) with 5 entries, say arr("A",20), arr("B",13), arr("C",18), arr("D",10), arr("E",19). I would like to write a small piece of code to order these by the highest Int column, so I will have arr("A",20), arr("E",19), arr("C",18). What is the best way of doing this using arrays only? Thanks
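For the arrays-only route asked about here, a minimal sketch with plain Scala collections (no Spark needed); the element values follow the example above:

val arr = Array(("A", 20), ("B", 13), ("C", 18), ("D", 10), ("E", 19))

// sort by the Int element, highest first, and keep the top 3
val top3 = arr.sortBy(-_._2).take(3)
// top3: Array[(String, Int)] = Array((A,20), (E,19), (C,18))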
Clarification on RDD
Hi, The Spark doco says Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs, for example:

val textFile = sc.textFile("README.md")

My question is: when an RDD is created like above from a file stored on HDFS, does that mean that the data is distributed among all the nodes in the cluster, or is the data from the md file copied to each node of the cluster so that each node has a complete copy of the data? And is the data actually moved around at this point, or is it not read until an action like count() is performed on the RDD? Thanks
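A small sketch of the laziness half of the question, runnable in spark-shell (README.md is whatever file you point it at):

val textFile = sc.textFile("README.md")   // nothing is read yet: this only records how to build the RDD
val upper = textFile.map(_.toUpperCase)    // still nothing read: transformations just extend the lineage

val n = upper.count()                      // action: only now are the file's partitions actually read

As for distribution, each partition of the HDFS file is read by whichever executor processes that partition; the file is not copied wholesale to every node.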
d.filter("id in max(id)")
Hi, How can I make this work?

val d = HiveContext.table("table")

select * from table where ID = MAX(ID)

Thanks
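A minimal DataFrame-side sketch of the same thing on Spark 1.x (the table name is a placeholder; hiveContext is the usual HiveContext instance):

import org.apache.spark.sql.functions.{col, max}

val d = hiveContext.table("mytable")

// compute the maximum ID first, then filter on it
val maxId = d.agg(max(col("ID"))).first().get(0)
val latest = d.filter(col("ID") === maxId)
latest.show()

The SQL equivalent, which also relates to the next question below, is a subquery: select * from mytable where ID in (select max(ID) from mytable). Whether that subquery form is accepted depends on the Spark version (older 1.x Spark SQL releases did not support IN subqueries), so the two-step DataFrame approach is the safer bet on 1.x.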
select * from mytable where column1 in (select max(column1) from mytable)
Hi, What is the equivalent of this in Spark please select * from mytable where column1 in (select max(column1) from mytable) Thanks
Filter on a column having multiple values
Hi, I would like to do the following:

select count(*) from <table> where column1 in (1,5)

I define:

scala> var t = HiveContext.table("table")

This works:

t.filter($"column1" === 1)

How can I expand this to have column1 match both 1 and 5 please? Thanks
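A minimal sketch of the multi-value filter; isin takes a varargs list of values:

// count of rows where column1 is either 1 or 5
val n = t.filter($"column1".isin(1, 5)).count()

// equivalent without isin
val n2 = t.filter($"column1" === 1 || $"column1" === 5).count()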
Re: Execution plan in spark
looks useful thanks On Wednesday, 24 February 2016, 9:42, Yin Yang wrote: Is the following what you were looking for ? sqlContext.sql(""" CREATE TEMPORARY TABLE partitionedParquet USING org.apache.spark.sql.parquet OPTIONS ( path '/tmp/partitioned' )""") table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok Kumar wrote: Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards
Execution plan in spark
Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards
Re: install databricks csv package for spark
Great, thank you.

On Friday, 19 February 2016, 15:33, Holden Karau wrote:
So with --packages, spark-shell and spark-submit will automatically fetch the requirements from Maven. If you want to use an explicit local jar you can do that with the --jars syntax. You might find http://spark.apache.org/docs/latest/submitting-applications.html useful.

On Fri, Feb 19, 2016 at 7:26 AM, Ashok Kumar wrote:
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (https://github.com/databricks/spark-csv - CSV data source for Spark SQL and DataFrames). Now I have a directory created called spark-csv-master. I would like to use this in spark-shell with --packages like below:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0

Do I need to use mvn to create a zipped file to start, or maybe add it to the Spark CLASSPATH? What is needed here please to make it work? Thanks

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
install databricks csv package for spark
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (https://github.com/databricks/spark-csv - CSV data source for Spark SQL and DataFrames). Now I have a directory created called spark-csv-master. I would like to use this in spark-shell with --packages like below:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0

Do I need to use mvn to create a zipped file to start, or maybe add it to the Spark CLASSPATH? What is needed here please to make it work? Thanks