Hi Gurus,
The parameter spark.yarn.executor.memoryOverhead is explained as below:
spark.yarn.executor.memoryOverhead
executorMemory * 0.10, with minimum of 384
The amount of off-heap memory (in megabytes) to be allocated per executor. This
is memory that accounts for things like VM overheads,
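For context, the default can be overridden at submit time. A hedged sketch in Scala (the values are illustrative only; the property name is as given in the YARN section of the Spark configuration docs):

```
import org.apache.spark.SparkConf

// Sketch: raising the off-heap overhead above the default
// max(executorMemory * 0.10, 384 MB); values are illustrative.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB per executor
```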
Hi,
How does Spark handle compressed files? Are they optimizable in terms of using
multiple RDDs against the file, or does one need to uncompress them beforehand,
say for bzip2-type files?
thanks
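The practical answer turns on the codec: bzip2 is splittable, so Spark can read a single .bz2 file into several partitions without uncompressing it first, whereas gzip is not splittable and always yields one partition per file. A sketch (paths are hypothetical):

```
// Splittable codec (bzip2): the minPartitions hint can be honoured.
val bz = sc.textFile("/data/events.bz2", 8)

// Non-splittable codec (gzip): one partition per file regardless of the hint.
val gz = sc.textFile("/data/events.gz", 8)

println(bz.getNumPartitions + " vs " + gz.getNumPartitions)
```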
Gurus,
I understand that when we create an RDD in Spark it is immutable.
So I have a few points please:
- When an RDD is created, is that just a pointer? Most Spark operations are
lazy, so nothing is consumed until an action is performed on the RDD?
- When a DF is created from RDD does that
lusters and the type of processing
you are trying to do.
On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hi Gurus,
Within one Spark streaming process how many topics can be handled? I have not
tried more than one topic.
Thanks
orking with straight spark or referring to GraphX?
Thank You,
Irving Duran
On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hi,
I am a bit confused between Edge node, Edge server and gateway node in Spark.
Do these mean the same thing?
How does one set up an Edge node to be used in Spark? Is this different from
Edge node for Hadoop please?
Thanks
e-spark
Follow me at https://twitter.com/jaceklaskowski
On Sun, Feb 5, 2017 at 10:11 AM, Ashok Kumar
<ashok34...@yahoo.com.invalid> wrote:
Hello,
What are the practiced High Availability/DR operations for a Spark cluster at
the moment? I am especially interested in the case where YARN is used as the
resource manager.
Thanks
You are very kind Sir
On Sunday, 30 October 2016, 16:42, Devopam Mittra wrote:
+1
Thanks and regards
Devopam
On 30 Oct 2016 9:37 pm, "Mich Talebzadeh" wrote:
Enjoy the festive season.
Regards,
Dr Mich Talebzadeh LinkedIn
Can one design a fast pipeline with Kafka, Spark streaming and Hbase or
something similar?
On Friday, 30 September 2016, 17:17, Mich Talebzadeh
wrote:
I have designed this prototype for a risk business. Here I would like to
discuss issues with batch
Hi,
As a learner I would appreciate it if you could forward me typical Spark
interview questions for Spark/Scala junior roles.
I will be very much obliged
On 7 September 2016 at 11:39, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
Hi,
A bit confusing to me:
How many layers are involved in DStream.foreachRDD?
Do I need to loop over it more than once? I mean DStream.foreachRDD{ rdd => }
I am trying to get individual lines in RDD.
Thanks
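foreachRDD is the only layer needed: it hands over one RDD per batch interval, and individual lines are reached with ordinary RDD operations inside it. A minimal sketch (stream setup elided):

```
dstream.foreachRDD { rdd =>
  // One RDD per batch; no second loop over the DStream is required.
  rdd.foreach { line =>
    println(line) // each individual line in this batch
  }
  // For small batches only: rdd.collect().foreach(println)
}
```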
Any help on this warmly appreciated.
On Tuesday, 6 September 2016, 21:31, Ashok Kumar
<ashok34...@yahoo.com.INVALID> wrote:
Hello Gurus,
I am creating some figures and feeding them into Kafka and then Spark streaming.
It works OK but I have the following issue.
For now as a test I send 5 prices in each batch interval. In the loop code this
is what is happening:
dstream.foreachRDD { rdd => val x = rdd.count
i
rm in the array, convert it to your desired
data type and then use filter.
On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:
Hi, I want to filter them for values.
This is what is in array
74,20160905-133143,98.11218069128827594148
I want to filter anything > 50
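Following the advice above — split the line, convert the term to a numeric type, then filter — a sketch on the sample line (taking the third field as the price is an assumption from the question):

```scala
val line = "74,20160905-133143,98.11218069128827594148"

// Split, convert the price field to Double, then filter on the value.
val fields = line.split(",")
val kept = fields.drop(2).map(_.toDouble).filter(_ > 50)

println(kept.mkString(","))
```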
ter first map, you will get RDD of arrays. What
is your expected outcome of 2nd map?
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Thank you sir.
This is what I get
scala> textFile.map(x => x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]]
Basic error, you get back an RDD on transformations like map.
sc.textFile("filename").map(x => x.split(","))
On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote:
Hi,
I have a text file as below that I read in
74,20160905-
Hi,
What are the practical differences between the new Dataset in Spark 2 and the
existing DataFrame?
Has Dataset replaced DataFrame, and what advantages does it have if I use
Dataset instead of DataFrame?
Thanks
Hi,
There are design patterns that use Spark extensively. I am new to this area, so
I would appreciate it if someone could explain where Spark fits in, especially
within faster or streaming use cases.
What are the best practices involving Spark. Is it always best to deploy it for
processing engine,
For
Hi,
for small to medium size clusters I think Spark Standalone mode is a good
choice.
We are contemplating moving to Yarn as our cluster grows.
What are the pros and cons of using each, please? Which one offers the best
Thanking you
//talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from relying
on this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage or
destruction.
On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
Hi Gurus,
I have a few large tables in an RDBMS (ours is Oracle). We want to access these
tables through Spark JDBC.
What is the quickest way of getting data into a Spark DataFrame, say multiple
connections from Spark?
thanking you
Hi
I would like to know the exact definition for these three parameters
num-executors
executor-memory
executor-cores
for local, standalone and yarn modes
I have looked at the on-line doc but am not convinced I understand them correctly.
Thanking you
Hi,
in the following Window spec I want the orderBy column to be sorted in
descending order please
val W = Window.partitionBy("col1").orderBy("col2")
If I Do
val W = Window.partitionBy("col1").orderBy("col2".desc)
It throws this error:
<console>:26: error: value desc is not a member of String
How can I
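The error comes from calling .desc on a plain String; it is a method on Column, so wrapping the name with col() (or $ in the shell) should work — a sketch:

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// col("col2") returns a Column, which does have .desc.
val W = Window.partitionBy("col1").orderBy(col("col2").desc)
```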
Hi,
In Spark programing I can use
df.filter(col("transactiontype") ===
"DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total
Debit Card")).orderBy("transactiondate").show(5)
or
df.filter(col("transactiontype") ===
Thanks Mich looking forward to it :)
On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh
wrote:
Hi all,
This will be in London tomorrow Wednesday 20th July starting at 18:00 hour for
refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf Station,
Any expert advice warmly acknowledged..
thanking yo
On Monday, 11 July 2016, 17:24, Ashok Kumar <ashok34...@yahoo.com> wrote:
Hi Gurus,
Advice appreciated from Hive gurus.
My colleague has been using Cassandra. However, he says it is too slow and not
user friendly/MongoDB as
Hi Mich,
Your recent presentation in London on this topic "Running Spark on Hive or Hive
on Spark"
Have you made any more interesting findings that you like to bring up?
If Hive is offering both Spark and Tez in addition to MR, what is stopping one
from using Spark? I still don't get why TEZ + LLAP
age is that using Spark SQL will be much faster?
regards
On Friday, 8 July 2016, 6:30, ayan guha <guha.a...@gmail.com> wrote:
Yes, it can.
On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
thanks so basically Spark Thrift Server runs on a port mu
ou can connect to it from any
jdbc tool like squirrel
On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hello gurus,
We are storing data externally on Amazon S3
What is the optimum or best way to use Spark as SQL engine to access data on S3?
Any info/write up will be greatly appreciated.
Thanks.
Will this presentation be recorded as well?
Regards
On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh
wrote:
Dear forum members
I will be presenting on the topic of "Running Spark on Hive or Hive on Spark,
your mileage varies" in Future of Data: London
Hello gurus,
We are storing data externally on Amazon S3
What is the optimum or best way to use Spark as SQL engine to access data on S3?
Any info/write up will be greatly appreciated.
Regards
With Spark caching, which file format is best to use, Parquet or ORC?
Obviously ORC can be used with Hive.
My question is whether Spark can use the various file, stripe and rowset
statistics stored in an ORC file?
Otherwise to me both parquet and ORC are files simply kept on HDFS. They do not
offer any
Hi,
Looking at this presentation Hive on Spark is Blazing Fast ..
Which latest version of Spark can run as an engine for Hive please?
Thanks
P.S. I am aware of Hive on TEZ but that is not what I am interested here please
Warmest regards
Thank you all sirs
Appreciated Mich your clarification.
On Sunday, 19 June 2016, 19:31, Mich Talebzadeh
wrote:
Thanks Jonathan for your points
I am aware of the fact that yarn-client and yarn-cluster are both deprecated
(they still work in 1.6.1), hence the new
single JVM that has a master and one executor with `k` threads.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94
// maropu
On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hi,
I have been told Spark in Local mode is simplest for testing. The Spark
documentation covers little on local mode except the cores used in --master
local[k].
Where are the driver program, executor and resources? Do I need to start worker
threads, and how many apps can I run safely without
st Yarn mode or Mesos mode, which means spark uses
Yarn or Mesos as cluster managements.
Local mode is actually a standalone mode in which everything runs on a single
local machine instead of a remote cluster.
That is my understanding.
On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar <ashok34...@ya
11:38 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hi,
What is the difference between running Spark in Local mode and Standalone mode?
Are they the same? If not, which is best suited for non-prod work?
I am also aware that one can run Spark in Yarn mode as well.
Thanks
Can anyone help me with this please?
On Sunday, 5 June 2016, 11:06, Ashok Kumar <ashok34...@yahoo.com> wrote:
Hi all,
Appreciate any advice on this. It is about scala
I have created a very basic Utilities.scala that contains a test class and
method. I intend to add my own c
cala/TwitterAnalyzer/build.sbt#L18-19)
[warn] +- scala:scala_2.10:1.0
sbt.ResolveException: unresolved dependency:
com.databricks#apps.twitter_classifier;1.0.0: not found
Any ideas?
regards,
On Sunday, 5 June 2016, 22:22, Jacek Laskowski <ja...@japila.pl> wrote:
On Sun,
For #1, please find examples on the net, e.g.
http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html
For #2,
import . getCheckpointDirectory
Cheers
On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:
Thank you sir.
At compile time can I do something similar to
libr
PSHOT.jar' to pass
the jar.
Cheers
On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hi all,
Appreciate any advice on this. It is about scala
I have created a very basic Utilities.scala that contains a test class and
method. I intend to add my own classes and methods as I expand and make
references to these classes and methods in my other apps
class getCheckpointDirectory { def
hi all,
i know very little about the subject.
we would like to get streaming data from twitter and facebook.
so questions please may i
- what format is data from twitter. is it JSON format
- can i use spark and spark streaming for analyzing data
- can data be fed in/streamed via
Hi,
I can do inserts from Spark on Hive tables. How about updates or deletes? They
fail when I try them.
Thanking
Hello,
A newbie question.
Is it possible to use java code directly in spark shell without using maven to
build a jar file?
How can I switch from scala to java in spark shell?
Thanks
Hi Dr Mich,
This is very good news. I will be interested to know how Hive engages with
Spark as an engine. What Spark processes are used to make this work?
Thanking you
On Monday, 23 May 2016, 19:01, Mich Talebzadeh
wrote:
Have a look at this thread
Dr Mich
Hi,
I would like to know the approach and tools, please, to gauge the full
performance of a Spark app running through spark-shell and spark-submit:
- Through Spark GUI at 4040?
- Through OS utilities top, SAR
- Through Java tools like jbuilder etc
- Through integration Spark with
Hi,
How can one avoid having Spark spill over after filling the node's memory?
Thanks
Hi Dr Mich,
I will be very keen to have a look at it and review if possible.
Please forward me a copy
Thanking you warmly
On Thursday, 12 May 2016, 11:08, Mich Talebzadeh
wrote:
Hi All,
Following the threads in spark forum, I decided to write up on
, 10:49, Saisai Shao <sai.sai.s...@gmail.com> wrote:
Please see the inline comments.
On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
Thank you.
So If I create spark streaming then
- The streams will always need to be cached? It cannot be stored in
cache the data in memory, from my
understanding you don't need to call cache() again.
On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
hi,
so if i have 10gb of streaming data coming in does it require 10gb of memory in
each node?
also in that case why do w
hi,
so if i have 10gb of streaming data coming in does it require 10gb of memory in
each node?
also in that case why do we need to use
dstream.cache()
thanks
On Monday, 9 May 2016, 9:58, Saisai Shao wrote:
It depends on you to write the Spark application,
Thanks Michael as I gathered for now it is a feature.
On Monday, 25 April 2016, 18:36, Michael Armbrust
wrote:
When you define a class inside of a method, it implicitly has a pointer to the
outer scope of the method. Spark doesn't have access to this scope, so
Hi,
I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with
Spark.
Is there a shell similar to spark-shell that supports R besides Scala please?
Thanks
y mean replacing the whole of Hadoop? David
From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Thursday, April 14, 2016 2:13 PM
To: User
Subject: Spark replacing Hadoop
Hi, I hear that some are saying that Hadoop is getting old and out of date and
will be replaced by Spark!
Hi,
I hear that some are saying that Hadoop is getting old and out of date and will
be replaced by Spark!
Does this make sense and if so how accurate is it?
Best
On Spark GUI I can see the list of Workers.
I always understood that workers are used by executors.
What is the relationship between workers and executors please. Is it one to one?
Thanks
Hi,
Anyone has suggestions how to create and copy Hive and Spark tables from
Production to UAT.
One way would be to copy table data to external files and then move the
external files to a local target directory and populate the tables in target
Hive with data.
Is there an easier way of doing
Does simple streaming mean continuous streaming, and does windowed streaming
mean a time window?
val ssc = new StreamingContext(sparkConf, Seconds(10))
thanks
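Roughly: the Seconds(10) batch interval fixes how often a new micro-batch is produced (the "continuous" part), while windowing is a separate operation layered on top of those batches. A sketch (the source is hypothetical):

```
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(10)) // new batch every 10s
val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical source

// Windowed view: every 10s, operate over the last 30s of batches.
val windowed = lines.window(Seconds(30), Seconds(10))
```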
Hi
I would like a simple sqrt operation on a list but I don't get the result
scala> val l = List(1,5,786,25)
l: List[Int] = List(1, 5, 786, 25)
scala> l.map(x => x * x)
res42: List[Int] = List(1, 25, 617796, 625)
scala> l.map(x => x * x).sqrt
<console>:28: error: value sqrt is not a member of List[Int]
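sqrt lives on scala.math and applies to a single Double, so it has to go inside map rather than on the List itself — a sketch:

```scala
val l = List(1, 5, 786, 25)

// Square, then take the root element-by-element inside map;
// Int is converted to Double implicitly for math.sqrt.
val roots = l.map(x => math.sqrt(x * x))

println(roots) // List(1.0, 5.0, 786.0, 25.0)
```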
Hello,
How feasible is it to use Spark to extract csv files and create and write the
content to an ORC table in a Hive database?
Is Parquet file the best (optimum) format to write to HDFS from Spark app.
Thanks
016 at 22:07, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
Experts,
One of the terms I hear used is N-tier architecture within Big Data, used for
availability, performance etc. I also hear that Spark, by means of its query
engine and in-memory caching, fits into the middle tier (application layer) with
HDFS and Hive maybe providing the data tier. Can
Hello Mich
If you can accommodate this, can you please share your approach to steps 1-3 above.
Best regards
On Sunday, 27 March 2016, 14:53, Mich Talebzadeh
wrote:
Pretty simple as usual it is a combination of ETL and ELT.
Basically csv files are loaded into staging
http://talebzadehmich.wordpress.com
On 25 March 2016 at 22:12, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
Experts,
I would like to know when a table was created in Hive database using Spark
shell?
Thanks
umn expressions. The reason is that for maps, we have to
>actually materialize an object to pass to your function. However, if you
>stick to column expressions we can actually work directly on serialized data.
On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar <ashok34...@yahoo.com> w
.(firstcolumn) in above when mapping if
possible so columns will have labels
On Thursday, 24 March 2016, 0:18, Michael Armbrust <mich...@databricks.com>
wrote:
You probably need to use `backticks` to escape `_1` since I don't think that
it's a valid SQL identifier.
On Wed, Mar
Gurus,
If I register a temporary table as below
r.toDF
res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double,
_4: double, _5: double]
r.toDF.registerTempTable("items")
sql("select * from items")
res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4:
Gurus,
I would like to read a csv file into a DataFrame but be able to rename a
column, change a column type from String to Integer, or drop a column from
further analysis before saving the data as a parquet file.
Thanks
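All three steps are ordinary DataFrame transformations between the read and the write; a sketch using the spark-csv package mentioned elsewhere in this thread (paths and column names are hypothetical):

```
import org.apache.spark.sql.functions.col

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/input.csv")                         // hypothetical path

val cleaned = df
  .withColumnRenamed("C0", "id")                   // rename a column
  .withColumn("amount", col("amount").cast("int")) // String -> Integer
  .drop("unused")                                  // drop from further analysis

cleaned.write.parquet("/data/output.parquet")
```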
Experts.
Please give me your valued advice.
I have spark 1.5.2 set up as standalone for now and I have started the master
as below
start-master.sh
I have also modified the conf/slaves file to have
# A Spark Worker will be started on each of the machines listed below.
localhost
workerhost
On the localhost
experts,
please I need to understand how shuffling works in Spark and which parameters
influence it.
I am sorry but my knowledge of shuffling is very limited. Need a practical use
case if you can.
regards
Hi,
We intend to use 5 servers which will be utilized for building a Bigdata Hadoop
data warehouse system (not using any proprietary distribution like Hortonworks
or Cloudera or others). All server configurations are 512GB RAM, 30TB storage
and 16 cores, Ubuntu Linux servers. Hadoop will be
Hi Gurus,
I am relatively new to Big Data and know something about Spark and Hive.
I was wondering do I need to pick up skills on Hbase as well. I am not sure how
it works but know that it is kind of columnar NoSQL database.
I know it is good to know something new in Big Data space. Just wondering if
On Tuesday, 1 March 2016, 20:52, Shixiong(Ryan) Zhu
<shixi...@databricks.com> wrote:
For Array, you need to call `toSeq` first. Scala can convert Array to
ArrayOps automatically; however, that is not a `Seq`, so you need to call
`toSeq` explicitly.
On Tue, Mar 1, 2016 at 1:02 AM, Ashok
Hi,
I have this
val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4),
("g", 6))
weights.toDF("weights","value")
I want to convert the Array to a DF but I get this:
weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4),
(g,6))
<console>:33: error: value
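Per the reply above, toDF comes from the SQL implicits for Seq, so converting the Array with toSeq (and importing the implicits outside the shell) is the usual fix — a sketch:

```
import sqlContext.implicits._

val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1),
                    ("e", 9), ("f", 4), ("g", 6))

// Array -> Seq first; toSeq is what the implicit toDF conversion expects here.
val df = weights.toSeq.toDF("weights", "value")
df.show()
```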
Thank you all for valuable advice. Much appreciated
Best
On Sunday, 28 February 2016, 21:48, Ashok Kumar <ashok34...@yahoo.com>
wrote:
Hi Gurus,
I would appreciate it if you could recommend me a good book on Spark, or
documentation, for beginner to moderate knowledge.
I very much would like to skill myself on transformation and action methods.
FYI, I have already looked at examples on the net. However, some of them are
not clear, at least to me.
Warmest
no particular reason. just wanted to know if there was another way as well.
thanks
On Saturday, 27 February 2016, 22:12, Yin Yang <yy201...@gmail.com> wrote:
Is there particular reason you cannot use temporary table ?
Thanks
On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar &l
;a",
"b").registerTempTable("test")
scala> val df = sql("SELECT struct(id, b, a) from test order by b")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct]
scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|       [2,foo,a]|
Hello,
I would like to be able to solve this using arrays.
I have an array of (String, Int) pairs with 5 entries, say arr("A",20),
arr("B",13), arr("C",18), arr("D",10), arr("E",19).
I would like to write a small piece of code to order these by the highest Int
column, so I will have arr("A",20),
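sortBy on the negated Int column (or sortWith) puts the highest value first — a sketch:

```scala
val arr = Array(("A", 20), ("B", 13), ("C", 18), ("D", 10), ("E", 19))

// Descending order on the Int column: negate the sort key.
val ordered = arr.sortBy(x => -x._2)

println(ordered.mkString(", ")) // (A,20), (E,19), (C,18), (B,13), (D,10)
```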
Hi,
Spark doco says
Spark’s primary abstraction is a distributed collection of items called a
Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop
InputFormats (such as HDFS files) or by transforming other RDDs
example:
val textFile = sc.textFile("README.md")
my question is
Hi,
How can I make the following work?
val d = HiveContext.table("table")
select * from table where ID = MAX(ID) from table
Thanks
Hi,
What is the equivalent of this in Spark please
select * from mytable where column1 in (select max(column1) from mytable)
Thanks
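Spark SQL of that vintage did not accept an IN-subquery, but computing the max first and filtering with it gives the same rows — a hedged sketch (table and column names taken from the question):

```
import org.apache.spark.sql.functions.max

val t = sqlContext.table("mytable")
val maxVal = t.agg(max("column1")).first().get(0) // the scalar max
val result = t.filter(t("column1") === maxVal)    // rows holding that max
result.show()
```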
Hi,
I would like to do the following
select count(*) from where column1 in (1,5)
I define
scala> var t = HiveContext.table("table")
This works: t.filter($"column1" === 1)
How can I expand this to cover column1 for both 1 and 5 please?
thanks
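The Column API has isin for exactly this, so the single-value filter extends directly — a sketch (assuming the shell's $ implicits):

```
// Extending t.filter($"column1" === 1) to match either value:
val matches = t.filter($"column1".isin(1, 5))
println(matches.count()) // the select count(*) ... in (1,5) equivalent
```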
ath '/tmp/partitioned' )""")
table("partitionedParquet").explain(true)
On Wed, Feb 24, 2016 at 1:16 AM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Gurus,
Is there anything like explain in Spark to see the execution plan in functional
programming?
warm regards
--jars syntax. You might find
http://spark.apache.org/docs/latest/submitting-applications.html useful.
On Fri, Feb 19, 2016 at 7:26 AM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:
Hi,
I downloaded the zipped csv libraries from databricks/spark-csv
Now I have a directory created called
Hi,
class body thanks
On Friday, 19 February 2016, 11:23, Ted Yu <yuzhih...@gmail.com> wrote:
Can you clarify your question ?
Did you mean the body of your class ?
On Feb 19, 2016, at 4:43 AM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
Hi,
If I define a class in Sca
Hi,
If I define a class in Scala like
case class(col1: String, col2:Int,...)
once it is created, how would I be able to see its description at any time?
Thanks
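A case class describes itself: the compiler generates a readable toString, and the Product interface lists the field values; in the shell, :type also shows the signature. A small sketch with a hypothetical class:

```scala
// Hypothetical case class standing in for the one in the question.
case class Account(col1: String, col2: Int)

val a = Account("savings", 42)

println(a)                                // Account(savings,42)
println(a.productArity)                   // 2 fields
println(a.productIterator.mkString(", ")) // savings, 42
```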
Gurus,
What are the main differences between a Resilient Distributed Dataset (RDD) and
a Data Frame (DF)?
Where can one use an RDD without transforming it to a DF?
Regards and obliged
Gurus,
I am trying to run some examples given under directory examples
spark/examples/src/main/scala/org/apache/spark/examples/
I am trying to run HdfsTest.scala
However, when I run HdfsTest.scala against the spark shell it comes back with an error
Spark context available as sc.
SQL context available