Hi,
I don't have a good understanding of how RDD lineage works, so I would ask whether
Spark provides a unit test in the code base that illustrates how RDD lineage works.
If there is one, what's the class name?
Thanks!
bit1...@163.com
Thanks TD and Zhihong for the guide. I will check it
bit1...@163.com
From: Tathagata Das
Date: 2015-07-31 12:27
To: Ted Yu
CC: bit1...@163.com; user
Subject: Re: How RDD lineage works
You have to read the original Spark paper to understand how RDD lineage works.
https://www.cs.berkeley.edu
that partition. Thus, lost data can be recovered, often quite quickly,
without requiring costly replication.
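As an aside to TD's pointer: you can inspect an RDD's lineage directly with RDD.toDebugString. A minimal sketch (the object name and local[2] master are my own choices, not from this thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LineageDemo").setMaster("local[2]"))
    // Each transformation records its parent, forming the lineage graph
    val doubled  = sc.parallelize(1 to 10).map(_ * 2)
    val filtered = doubled.filter(_ > 5)
    // Prints the chain of parent RDDs used to recompute lost partitions
    println(filtered.toDebugString)
    sc.stop()
  }
}
```

If a partition of `filtered` is lost, Spark replays just the map and filter over that partition's input, which is why costly replication isn't required.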
bit1...@163.com
From: bit1...@163.com
Date: 2015-07-31 13:11
To: Tathagata Das; yuzhihong
CC: user
Subject: Re: Re: How RDD lineage works
Thanks TD and Zhihong for the guide. I
    val heavyOpRDD = rdd.map(squareWithHeavyOp)
    heavyOpRDD.checkpoint()
    heavyOpRDD.foreach(println)
    println("Job 0 has been finished, press ENTER to do job 1")
    readLine()
    heavyOpRDD.foreach(println)
  }
}
bit1...@163.com
with fewer cores, but I didn't get a chance to
try/test it.
Thanks.
bit1...@163.com
Thanks Shixiong for the reply.
Yes, I confirm that the file exists there; I simply checked with ls -l:
/data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar
bit1...@163.com
From: Shixiong Zhu
Date: 2015-07-06 18:41
To: bit1...@163.com
CC: user
Subject: Re
(DriverRunner.scala:72)
bit1...@163.com
and the received records are many more than the
processed records. I can't understand why the total delay or scheduling delay is
not obvious (5 secs) here.
Can someone help explain what clues can be drawn from this UI?
Thanks.
bit1...@163.com
I am kind of confused about when a cached RDD will unpersist its data. I know we
can explicitly unpersist it with RDD.unpersist, but can it be unpersisted
automatically by the Spark framework?
Thanks.
bit1...@163.com
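Following up on my own question with my current understanding (please correct me if wrong): besides the explicit call, cached blocks can also be dropped automatically. The block manager evicts cached partitions in roughly LRU order when storage memory runs low, and the ContextCleaner unpersists RDDs whose driver-side references have been garbage-collected. A sketch of the explicit path (object name is my own):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("UnpersistDemo").setMaster("local[2]"))
    val cached = sc.parallelize(1 to 1000)
      .map(_ * 2)
      .persist(StorageLevel.MEMORY_ONLY)
    println(cached.count())            // first action materializes the cache
    // Explicit release; without this, eviction is LRU when memory is tight,
    // or the ContextCleaner drops it once the RDD is unreachable on the driver
    cached.unpersist(blocking = true)
    sc.stop()
  }
}
```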
Hi,
I am using spark1.3.1, and have 2 receivers,
On the web UI, I can only see the total records received by these 2
receivers combined, but I can't figure out the records received by each individual receiver.
Not sure whether the information is shown on the UI in spark1.4.
bit1...@163.com
it
bit1...@163.com
Hi, Akhil,
Thank you for the explanation!
bit1...@163.com
From: Akhil Das
Date: 2015-06-23 16:29
To: bit1...@163.com
CC: user
Subject: Re: What does [Stage 0: (0 + 2) / 2] mean on the console
Well, you could say that the stage information is an ASCII representation of the
WebUI (running on port
<spark.scope>compile</spark.scope>
  </properties>
</profile>
<profile>
  <id>ClusterRun</id>
  <properties>
    <spark.scope>provided</spark.scope>
  </properties>
</profile>
bit1...@163.com
From: prajod.vettiyat...@wipro.com
Date: 2015-06-19 15:22
To: bit1...@163.com; ak...@sigmoidanalytics.com
CC: user@spark.apache.org
Subject
Sure, thanks Prajod for the detailed steps!
bit1...@163.com
From: prajod.vettiyat...@wipro.com
Date: 2015-06-19 16:56
To: bit1...@163.com; ak...@sigmoidanalytics.com
CC: user@spark.apache.org
Subject: RE: RE: Build spark application into uber jar
Multiple maven profiles may be the ideal way
, then it will be at most once semantics?
bit1...@163.com
From: Haopu Wang
Date: 2015-06-19 18:47
To: Enno Shioji; Tathagata Das
CC: prajod.vettiyat...@wipro.com; Cody Koeninger; bit1...@163.com; Jordan
Pilat; Will Briggs; Ashish Soni; ayan guha; user@spark.apache.org; Sateesh
Kavuri; Spark Enthusiast
Thank you for the reply.
Running the application locally means that I run the application in my IDE with
the master set to local[*].
When the Spark stuff is marked as provided, I can't run it because the Spark
classes are missing.
So, how do you work around this? Thanks!
bit1...@163.com
From
. From the user end, since tasks may process already-processed data, the user end
should detect that some data has already been processed, e.g., by
using some unique ID.
Not sure if I have understood correctly.
bit1...@163.com
From: prajod.vettiyat...@wipro.com
Date: 2015-06-18 16:56
To: jrpi
!
bit1...@163.com
Could someone help explain what happens that leads to the Task not serializable
issue?
Thanks.
bit1...@163.com
From: bit1...@163.com
Date: 2015-06-08 19:08
To: user
Subject: Wired Problem: Task not serializable[Spark Streaming]
Hi,
With the following simple code, I got an exception
.
BTW, BlockManagerMaster is there; it makes no sense that BlockManagerWorker is
gone.
bit1...@163.com
, in my opinion it
should be about 600M * 2. It looks like some compression happens under the hood, or
something else?
Thanks!
bit1...@163.com
Hi,
I am looking for some articles/blogs on the topic of how Spark handles
various failures, such as Driver, Worker, Executor, Task, etc.
Are there some articles/blogs on this topic? Details down to the source code would be
the best.
Thanks very much!
bit1...@163.com
good response times, without waiting for the
long job to finish. This mode is best for multi-user settings
bit1...@163.com
Can someone help take a look at my questions? Thanks.
bit1...@163.com
From: bit1...@163.com
Date: 2015-05-29 18:57
To: user
Subject: How Broadcast variable works
Hi,
I have a Spark Streaming application. SparkContext uses broadcast variables to
broadcast configuration information that each
Can someone please help me on this?
bit1...@163.com
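While waiting for an answer, here is my understanding as a sketch: a broadcast variable is created once on the driver, shipped to each executor at most once, and every task then reads the same executor-local copy via .value. The configuration map below is made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("BroadcastDemo").setMaster("local[2]"))
    // Hypothetical configuration shared by all tasks
    val config = sc.broadcast(Map("threshold" -> 10))
    val aboveThreshold = sc.parallelize(1 to 20)
      .filter(_ > config.value("threshold"))  // tasks read the local copy
      .count()
    println(aboveThreshold)
    sc.stop()
  }
}
```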
From: bit1...@163.com
Date: 2015-05-24 13:53
To: user
Subject: How to use zookeeper in Spark Streaming
Hi,
In my Spark Streaming application, when the application starts and gets running,
the tasks running on the worker nodes need
Correct myself:
For SparkContext#wholeTextFiles, the RDD's elements are key-value pairs: the key is
the file path, and the value is the file content.
So, for SparkContext#wholeTextFiles, the RDD already carries the file
information.
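To illustrate the point (the path below is hypothetical): wholeTextFiles keeps the origin of each record in the key, whereas textFile yields bare lines with no file information attached.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("WholeTextFilesDemo").setMaster("local[2]"))
    // Each element is a (filePath, fileContent) pair, so the origin of
    // every record travels with the data
    val files = sc.wholeTextFiles("hdfs:///data/logs")  // hypothetical path
    files.map { case (path, content) => (path, content.length) }
      .collect()
      .foreach { case (path, len) => println(path + ": " + len + " chars") }
    sc.stop()
  }
}
```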
bit1...@163.com
From: Saisai Shao
Date: 2015-04-29 15:50
Thanks Sandy, it is very useful!
bit1...@163.com
From: Sandy Ryza
Date: 2015-04-29 15:24
To: bit1...@163.com
CC: user
Subject: Re: Question about Memory Used and VCores Used
Hi,
Good question. The extra memory comes from spark.yarn.executor.memoryOverhead,
the space used
think
the memory used should be executor-memory * numOfWorkers = 3G * 3 = 9G, and the VCores
used should be executor-cores * numOfWorkers = 6.
Can you please explain the result? Thanks.
bit1...@163.com
It looks to me that the same thing also applies to SparkContext.textFile or
SparkContext.wholeTextFiles: there is no way in the RDD to figure out the file
information where the data in the RDD is from.
bit1...@163.com
From: Saisai Shao
Date: 2015-04-29 10:10
To: lokeshkumar
CC: spark users
Subject
For SparkContext#textFile, if a directory is given as the path parameter,
then it will pick up the files in the directory, so the same thing will occur.
bit1...@163.com
From: Saisai Shao
Date: 2015-04-29 10:54
To: Vadim Bichutskiy
CC: bit1...@163.com; lokeshkumar; user
Subject: Re: Re
Hi,
I am frequently asked why Spark is also much faster than Hadoop MapReduce on
disk (without the use of a memory cache). I have no convincing answer for this
question; could you guys elaborate on it? Thanks!
Is it? I learned somewhere else that Spark's speed is 5~10 times faster than
Hadoop MapReduce.
bit1...@163.com
From: Ilya Ganelin
Date: 2015-04-28 10:55
To: bit1...@163.com; user
Subject: Re: Why Spark is much faster than Hadoop MapReduce even on disk
I believe the typical answer
Looks like the message is consumed by another console? (I can see messages typed
on this port from another console.)
bit1...@163.com
From: Shushant Arora
Date: 2015-04-15 17:11
To: Akhil Das
CC: user@spark.apache.org
Subject: Re: spark streaming printing no output
When I launched spark-shell
Thanks Tathagata for the explanation!
bit1...@163.com
From: Tathagata Das
Date: 2015-04-04 01:28
To: Ted Yu
CC: bit1129; user
Subject: Re: About Waiting batches on the spark streaming UI
Maybe that should be marked as waiting as well. Will keep that in mind. We plan
to update the ui soon, so
: 23
Waiting batches: 1
Received records: 0
Processed records: 0
bit1...@163.com
Please make sure that you have given more cores than the number of receivers.
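The reason, as I understand it: each receiver permanently occupies one task slot, so with N receivers you need at least N+1 cores, or batches will be received but never processed. A sketch under that assumption (app name is made up):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverCoresDemo {
  def main(args: Array[String]): Unit = {
    // One receiver -> at least 2 local threads: one thread runs the
    // receiver, the other processes the received batches.
    val conf = new SparkConf().setAppName("KafkaDemo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))
    // With local[1], the single thread would be pinned to the receiver
    // and no batch would ever be processed.
  }
}
```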
From: James King
Date: 2015-04-01 15:21
To: user
Subject: Spark + Kafka
I have a simple setup/runtime of Kafka and Spark.
I have a command line consumer displaying arrivals to a Kafka topic, so I know
messages are being
Thanks Cheng for the great explanation!
bit1...@163.com
From: Cheng Lian
Date: 2015-03-16 00:53
To: bit1...@163.com; Wang, Daoyuan; user
Subject: Re: Explanation on the Hive in the Spark assembly
Spark SQL supports most commonly used features of HiveQL. However, different
HiveQL statements
Thanks Daoyuan.
What do you mean by running some native command? I never thought that Hive could
run without a computing engine like Hadoop MR or Spark. Thanks.
bit1...@163.com
From: Wang, Daoyuan
Date: 2015-03-13 16:39
To: bit1...@163.com; user
Subject: RE: Explanation on the Hive
for
the application.
My question is: assume that the data the application will process is spread across
all the worker nodes; then the data locality is lost if using the above policy?
Not sure whether I have understood correctly or have missed something.
bit1...@163.com
and Hive on
Hadoop?
2. Does Hive in the spark assembly use Spark execution engine or Hadoop MR
engine?
Thanks.
bit1...@163.com
Can anyone have a look on this question? Thanks.
bit1...@163.com
From: bit1...@163.com
Date: 2015-03-13 16:24
To: user
Subject: Explanation on the Hive in the Spark assembly
Hi, sparkers,
I am kind of confused about hive in the spark assembly. I think hive in the
spark assembly
Hi ,
I know that Spark on YARN has a configuration parameter (executor-cores NUM) to
specify the number of cores per executor.
How about Spark standalone? I can specify the total cores, but how could I know
how many cores each executor will take (presuming one node, one executor)?
bit1...@163.com
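For what it's worth, my understanding of standalone mode in 1.x (please correct me): an application grabs cores up to spark.cores.max, and by default one executor is launched per worker, taking as many of that worker's free cores as the cap allows. A hedged config sketch (the master URL is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneCoresDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StandaloneCoresDemo")
      .setMaster("spark://master:7077")   // hypothetical standalone master
      .set("spark.cores.max", "6")        // cap on total cores for the app
    val sc = new SparkContext(conf)
    // With one executor per worker (the standalone default), each executor
    // takes min(remaining spark.cores.max, free cores on its worker).
    sc.stop()
  }
}
```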
Sure, Thanks Tathagata!
bit1...@163.com
From: Tathagata Das
Date: 2015-02-26 14:47
To: bit1...@163.com
CC: Akhil Das; user
Subject: Re: Re: Many Receiver vs. Many threads per Receiver
Spark Streaming has a new Kafka direct stream, to be released as an experimental
feature with 1.3. That uses
Thanks Akhil.
Not sure whether the low-level consumer will be officially supported by Spark
Streaming. So far, I don't see it mentioned/documented in the Spark Streaming
programming guide.
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 16:21
To: bit1...@163.com
CC: user
Subject: Re: Many
( _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
)
// repartition to 18, 3 times the number of receivers
val partitions = ssc.union(streams).repartition(18).map("DataReceived: " + _)
partitions.print()
ssc.start()
ssc.awaitTermination()
}
}
bit1...@163.com
Thanks both of you guys on this!
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 12:58
To: Tathagata Das
CC: user; bit1129
Subject: Re: About FlumeUtils.createStream
I see, thanks for the clarification TD.
On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote:
Akhil
The behavior is exactly what I expected. Thanks Akhil and Tathagata!
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 13:32
To: bit1129
CC: Tathagata Das; user
Subject: Re: Re: About FlumeUtils.createStream
That depends on how many machines you have in your cluster. Say you have 6
workers
will stay on one cluster node, or will they be distributed
among the cluster nodes?
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 12:58
To: Tathagata Das
CC: user; bit1129
Subject: Re: About FlumeUtils.createStream
I see, thanks for the clarification TD.
On 24 Feb 2015 09:56, Tathagata Das t
)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 32 more
bit1...@163.com
From: Ted Yu
Date: 2015-02-24 10:24
To: bit1...@163.com
CC: user
Subject: Re: Does Spark Streaming depend on Hadoop?
Can you pastebin the whole stack trace ?
Thanks
On Feb 23, 2015, at 6:14 PM, bit1...@163.com
main java.net.ConnectException: Call From
hadoop.master/192.168.26.137 to hadoop.master:9000 failed on connection
exception.
From the exception, it tries to connect to port 9000, which is for Hadoop/HDFS, and I
don't use Hadoop at all in my code (such as saving to HDFS).
bit1...@163.com
Thanks Tathagata! You are right, I have packaged the contents of the Spark-shipped
example jar into my jar, which contains several HDFS configuration
files like hdfs-default.xml etc. Thanks!
bit1...@163.com
From: Tathagata Das
Date: 2015-02-24 12:04
To: bit1...@163.com
CC: yuzhihong
on this. Thank
bit1...@163.com
Thanks Akhil.
From: Akhil Das
Date: 2015-02-20 16:29
To: bit1...@163.com
CC: user
Subject: Re: Re: Spark streaming doesn't print output when working with
standalone master
local[3] spawns 3 threads on 1 core :)
Thanks
Best Regards
On Fri, Feb 20, 2015 at 12:50 PM, bit1...@163.com bit1
Hi,
In the Spark Streaming application, I write the code
FlumeUtils.createStream(ssc, "localhost", ...), which means Spark will listen on
the port and wait for the Flume sink to write to it.
My question is: when I submit the application to the Spark Standalone cluster,
will be opened only
only be allocated one processor.
This leads me to another question:
although I have only one core, I have specified the master and executor as
--master local[3] --executor-memory 512M --total-executor-cores 3. Since I have
only one core, why does this work?
bit1...@163.com
From: Akhil
Hi,
I am trying the spark streaming log analysis reference application provided by
Databricks at
https://github.com/databricks/reference-apps/tree/master/logs_analyzer
When I deploy the code to the standalone cluster, there is no output at all
with the following shell script. Which means, the
I am using Spark 1.2.0 (prebuilt with Hadoop 2.4) on Windows 7.
I found the same bug here: https://issues.apache.org/jira/browse/SPARK-4208, but it
is still open. Is there a workaround for this? Thanks!
The stack trace:
A StackOverflowError occurs:
Exception in thread "main"
But I am able to run the SparkPi example:
./run-example SparkPi 1000 --master spark://192.168.26.131:7077
Result:Pi is roughly 3.14173708
bit1...@163.com
From: bit1...@163.com
Date: 2015-02-18 16:29
To: user
Subject: Problem with 1 master + 2 slaves cluster
Hi sparkers,
I setup a spark(1.2.1
Sure, thanks Akhil.
A further question: is the local file system (file:///) not supported in a standalone
cluster?
bit1...@163.com
From: Akhil Das
Date: 2015-02-18 17:35
To: bit1...@163.com
CC: user
Subject: Re: Problem with 1 master + 2 slaves cluster
Since the cluster is standalone, you
Hi Arush,
With your code, I still didn't see the output "Received X flume events".
bit1...@163.com
From: bit1...@163.com
Date: 2015-02-17 14:08
To: Arush Kharbanda
CC: user
Subject: Re: Re: Question about spark streaming+Flume
Ok, you are missing a letter in foreachRDD.. let me proceed
Hi,
I am trying Spark Streaming + Flume example:
1. Code
object SparkFlumeNGExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkFlumeNGExample")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines =
Ok, you are missing a letter in foreachRDD.. let me proceed..
bit1...@163.com
From: Arush Kharbanda
Date: 2015-02-17 14:31
To: bit1...@163.com
CC: user
Subject: Re: Question about spark streaming+Flume
Hi
Can you try this
val lines = FlumeUtils.createStream(ssc, "localhost", ...)
// Print
14:31
To: bit1...@163.com
CC: user
Subject: Re: Question about spark streaming+Flume
Hi
Can you try this
val lines = FlumeUtils.createStream(ssc, "localhost", ...)
// Print out the count of events received from this server in each batch
lines.count().map(cnt => "Received " + cnt + " flume events"
You can use prebuilt version that is built upon hadoop2.4.
From: Siddharth Ubale
Date: 2015-01-30 15:50
To: user@spark.apache.org
Subject: Hi: hadoop 2.5 for spark
Hi ,
I am beginner with Apache spark.
Can anyone let me know if it is mandatory to build spark with the Hadoop
version I am
I have also thought that the Hadoop mapper output is saved on HDFS, at least
when the job only has a Mapper and doesn't have a Reducer.
If there is a reducer, then will the map output be saved on local disk?
From: Shao, Saisai
Date: 2015-01-26 15:23
To: Larry Liu
CC:
When I run the following Spark SQL example within IDEA, I got a
StackOverflowError; it looks like the scala.util.parsing.combinator.Parsers are
calling recursively and infinitely.
Anyone encounters this?
package spark.examples
import org.apache.spark.{SparkContext, SparkConf}
import
Hi,
When I fetch the Spark code base and import it into IntelliJ IDEA as an SBT project,
then build it with SBT, there are compile errors in the examples
module complaining about EventBatch and SparkFlumeProtocol; it looks like they should
be in the org.apache.spark.streaming.flume.sink package.
Not
Thanks Eric. Yes..I am Chinese, :-). I will read through the articles, thank
you!
bit1...@163.com
From: eric wong
Date: 2015-01-07 10:46
To: bit1...@163.com
CC: user
Subject: Re: Re: I think I am almost lost in the internals of Spark
A good beginning if you are Chinese.
https://github.com
The error hints that the Maven module scala-compiler can't be fetched from
repo1.maven.org. Should some repository URLs be added to Maven's settings
file?
bit1...@163.com
From: Manoj Kumar
Date: 2015-01-03 18:46
To: user
Subject: Unable to build spark from source
Hello,
I tried
This is noise, please ignore.
I figured out what happened...
bit1...@163.com
From: bit1...@163.com
Date: 2015-01-03 19:03
To: user
Subject: sqlContext is undefined in the Spark Shell
Hi,
In the spark shell, I do the following two things:
1. scala val cxt = new
._
Is there something missing? I am using Spark 1.2.0.
Thanks.
bit1...@163.com