I confirm that is indeed the case. It is designed to be so because it
keeps things simpler - fewer chances of issues related to cleanup when
stop() is called. It also keeps things consistent with the SparkContext -
once a SparkContext is stopped it cannot be used any more.
You can create a new
To clarify, you are looking for eigenvectors of what, the covariance
matrix? So for example you are looking for the sqrt of the eigenvalues when
you talk about stdev of components?
Looking at
Have a look at the code maybe?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
Yes there is a smoothing parameter, and yes from the looks of it it is
simply additive / Laplace smoothing. It's been in there for a while.
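For illustration, a minimal sketch (the training data is a placeholder) of passing that smoothing parameter; lambda = 1.0 corresponds to classic Laplace smoothing:
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
// trainingData is assumed to be an RDD[LabeledPoint] prepared elsewhere
def trainWithSmoothing(trainingData: RDD[LabeledPoint]) =
  NaiveBayes.train(trainingData, lambda = 1.0) // additive (Laplace) smoothing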
On
Can someone please run the standard kMeans code on this input with 2 centers?
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
The obvious result should be (2,2) and (5,2) ... (you can draw them if you
don't believe me ...)
Thanks,
Wanda
I ran it, and your answer is exactly what I got.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._
val vectors =
  sc.parallelize(Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
    .map(p => Vectors.dense(Array[Double](p._1, p._2))))
val
Not sure that was what I wanted. I tried to run Spark Shell on a machine other
than the master and got the same error. The 192 was supposed to be a
simple shell script change that alters SPARK_HOME before submitting jobs.
Too bad it wasn't there anymore.
The build described in the pull request
So this is what I am running:
./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001
And this is the input file:
┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
This is the final output from spark:
14/07/10
Do you see any errors in the logs of the driver?
On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang hw...@qilinsoft.com wrote:
I'm running an app for hours in a standalone cluster. From the data
injector and the Streaming tab of the web UI, it's running well.
However, I see quite a lot of Active stages in
Hi
I get exactly the same problem here. Have you found the problem?
Thanks
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
your dataset as well, and I got the expected answer. I believe that even
though initialization is done using sampling, the example actually sets the
seed to a constant 42, so the result should always be the same no matter
how
Hi
I am trying out a simple piece of code by writing my own JavaNetworkCount
app to test out Spark Streaming.
So here are the two sets of code.
// #1
JavaReceiverInputDStream<String> lines = sctx.socketTextStream("127.0.0.1",
);
// #2
JavaReceiverInputDStream<String> lines =
Hi all,
Has anybody tried to run scrapy on a cluster? If yes, I would appreciate
hearing about the general approach that was taken (multiple spiders? single
spider? how to distribute URLs across nodes? etc.). I would also be
interested in hearing about any experience running a different scraper
History Server is also very helpful.
On Thu, Jul 10, 2014 at 7:37 AM, Haopu Wang hw...@qilinsoft.com wrote:
I didn't keep the driver's log. It's a lesson.
I will try to run it again to see if it happens again.
--
*From:* Tathagata Das
Ya thanks. I can see that lambda is used as the parameter.
On Thu, Jul 10, 2014 at 1:40 PM, Sean Owen so...@cloudera.com wrote:
Have a look at the code maybe?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
Yes there
And also, there is a small bug in the implementation, as I mentioned
earlier.
This is the first time I am reporting a bug, so I just wanted to ask:
does your name appear somewhere, or do you get credit for
correcting/reporting a bug? So that I can mention that in my
Hi all,
I want to know how collect() works, and how it is different from take(). I am
just reading a file of 330 MB which has 43 lakh (4.3 million) rows with 13 columns and calling
take(430) to save to a variable. But the same is not working with
collect(). So is there any difference in the operation of
Hi,
any suggestions on how to implement an OR clause and an IN clause in the SparkSQL
language-integrated queries?
For example:
'SELECT name FROM people WHERE age = 10 AND month = 2' can be written as
val teenagers = people.where('age === 10).where('month === 2).select('name)
How do we write
Hi Prem,
You can write like:
people.where('age === 10 && 'month === 2).select('name)
people.where('age === 10 || 'month === 2).select('name)
people.where(In('month, Seq(2, 6))).select('name)
The last one needs to import catalyst.expressions package.
Thanks.
2014-07-10 22:15 GMT+09:00
Okie doke. Thanks for the confirmation, Burak and Tathagata.
On Thu, Jul 10, 2014 at 2:23 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I confirm that is indeed the case. It is designed to be so because it
keeps things simpler - fewer chances of issues related to cleanup when
stop()
One thing to keep in mind is that the progress bar doesn't take into
account tasks which are rerun. If you see 4/4 but the stage is still
active, click the stage name and look at the task list. That will show you
if any are actually running. When rerun tasks complete, it can result in
the number
Thanks Takuya, works like a charm.
Tried the newest branch, but still get stuck on the same task: (kill) runJob
at SlidingRDD.scala:74
A patch proposal on the Apache JIRA for Spark?
https://issues.apache.org/jira/browse/SPARK/
Bertrand
On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:
And also, there is a small bug in the implementation, as I mentioned
earlier.
This is my first
I am also running into the modules/mod_authn_alias.so issue on r3.8xlarge when
launching a cluster with ./spark-ec2, so Ganglia is not accessible. From the
posts it seems that Patrick suggested using Ubuntu 12.04. Can you please
provide the name of an AMI that can be used with the -a flag that will not have this
Hi Spark developers,
I have the following HQLs for which Spark throws exceptions of this kind:
14/07/10 15:07:55 INFO TaskSetManager: Loss was due to
org.apache.spark.TaskKilledException [duplicate 17]
org.apache.spark.SparkException: Job aborted due to stage failure: Task
0.0:736 failed 4 times,
Hi,
I am doing graph computation using GraphX, but there seems to be an error with
the graph partition strategy specification.
As in the GraphX programming guide:
The Graph.partitionBy operator allows users to choose the graph partitioning
strategy, but due to SPARK-1931, this method is broken in Spark
Hello Folks:
I attended the session Aron D did at Hadoop Summit. He mentioned
training the model and saving it on HDFS. When you start scoring you can
read the saved model.
So, I can save the model using
sc.makeRDD(model.clusterCenters).saveAsObjectFile("model")
But when I try to read the
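For what it's worth, a minimal sketch (assuming the centers were written with saveAsObjectFile as above) of reading them back:
import org.apache.spark.mllib.linalg.Vector
// "model" is the output path used with saveAsObjectFile above
val centers: Array[Vector] = sc.objectFile[Vector]("model").collect()
// the centers can then be used to assign new points to the nearest center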
Hi All,
I have a class with too many variables to be implemented as a case class,
therefore I am using a regular class that implements Scala's Product interface.
Like so:
class Info () extends Product with Serializable {
var param1 : String =
var param2 : String =
...
var param38:
I have created the issue:
In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an
implementation bug
Have a look at it.
Thanks,
On Thu, Jul 10, 2014 at 8:37 PM, Bertrand Dechoux decho...@gmail.com
wrote:
A patch proposal on the apache JIRA for Spark?
Tobias,
Your help on the problems I have met has been very helpful. Thanks a lot!
Bill
On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
good to know you found your bottleneck. Unfortunately, I don't know how to
solve this; until now, I have used Spark only
SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
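For reference, a minimal sketch (the file path is a placeholder) of running mllib.clustering.KMeans on the small dataset from this thread:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("2dim2.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val model = KMeans.train(data, k = 2, maxIterations = 20)
model.clusterCenters.foreach(println)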
On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I ran the SparkKMeans example (not the
Hi Mayur, Thanks so much for the explanation. It did help me. Is there a way
I can log these details to the console rather than logging them? As of now,
once I start my application I can see this:
14/07/10 00:48:20 INFO yarn.Client: Application report from ASM:
application identifier:
news20.binary's feature dimension is 1.35M, so the serialized task
size is above the default limit of 10M. You need to set
spark.akka.frameSize to, e.g., 20. Due to a bug, SPARK-1112, this
parameter is not passed to executors automatically, which causes Spark to
freeze. This was fixed in the latest
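For illustration, a minimal sketch (the app name is a placeholder) of raising the frame size when building the SparkConf; the value is in MB:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("news20-example")        // placeholder app name
  .set("spark.akka.frameSize", "20")   // default is 10 (MB)
val sc = new SparkContext(conf)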
That output means you're running in yarn-cluster mode. So your code is
running inside the ApplicationMaster and has no access to the local
terminal.
If you want to see the output:
- try yarn-client mode, then your code will run inside the launcher process
- check the RM web UI and look at the
Thank you very much Yana for replying!
So right now the setup is a single-node machine which is my cluster, and
YES you are right, my submitting laptop has a different path to the
spark-1.0.0 installation than the cluster machine.
I tried to set SPARK_HOME on my submitting laptop using the actual
I'm just wondering what's the general recommendation for data pipeline
automation.
Say, I want to run Spark Job A, then B, then invoke script C, then do D, and
if D fails, do E, and if Job A fails, send email F, etc...
It looks like Oozie might be the best choice. But I'd like some
What do people usually do for this?
It looks like YARN might be the simplest since the Cloudera distribution
already installs this for you when you install Hadoop.
Any advantages of using Mesos instead?
Thanks.
So, one portion of our Spark streaming application requires some
state. Our application takes a bunch of application events (i.e.
user_session_started, user_session_ended, etc.), and calculates
metrics from these, and writes them to a serving layer (see: Lambda
Architecture). Two related
Yes, this is what I tried, but thanks!
On Wed, Jul 9, 2014 at 6:02 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Siyuan,
I do it like this:
// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)
// we need to wrap the data in a case class
I have a newbie question. What is the difference between SparkSQL and Shark?
Best,
Siyuan
We use Luigi for this purpose. (Our pipelines are typically on AWS (no
EMR) backed by S3 and using combinations of Python jobs, non-Spark
Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to
the master, and those are what get invoked from Luigi.)
—
p...@mult.ifario.us |
I ran the example with ./bin/run-example SparkKMeans file.txt 2 0.001
I get this response:
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)
The start point is not random. It uses the first K points from the given set
On
In short, Spark SQL is the future, built from the ground up. Shark was
built as a drop-in replacement for Hive, will be retired, and will perhaps
be replaced by a future initiative to run Hive on Spark
https://issues.apache.org/jira/browse/HIVE-7292.
More info:
-
Hi Yadid,
I have the same problem as you, so I implemented the Product interface as
well; even my code is similar to yours. But now I face another
problem: I don't know how to run the code... My whole program is like
this:
object SimpleApp {
class Record(val x1: String,
I am running spark-1.0.0 with java 1.8
java version 1.8.0_05
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
which spark-shell
~/bench/spark-1.0.0/bin/spark-shell
which scala
~/bench/scala-2.10.4/bin/scala
On Thursday, July
Setting SPARK_HOME is not super effective, because it is overridden very
quickly by bin/spark-submit here
https://github.com/apache/spark/blob/88006a62377d2b7c9886ba49ceef158737bc1b97/bin/spark-submit#L20.
Instead you should set the config spark.home. Here's why:
Each of your executors inherits
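For illustration, a minimal sketch (master URL and path are placeholders) of setting spark.home programmatically instead of relying on SPARK_HOME:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder master URL
  .set("spark.home", "/root/spark-1.0.0")  // where Spark lives on the worker nodes
val sc = new SparkContext(conf)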
Yes, there are a few bugs in the UI in the event of a node failure.
The duplicated stages in both the active and completed tables should be
fixed by this PR: https://github.com/apache/spark/pull/1262
The fact that the progress bar on the stages page displays an overflow
(e.g. 5/4) is still an
Hi all,
I am working to improve the parallelism of the Spark Streaming application.
But I have a problem understanding how the executors are used and how the
application is distributed.
1. In YARN, is one executor equal to one container?
2. I saw the statement that a streaming receiver runs on one
Hi Haoming,
For your spark-submit question: can you try using an assembly jar
(sbt/sbt assembly will build it for you)? Another thing to check is
if there is any package structure that contains your SimpleApp; if so
you should include the hierarchical name.
Zongheng
On Thu, Jul 10, 2014 at 11:33
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI iamyifa...@gmail.com wrote:
- how to build the latest version of Spark from the master branch, which
contains a fix?
Instead of downloading a prebuilt Spark release from
http://spark.apache.org/downloads.html, follow the instructions under
Development
Hello,
I want to run an MLlib task in the Scala API that creates a hadoopRDD from a
CustomInputFormat. According to the Spark API:
def hadoopRDD[K, V](conf: JobConf, inputFormatClass: Class[_ <:
org.apache.hadoop.mapred.InputFormat[K,V]], keyClass: Class[K], valueClass:
Class[V], minSplits: Int): RDD
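For illustration, a minimal sketch that uses the standard TextInputFormat where the custom InputFormat would go (the input path is a placeholder):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, new Path("hdfs:///some/input")) // placeholder path
val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], 2)
rdd.map(_._2.toString).take(5).foreach(println)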
Interesting question on Stack Overflow:
http://stackoverflow.com/q/24677180/877069
Basically, is there a way to take() elements of an RDD at an arbitrary
index?
Nick
The implementation of the input-stream-to-iterator function in #2 is
incorrect. The function should be such that, when hasNext is called on
the iterator, it tries to read from the buffered reader. If an object
(that is, a line) can be read, then return it; otherwise block and wait for
data
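For what it's worth, a minimal sketch of such an iterator (this is not the actual Spark NextIterator; the charset is an assumption):
import java.io.{BufferedReader, InputStream, InputStreamReader}
def streamToLines(in: InputStream): Iterator[String] = new Iterator[String] {
  private val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
  private var nextLine: String = null
  override def hasNext: Boolean = {
    if (nextLine == null) {
      nextLine = reader.readLine() // blocks until a line arrives or the stream closes
    }
    nextLine != null
  }
  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    val line = nextLine
    nextLine = null
    line
  }
}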
This is expensive but doable:
rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect()
-Xiangrui
On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas
nicholas.cham...@gmail.com wrote:
Interesting question on Stack Overflow:
http://stackoverflow.com/q/24677180/877069
Basically, is
I am new to Spark. I am trying to do the following:
Netcat -> Flume -> Spark Streaming (process Flume data) -> HDFS.
My Flume config file has the following setup:
Source = netcat
Sink = avrosink.
Spark Streaming code:
I am able to print data from Flume to the monitor, but I am struggling to
create a file.
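For illustration, a minimal sketch (host, port, batch interval and output path are placeholders; it assumes the spark-streaming-flume artifact is on the classpath) of writing the Flume stream to HDFS with saveAsTextFiles:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils
val ssc = new StreamingContext(sc, Seconds(10))
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 4141) // avro sink host/port
flumeStream.map(event => new String(event.event.getBody.array()))
  .saveAsTextFiles("hdfs:///flume/out/events")                    // one directory per batch
ssc.start()
ssc.awaitTermination()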
Hi Jerry,
To add to your question:
Following does work (from master) - notice the registerAsTable is commented
out (I took the liberty of adding the order by clause):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
hql("USE test")
// hql("select id from
Hi C.J.,
The PR Yana pointed out seems to fix this. However, it is not merged in
master yet, so for now I would recommend that you try the following
workaround: set spark.home to the executor's /path/to/spark. I provided
more detail here:
Hi Spark users and developers,
I'm doing some simple benchmarks with my team and we found out a potential
performance issue using Hive via SparkSQL. It is very bothersome. So your
help in understanding why it is terribly slow is very very important.
First, we have some text files in HDFS which
By the way, I also tried hql("select * from m").count. It is terribly slow
too.
On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote:
Hi Spark users and developers,
I'm doing some simple benchmarks with my team and we found out a potential
performance issue using Hive via
Hi,
I'm quite a Spark newbie so I might be wrong but I think that
registerAsTable works either on case classes or on classes extending
Product.
You find this info in an example on the doc page of Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html
// Define the schema
Hi Zongheng,
Thanks a lot for your reply.
I was editing my code in my group project and I forgot to remove the package
declaration... How silly!
Regards,
Haoming
Date: Thu, 10 Jul 2014 12:00:40 -0700
Subject: Re: SPARKSQL problem with implementing Scala's Product interface
From:
Looks like twitter4j http://twitter4j.org/archive/ 2.2.6 is what works,
but I don’t believe it’s documented anywhere.
Using 3.0.6 works for a while, but then causes the following error:
14/07/10 18:34:13 WARN ReceiverTracker: Error reported by receiver for
stream 0: Error in block pushing thread
Hi Spark users,
Also, to put the performance issue into perspective, we also ran the query
on Hive. It took about 5 minutes to run.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote:
By the way, I also tried hql("select * from m").count. It is terribly
In various previous versions of Spark (and I believe the current
version, 1.0.0, as well) we have noticed that it does not seem possible
to have a local SparkContext and a SparkContext connected to a cluster
via either a Spark Cluster (i.e. using the Spark resource manager) or a
YARN cluster.
Hi,
I have a CSV data file, which I have organized in the following format to
be read as a LabeledPoint (following the example in
mllib/data/sample_tree_data.csv):
1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2
The first column is the binary label (1 or 0) and the
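For illustration, a minimal sketch (the file path is a placeholder) of turning each such line into a LabeledPoint, with the first column as the label and the rest as features:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val points = sc.textFile("sample_tree_data.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}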
I used both - Oozie and Luigi - but found them inflexible and still
overcomplicated, especially in the presence of Spark.
Oozie has a fixed list of building blocks, which is pretty limiting. For
example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are
out of scope (of course, you can
I have a cluster running. I was able to run Spark Shell and submit programs.
But when I tried to use SparkR, I got these errors:
wifi-orcus:sparkR cwang$ MASTER=spark://wifi-orcus.dhcp.carrieriq.com:7077
sparkR
R version 3.1.0 (2014-04-10) -- Spring Dance
Copyright (C) 2014 The R Foundation
Hi Spark folks,
So on our production Spark cluster, it lives in the data center and I need
to attach to a VPN from my laptop, so that I can then submit a Spark
application job to the Spark Master (behind the VPN).
However, the problem arises that I have a local IP address on the laptop
which is
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote:
For the curious mind, the dataset is about 200-300GB and we are using 10
machines for this benchmark. Given the env is equal between the two
experiments, why pure spark is faster than SparkSQL?
There is going to be some
Hi Jerry,
Thanks for reporting this. It would be helpful if you could provide the
output of the following command:
println(hql("select s.id from m join s on (s.id=m_id)").queryExecution)
Michael
On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Spark developers,
I
I'll add that the SQL parser is very limited right now, and that you'll get
much wider coverage using hql inside of HiveContext. We are working on
bringing sql() much closer to SQL-92 though in the future.
On Thu, Jul 10, 2014 at 7:28 AM, premdass premdas...@yahoo.co.in wrote:
Thanks Takuya .
There is no version of Shark that is compatible with Spark 1.0, however,
Spark SQL does come included automatically. More information here:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
Spark Streaming uses twitter4j 3.0.3; 3.0.6 should probably work fine. The
exception that you are seeing is something that should be looked into. Can
you give us more logs (especially executor logs) with stack traces that have
the error?
TD
On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas
Are you specifying the number of reducers in all the DStream ByKey
operations? If the number of reducers is not set, then the number of reducers
used in the stages can keep changing across batches.
TD
On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi all,
I have a
This bug has been fixed. Either use the master branch of Spark, or maybe
wait a few days for Spark 1.0.1 to be released (voting has successfully
closed).
TD
On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote:
Hi
I get exactly the same problem here, do you've found the
The fileStream is not designed to work with continuously updating files, as
one of the main design goals of Spark is immutability (to guarantee
fault-tolerance by recomputation), and files that are appended to (mutating)
defeat that. It is rather designed to pick up new files added atomically
(using
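For illustration, a minimal sketch (directory and interval are placeholders) of the intended usage, where only files that newly appear in the directory are processed:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(30))
// Files moved/renamed into the directory after the stream starts are picked up;
// files that are appended to in place are not re-read.
val lines = ssc.textFileStream("hdfs:///incoming/dir")
lines.count().print()
ssc.start()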
Hi Tathagata,
I set default parallelism as 300 in my configuration file. Sometimes there
are more executors in a job. However, it is still slow. And I further
observed that most executors take less than 20 seconds but two of them take
much longer, such as 2 minutes. The data size is very small
Hi Michael,
Yes the table is partitioned on 1 column. There are 11 columns in the table
and they are all String type.
I understand that SerDes contributes some overhead, but using pure Hive,
we could run the query about 5 times faster than SparkSQL. Given that Hive
also has the same SerDes
Hi Michael,
I got the log you asked for. Note that I manually edited the table name and
the field names to hide some sensitive information.
== Logical Plan ==
Project ['s.id]
Join Inner, Some((id#106 = 'm.id))
Project [id#96 AS id#62]
MetastoreRelation test, m, None
MetastoreRelation
I do not believe the order of points in a distributed RDD is in any
way guaranteed. For a simple test, you can always add a last column
which is an id (make it a double and throw it in the feature vector).
Printing the RDD back will not give you the points in file order. If
you don't want to go that
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that I have also seen elsewhere. I've created a JIRA and assigned someone
at databricks to investigate.
https://issues.apache.org/jira/browse/SPARK-2443
On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam chiling...@gmail.com
Can you try setting the number of partitions in all the shuffle-based
DStream operations explicitly? It may be the case that the default
parallelism (that is, spark.default.parallelism) is probably not being
respected.
Regarding the unusual delay, I would look at the task details of that stage
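For illustration, a minimal sketch (key/value types and the partition count are placeholders) of passing the partition count explicitly to a shuffle-based DStream operation:
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations
import org.apache.spark.streaming.dstream.DStream
def countWithFixedPartitions(pairs: DStream[(String, Long)]): DStream[(String, Long)] =
  pairs.reduceByKey(_ + _, numPartitions = 300) // explicit, instead of the default parallelism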
Hmm, yeah looks like the table name is not getting applied to the
attributes of m. You can work around this by rewriting your query as:
hql("select s.id from (SELECT * FROM m) m join s on (s.id=m.id) order by
s.id")
This explicitly gives the alias m to the attributes of that table. You can
also
Hi,
I think it would be great if we could do the string parsing only once and
then just apply the transformation for each interval (reducing the
processing overhead for short intervals).
Also, one issue with the approach above is that transform() has the
following signature:
def
Andrew,
Thanks for replying. I did the following and the result was still the same.
1. Added spark.home /root/spark-1.0.0 to local conf/spark-defaults.conf,
where /root was the place in the cluster where I put Spark.
2. Ran bin/spark-shell --master
Yeah, the right solution is to have something like SchemaDStream, where the
schema of all the schemaRDD generated by it can be stored. Something I
really would like to see happen in the future :)
TD
On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I think it
Hi TD
Thanks.
I have a problem understanding the code on GitHub, object
SocketReceiver.bytesToLines(...)
https://github.com/apache/spark/blob/095b5182536a43e2ae738be93294ee5215d86581/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala
private[streaming]
Actually we have a POC project which shows the power of combining Spark
Streaming and Catalyst; it can run SQL on top of Spark Streaming and get a
SchemaDStream. You can take a look at it:
https://github.com/thunderain-project/StreamSQL
Thanks
Jerry
From: Tathagata Das
Hi
I am learning Spark Streaming, and am trying out the JavaNetworkCount
example.
#1 - This is the code I wrote
JavaStreamingContext sctx = new JavaStreamingContext("local", appName, new
Duration(5000));
JavaReceiverInputDStream<String> lines = sctx.socketTextStream("127.0.0.1",
);
No specific plans to do so, since there is some functional loss, like the
time-based windowing function, which is important for streaming SQL. Also,
keeping compatible with the fast-growing SparkSQL is quite hard. So there are
no clear plans to submit it upstream.
-Jerry
From: Tobias Pfeiffer
Right, this uses NextIterator, which is elsewhere in the repo. It just makes
it cleaner to implement a custom iterator. But I guess you got the high-level
point, so it's okay.
TD
On Thu, Jul 10, 2014 at 7:21 PM, kytay kaiyang@gmail.com wrote:
Hi TD
Thanks.
I have problem understanding
Do you want to continuously maintain the set of unique integers seen since
the beginning of the stream?
var uniqueValuesRDD: RDD[Int] = ...
dstreamOfIntegers.transform(newDataRDD => {
val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct
uniqueValuesRDD = newUniqueValuesRDD
//
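For completeness, a minimal sketch (variable and value types are assumed) of the full pattern:
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
def trackUniques(dstreamOfIntegers: DStream[Int], initial: RDD[Int]): DStream[Int] = {
  var uniqueValuesRDD: RDD[Int] = initial
  dstreamOfIntegers.transform { newDataRDD =>
    val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
    uniqueValuesRDD = newUniqueValuesRDD.cache() // keep the running set for the next batch
    uniqueValuesRDD
  }
}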
In addition to the wiki on Confluence
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals or
reading the source code, where/how can I get more information about PySpark
internals? I am mostly familiar with Python :(
Hi,
I am trying to run a program with Spark Streaming using Kafka on a
standalone system. These are my details:
Spark 1.0.0 hadoop2
Scala 2.10.3
I am trying a simple program using my custom sbt project, but this is the
error I am getting:
Exception in thread "main"
Hi,
I did not find any videos on the Apache Spark channel on YouTube yet.
Any idea when these will be made available?
Regards,
Ajay
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Hope it's useful for you.
Davies
On Thu, Jul 10, 2014 at 8:49 PM, Baofeng Zhang pelickzh...@qq.com wrote:
In addition to wiki on Confluence
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals or
reading