If I understand correctly, you cannot change the number of executors at
runtime, right (correct me if I am wrong)? It's defined when we start the
application and fixed. Do you mean the number of tasks?
On Fri, Jul 11, 2014 at 6:29 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Can you try
The easiest fix would be adding the Kafka jars to the SparkContext while
creating it.
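A minimal sketch of that suggestion (the jar paths are placeholders, not from this thread):

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: list the Kafka jars when the context is created so they are
// shipped to the executors (paths below are placeholders).
val conf = new SparkConf()
  .setAppName("KafkaStreaming")
  .setJars(Seq(
    "/path/to/spark-streaming-kafka_2.10-1.1.0-SNAPSHOT.jar",
    "/path/to/kafka_2.10-0.8.1.1.jar"))
val ssc = new StreamingContext(conf, Seconds(1))
```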
Thanks
Best Regards
On Fri, Jul 11, 2014 at 4:39 AM, Dilip dilip_ram...@hotmail.com wrote:
Hi,
I am trying to run a program with Spark Streaming using Kafka on a
standalone system. These are my details:
You simply use the *nc* command to do this. like:
nc -p 12345
will open the 12345 port and from the terminal you can provide whatever
input you require for your StreamingCode.
Thanks
Best Regards
On Fri, Jul 11, 2014 at 2:41 AM, kytay kaiyang@gmail.com wrote:
Hi
I am learning spark
Hi Praveen,
I did not change the total number of executors. I specified 300 as the
number of executors when I submitted the jobs. However, for some stages,
the number of executors is very small, leading to long calculation times
even for a small data set. That means not all executors were used for
Sorry, the command is
nc -lk 12345
Thanks
Best Regards
On Fri, Jul 11, 2014 at 6:46 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You simply use the *nc* command to do this. like:
nc -p 12345
will open the 12345 port and from the terminal you can provide whatever
input you require for
Hi Akhil,
Can you please guide me through this? Because the code I am running
already has this in it:
[java]
SparkContext sc = new SparkContext();
sc.addJar("/usr/local/spark/external/kafka/target/scala-2.10/spark-streaming-kafka_2.10-1.1.0-SNAPSHOT.jar");
Is there something I am
Hi Tathagata,
Thanks for the solution. Actually, I will use the number of unique integers
in the batch instead of the cumulative number of unique integers.
I do have two questions about your code:
1. Why do we need uniqueValuesRDD? Why do we need to call
uniqueValuesRDD.checkpoint()?
2. Where
I have met similar issues. The reason is probably that spark-streaming-kafka
is not included in the Spark assembly. Currently, I am using Maven to
generate a shaded package with all the dependencies. You may try to use sbt
assembly to include the dependencies in your jar file.
Bill
On Thu, Jul
You may look into the new Azkaban - which while being quite heavyweight is
actually quite pleasant to use when set up.
You can run Spark jobs (spark-submit) using Azkaban shell commands and pass
parameters between jobs. It supports dependencies, simple DAGs and scheduling
with retries.
Hi Tathagata,
I also tried to pass the number of partitions as a parameter to functions
such as groupByKey. It seems the number of executors is around 50 instead
of 300, which is the number of executors I specified in the submission
script. Moreover, the running time of different executors is
We used Azkaban for a short time and suffered a lot. In the end we almost
rewrote it completely. I really don't recommend it.
From: Nick Pentreath nick.pentre...@gmail.com
Reply-To: user@spark.apache.org
Date: Friday, July 11, 2014, 3:18 PM
To: user@spark.apache.org
Subject: Re: Recommended pipeline automation tool? Oozie?
Did you use old azkaban or azkaban 2.5? It has been completely rewritten.
Not saying it is the best but I found it way better than oozie for example.
Sent from my iPhone
On 11 Jul 2014, at 09:24, 明风 mingf...@taobao.com wrote:
We used Azkaban for a short time and suffered a lot. In the end we
I also took a look at
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
and ran the code in a shell.
There is an issue here:
val initMode = params.initializationMode match {
  case Random => KMeans.RANDOM
  case Parallel => KMeans.K_MEANS_PARALLEL
A simple
sbt assembly
is not working. Is there any other way to include particular jars with
assembly command?
Regards,
Dilip
On Friday 11 July 2014 12:45 PM, Bill Jay wrote:
I have met similar issues. The reason is probably because in Spark
assembly, spark-streaming-kafka is not
Hi,
I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
clustering with Strings.
My code works great when I use a five-figure number of training elements.
However, with 2 million elements, for example, it gets extremely slow. A
single stage may take up to 30 minutes.
From
How many partitions do you use for your data? If the default is 1, you
probably need to manually ask for more partitions.
Also, I'd check that your executors aren't thrashing close to the GC
limit. This can make things start to get very slow.
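For example, a rough sketch (the path and partition counts are placeholders):

```
// Ask for more partitions when reading the data...
val data = sc.textFile("hdfs:///path/to/input", 64) // minPartitions = 64
// ...or repartition an RDD you already have (rdd is a placeholder name).
val spread = rdd.repartition(64)
```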
On Fri, Jul 11, 2014 at 9:53 AM, durin
Hi Akhil Das
I have tried the
nc -lk
command too.
I was hoping the System.out.println("Print text: " + arg0); would be printed
when a stream is processed, when lines.flatMap(...) is called.
But from my test with nc -lk , nothing is printed on the console at
all.
==
To test out whether the nc
I think I should be seeing any line of text that I have typed in the nc
command.
Can you try this piece of code?
SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000));
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(args[0],
Hi TD:
The input file is on hdfs.
The file is approx 2.7 GB and when the process starts, there are 11 tasks
(since hdfs block size is 256M) for processing and 2 tasks for reduce by key.
After the file has been processed, I see new stages with 2 tasks that continue
to be generated. I
I saw some exceptions like this in driver log. Can you shed some lights? Is it
related with the behaviour?
14/07/11 20:40:09 ERROR LiveListenerBus: Listener JobProgressListener threw an
exception
java.util.NoSuchElementException: key not found: 64019
at
Hi, folks.
We're having a problem with iteration that I don't understand.
We have the following test code:
org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.WARN)
org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.WARN)
def test (caching: Boolean,
Hi Folks,
Does any one have experience or recommendations on incorporating categorical
features (attributes) into k-means clustering in Spark? In other words, I want
to cluster on a set of attributes that include categorical variables.
I know I could probably implement some custom code to
Since you can't define your own distance function, you will need to
convert these to numeric dimensions. 1-of-n encoding can work OK,
depending on your use case. So a dimension that takes on 3 categorical
values becomes 3 dimensions, of which all are 0 except one that has
value 1.
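A small sketch of that 1-of-n encoding feeding into K-means (the category values and data are made up for illustration):

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical categorical feature with 3 values -> 3 numeric dimensions.
val categories = Seq("red", "green", "blue")
val index = categories.zipWithIndex.toMap

val raw = sc.parallelize(Seq("red", "blue", "green", "red"))
val vectors = raw.map { c =>
  val arr = Array.fill(categories.size)(0.0)
  arr(index(c)) = 1.0          // all dimensions 0 except the matching one
  Vectors.dense(arr)
}.cache()

val model = KMeans.train(vectors, 2, 20) // k = 2, 20 iterations
```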
On Fri, Jul
I see. So, basically, kind of like dummy variables like with regressions.
Thanks, Sean.
On Jul 11, 2014, at 10:11 AM, Sean Owen so...@cloudera.com wrote:
Since you can't define your own distance function, you will need to
convert these to numeric dimensions. 1-of-n encoding can work OK,
Hi,
In the spark streaming paper, slack time has been suggested for delaying the
batch creation in the case of external timestamps. I don't see any such
option in StreamingContext. Is it available in the API?
Also going through the previous posts, queueStream has been suggested for this.
I
Hi Gerard,
This has been on my to-do list for a long time... I just published a Calliope
snapshot built against Hadoop 2.2.x. Take it for a spin if you get a chance -
You can get the jars from here -
-
Hi,
The Databricks demo at Spark Summit was amazing... what's the frontend stack
used specifically for rendering multiple reactive charts on the same DOM? Looks
like that's an emerging pattern for correlating different data APIs...
Thanks
Deb
Hi Wanda,
As Sean mentioned, K-means is not guaranteed to find an optimal answer,
even for seemingly simple toy examples. A common heuristic to deal with
this issue is to run kmeans multiple times and choose the best answer. You
can do this by changing the runs parameter from the default value
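For example, a sketch using the builder API (the parameter values are arbitrary, and data is assumed to be an RDD[Vector] you already prepared):

```
import org.apache.spark.mllib.clustering.KMeans

// Run K-means several times with different random starts and keep the best
// (lowest-cost) clustering.
val model = new KMeans()
  .setK(3)
  .setMaxIterations(20)
  .setRuns(10)   // default is 1
  .run(data)
```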
I want to create an index, similar to an RDBMS index, on keyPnl on the
pnl_type_code so that the group by can be done efficiently. How do I achieve
that?
Currently the code below blows out of memory in Spark on 60GB of data.
keyPnl is a very large file. We have been stuck for 1 week, trying kryo,
mapValues, etc. but without
You may try to use this one:
https://github.com/sbt/sbt-assembly
I had an issue with duplicate files in the uber jar. But I think this
library will assemble the dependencies into a single jar file.
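If it helps, a rough build.sbt sketch for resolving the duplicate-file conflicts (the exact key names depend on your sbt-assembly version, so treat this as an assumption):

```
// build.sbt (sketch): resolve duplicate META-INF entries when building the
// uber jar with sbt-assembly.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```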
Bill
On Fri, Jul 11, 2014 at 1:34 AM, Dilip dilip_ram...@hotmail.com wrote:
A simple
sbt
As mentioned, deprecated in Spark 1.0+.
Try to use the --driver-class-path:
./bin/spark-shell --driver-class-path yourlib.jar:abc.jar:xyz.jar
Don't use glob *, specify the JAR one by one with colon.
Date: Wed, 9 Jul 2014 13:45:07 -0700
From: kat...@cs.pitt.edu
Subject: SPARK_CLASSPATH Warning
1. Since the RDD of the previous batch is used to create the RDD of the
next batch, the lineage of dependencies in the RDDs continues to grow
infinitely. That's not good because it increases fault-recovery times,
task sizes, etc. Checkpointing saves the data of an RDD to HDFS and
truncates the
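In code form, a minimal sketch (the directory, the interval, and the stateStream name are placeholders):

```
// Enable checkpointing so the ever-growing lineage is periodically cut by
// saving RDD data to HDFS.
ssc.checkpoint("hdfs:///user/spark/checkpoints")   // placeholder path
stateStream.checkpoint(Seconds(10))                // placeholder interval
```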
Ok, I found it on JIRA SPARK-2390:
https://issues.apache.org/jira/browse/SPARK-2390
So it looks like this is a known issue.
From: alee...@hotmail.com
To: user@spark.apache.org
Subject: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file
option?
Date: Tue, 8 Jul 2014 15:17:00
Whenever you need to do a shuffle-based operation like reduceByKey,
groupByKey, join, etc., the system is essentially redistributing the data
across the cluster and it needs to know how many parts it should divide the
data into. That's where the default parallelism is used.
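For example (a sketch; the pair RDD and the value 300 are only illustrative):

```
// Either pass the partition count to the shuffle operation itself...
val counts = pairs.reduceByKey(_ + _, 300)

// ...or set a cluster-wide default before creating the context.
val conf = new SparkConf().set("spark.default.parallelism", "300")
```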
TD
On Fri, Jul 11,
Hi Praveen,
Thank you for the answer. That's interesting because if I only bring up one
executor for Spark Streaming, it seems only the receiver is working and no
other tasks are happening, judging by the log and UI. Maybe it's just
because the receiving task eats all the resources, not because
What is the error you are getting when you say "I was trying to write the
data to hdfs, but it fails..."?
TD
On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X.
muthu.x.sundaram@sabre.com wrote:
I am new to spark. I am trying to do the following.
Netcat → Flume → Spark streaming (process Flume
Hi,
I'm running into an identical issue running Spark 1.0.0 on Mesos 0.19. Were
you able to get it sorted? There's no real documentation for the
spark.httpBroadcast.uri except what's in the code - is this config setting
required for running on a Mesos cluster?
I'm running this in a dev
So, is it expected for the process to generate stages/tasks even after
processing a file ?
Also, is there a way to figure out the file that is getting processed and when
that process is complete ?
Thanks
On Friday, July 11, 2014 1:51 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Hello all,
In our use case we would like to return the top 10 predicted values. I've
looked at NaiveBayes and LogisticRegressionModel and cannot seem to find a
way to get the predicted values for a vector - is this possible with
mllib/spark?
Thanks,
Rich
I don't believe it is. Recently when I needed to do this, I just
copied out the underlying probability / margin function and calculated
it from the model params. It's just a dot product.
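For a LogisticRegressionModel, a sketch of that calculation (this is not Sean's actual code):

```
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

// Recompute the probability from the model parameters: margin = w . x + b,
// then apply the logistic function.
def probability(model: LogisticRegressionModel, features: Vector): Double = {
  val w = model.weights.toArray
  val x = features.toArray
  val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum + model.intercept
  1.0 / (1.0 + math.exp(-margin))
}
```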
On Fri, Jul 11, 2014 at 7:48 PM, Rich Kroll
rich.kr...@modernizingmedicine.com wrote:
Hello all,
In our use
I am trying to run Logistic Regression on the url dataset (from
libsvm) using the exact same code
as the example on a 5 node Yarn-Cluster.
I get a pretty cryptic error that says
Killed
Nothing more
Settings:
--master yarn-client
--verbose
--driver-memory 24G
--executor-memory 24G
Hi all,
My company is actively using spark machine learning library, and we would love
to see Gradient Boosting Machine algorithm (and perhaps Adaboost algorithm as
well) being implemented. I’d greatly appreciate it if anyone could help to move
it forward or to elevate this request.
Thanks,
On Fri, Jul 11, 2014 at 7:32 PM, durin m...@simon-schaefer.net wrote:
How would you get more partitions?
You can specify this as the second arg to methods that read your data
originally, like:
sc.textFile(..., 20)
I ran broadcastVector.value.repartition(5), but
Just curious: how about using scala to drive the workflow? I guess if you
use other tools (oozie, etc) you lose the advantage of reading from RDD --
you have to read from HDFS.
Best regards,
Wei
-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research
Hi Joseph,
Thanks for your email. Many users are requesting this functionality. While
it would be a stretch for it to appear in Spark 1.1, various people
(including Manish Amde and folks at the AMPLab, Databricks and Alpine Labs)
are actively working on developing ensembles of decision trees
Hi,
I have a java application that is outputting a string every second. I'm
running the wordcount example that comes with Spark 1.0, and running nc -lk
. When I type words into the terminal running netcat, I get counts.
However, when I write the String onto a socket on port , I don't get
I forgot to add that I get the same behavior if I tail -f | nc localhost
on a log file.
On Fri, Jul 11, 2014 at 1:25 PM, Walrus theCat walrusthe...@gmail.com
wrote:
Hi,
I have a java application that is outputting a string every second. I'm
running the wordcount example that comes
netcat is listening for a connection on port . It is echoing what
you type to its console to anything that connects to and reads.
That is what Spark streaming does.
If you yourself connect to and write, nothing happens except that
netcat echoes it. This does not cause Spark to
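To make the direction concrete, here is a sketch (not from this thread) of a program playing the listening role that nc -lk plays, so Spark's socketTextStream can connect to it and read (the port is a placeholder):

```
import java.io.PrintWriter
import java.net.ServerSocket

// Listen on a port and write lines to whoever connects; Spark's socket
// receiver is the side that connects and reads.
val server = new ServerSocket(12345)
val client = server.accept()                       // Spark connects here
val out = new PrintWriter(client.getOutputStream, true)
(1 to 100).foreach { i =>
  out.println(s"line number $i")                   // your application's output
  Thread.sleep(1000)
}
out.close()
client.close()
```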
Hi all,
We've been evaluating Spark for a long-term project. Although we've been
reading several topics in the forum, any hints on the following topics will be
extremely welcome:
1. Which are the data partition strategies available in Spark? How
configurable are these strategies?
2. How would be
The model for file stream is to pick up and process new files written
atomically (by move) into a directory. So your file is being processed in a
single batch, and then it's waiting for any new files to be written into
that directory.
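A minimal sketch of that model (the directory is a placeholder):

```
// Only files that appear (atomically, e.g. via rename) in this directory
// after the context starts are picked up.
val lines = ssc.textFileStream("hdfs:///data/incoming")
```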
TD
On Fri, Jul 11, 2014 at 11:46 AM, M Singh
The same executor can be used for both receiving and processing,
irrespective of the deployment mode (yarn, spark standalone, etc.) It boils
down to the number of cores / task slots that executor has. Each receiver
is like a long running task, so each of them occupy a slot. If there are
free slots
Can you show us the program that you are running? If you are setting the
number of partitions in the XYZ-ByKey operation to 300, then there should be
300 tasks for that stage, distributed over the 50 executors allocated to your
context. However, the data distribution may be skewed, in which case, you
Hi CJ,
Looks like I overlooked a few lines in the spark-shell case. It appears that
spark-shell explicitly overwrites spark.home to whatever SPARK_HOME is:
https://github.com/apache/spark/blob/f4f46dec5ae1da48738b9b650d3de155b59c4674/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L955
Hi Tathagata,
Thank you. Is task slot equivalent to the core number? Or actually one core
can run multiple tasks at the same time?
Best,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
The same executor can
spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in
the cluster and gave 48g to executors. also tried kyro serialization.
Traceback (most recent call last):
  File "/mohit/./m.py", line 58, in <module>
    spark_data = sc.parallelize(spark_data_array)
  File
Hi Tathagata,
Below is my main function. I omit some filtering and data conversion
functions. These functions are just one-to-one mappings, which should not
noticeably increase the running time. The only reduce function I have here is
groupByKey. There are 4 topics in my Kafka brokers and two of the
Hi,
I am trying GraphX on the LiveJournal data. I have a cluster of 17 computing
nodes: 1 master and 16 workers. I have a few questions about this.
* I built Spark from spark-master (to avoid the partitionBy error of Spark 1.0).
* I am using edgeListFile() to load the data, and I figured I need to provide
If you are on 1.0.0 release you can also try converting your RDD to a
SchemaRDD and run a groupBy there. The SparkSQL optimizer may yield
better results. It's worth a try at least.
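A rough sketch of that suggestion on Spark 1.0 (the case class, table and column names are made up):

```
import org.apache.spark.sql.SQLContext

// Convert a case-class RDD to a SchemaRDD and let the SQL optimizer handle
// the grouping.
case class Event(day: Int, value: Double)
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val events = rdd.map(r => Event(r.day, r.value))   // rdd is a placeholder
events.registerAsTable("events")
val grouped = sqlContext.sql("SELECT day, COUNT(*) FROM events GROUP BY day")
```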
On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
Solution 2 is to map the
ssimanta wrote
Solution 2 is to map the objects into a pair RDD where the
key is the number of the day in the interval, then group by
key, collect, and parallelize the resulting grouped data.
However, I worry collecting large data sets is going to be
a serious performance bottleneck.
Why do
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB shreyanshpbh...@gmail.com
wrote:
-- Is it a correct way to load file to get best performance?
Yes, edgeListFile should be efficient at loading the edges.
-- What should the number of partitions be? Equal to the number of computing
nodes, or to the number of cores?
In general it should be a
Thanks a lot Ankur, I'll follow that.
A last quick
Does that error affect performance?
~Shreyansh
I think my best option is to partition my data in directories by day
before running my Spark application, and then direct
my Spark application to load RDD's from each directory when
I want to load a date range. How does this sound?
If your upstream system can write data by day then it makes
Sean Owen-2 wrote
Can you not just filter the range you want, then groupBy
timestamp/86400 ? That sounds like your solution 1 and is about as
fast as it gets, I think. Are you thinking you would have to filter
out each day individually from there, and that's why it would be slow?
I don't
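As a sketch, Sean's filter-then-groupBy suggestion might look like this (the event type and field names are assumptions):

```
// Keep only the requested date range, then bucket records by day.
val inRange = events.filter(e => e.timestamp >= startTs && e.timestamp < endTs)
val byDay   = inRange.groupBy(e => e.timestamp / 86400) // epoch seconds -> day
```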
Hi,
I tried out the streaming program on the Spark training web page. I created
a Twitter app as per the instructions (pointing to http://www.twitter.com).
When I run the program, my credentials get printed out correctly but
thereafter, my program just keeps waiting. It does not print out the
On Fri, Jul 11, 2014 at 10:53 PM, bdamos a...@adobe.com wrote:
I didn't make it clear in my first message that I want to obtain an RDD
instead
of an Iterable, and will be doing map-reduce like operations on the
data by day. My problem is that groupBy returns an RDD[(K, Iterable[T])],
but I
I don't think it should affect performance very much, because GraphX
doesn't serialize ShippableVertexPartition in the fast path of
mapReduceTriplets execution (instead it calls
ShippableVertexPartition.shipVertexAttributes and serializes the result). I
think it should only get serialized for
I like the idea of using scala to drive the workflow. Spark already comes
with a scheduler, why not program a plugin to schedule other types of tasks
(copy file, send email, etc.)? Scala could handle any logic required by the
pipeline. Passing objects (including RDDs) between tasks is also easier.
Hey Jerry,
When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?
On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust
mich...@databricks.com wrote:
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that
You don't get any exception from twitter.com, saying credential error or
something?
I have seen this happen once when someone was behind a VPN to their office,
and Twitter was probably blocked in their office.
You could be having a similar issue.
TD
On Fri, Jul 11, 2014 at 2:57 PM, SK
Hi, all
I would like to give the JDBC server (which is supposed to be released in 1.1)
a try.
Where can I find the documentation about that?
Best,
--
Nan Zhu
Yes, even though we don't have immediate plans, I would definitely like to
see it happen some time in the not-so-distant future.
TD
On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote:
No specific plans to do so, since there is some functional loss, like
time based
I totally agree that if we are able to do this it will be very cool.
However, this requires having a common trait / interface between RDD and
DStream, which we don't have as of now. It would be very cool though. On my
wish list for sure.
TD
On Thu, Jul 10, 2014 at 11:53 AM, mshah
nvm
for others with the same question:
https://github.com/apache/spark/commit/8032fe2fae3ac40a02c6018c52e76584a14b3438
--
Nan Zhu
On Friday, July 11, 2014 at 7:02 PM, Nan Zhu wrote:
Hi, all
I would like to give a try on JDBC server (which is supposed to be released
in 1.1)
where
I don't get any exceptions or error messages.
I tried it both with and without VPN and had the same outcome. But I can
try again without VPN later today and report back.
thanks.
Hi folks,
I just ran another job that only received data from Kafka, did some
filtering, and then save as text files in HDFS. There was no reducing work
involved. Surprisingly, the number of executors for the saveAsTextFiles
stage was also 2 although I specified 300 executors in the job
Do you have a proxy server?
If yes, you need to set the proxy for twitter4j.
On Jul 11, 2014, at 7:06 PM, SK skrishna...@gmail.com wrote:
I dont get any exceptions or error messages.
I tried it both with and without VPN and had the same outcome. But I can
try again without VPN later
A while ago, I wrote this:
```
package com.virdata.core.compute.common.api
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext
sealed trait SparkEnvironment extends Serializable
I don't have a proxy server.
thanks.
Facing a funny issue with the Spark class loader. Testing out a basic
functionality on a vagrant VM with spark running - looks like it's
attempting to ship the jar to a remote instance (in this case local) and
somehow is encountering the jar twice?
14/07/11 23:27:59 INFO DAGScheduler: Got job 0
Great! Thanks a lot.
I hate to say this, but I promise this is the last quickie.
I looked at the configurations but I didn't find any parameter to tune for
network bandwidth, i.e., is there any way to tell GraphX (Spark) that I'm using
a 1G network, a 10G network, or InfiniBand? Does it figure it out on its
I just tested a long-lived application (that we normally run in standalone
mode) on YARN in client mode.
It looks to me like cached RDDs are missing in the Storage tab of the UI.
Accessing the RDD storage information via the Spark context shows the RDDs as
fully cached, but they are missing on
Hi,
I need to perform binary classification on an image dataset. Each image is a
data point described by a Json object. The feature set for each image is a
set of feature vectors, each feature vector corresponding to a distinct
object in the image. For example, if an image has 5 objects, its
Spark just opens up inter-slave TCP connections for message passing
during shuffles (I think the relevant code is in ConnectionManager). Since
TCP automatically determines the optimal sending rate
(http://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm),
Spark doesn't need any
Perfect! Thanks Ankur.
You can load the dataset as an RDD of JSON object and use a flatMap to
extract feature vectors at object level. Then you can filter the
training examples you want for binary classification. If you want to
try multiclass, check out DB's PR at
https://github.com/apache/spark/pull/1379
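A very rough sketch of that flatMap approach (parseImage and the field layout are hypothetical):

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// parseImage is a hypothetical helper that turns one JSON line into
// (label, featureArray) pairs, one per object in the image.
val examples = jsonLines.flatMap { line =>
  parseImage(line).map { case (label, features) =>
    LabeledPoint(label, Vectors.dense(features))
  }
}
val binary = examples.filter(p => p.label == 0.0 || p.label == 1.0)
```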
Best,
Xiangrui
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for
Try running a simple standalone program if you are using Scala and see if
you are getting any data. I use this to debug any connection/twitter4j
issues.
import twitter4j._
//put your keys and creds here
object Util {
val config = new twitter4j.conf.ConfigurationBuilder()
Does nothing get printed on the screen? If you are not getting any tweets
but spark streaming is running successfully you should get at least counts
being printed every batch (which would be zero). If they are not being
printed either, check the Spark web UI to see whether stages are running or not. If
Task slot is equivalent to core number. So one core can only run one task
at a time.
TD
On Fri, Jul 11, 2014 at 1:57 PM, Yan Fang yanfang...@gmail.com wrote:
Hi Tathagata,
Thank you. Is task slot equivalent to the core number? Or actually one
core can run multiple tasks at the same time?
Aah, I get it now. That is because the input data streams is replicated on
two machines, so by locality the data is processed on those two machines.
So the map stage on the data uses 2 executors, but the reduce stage
(after groupByKey) and the saveAsTextFiles would use 300 tasks. And the default
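One way to break that locality coupling, as a sketch (300 is just the figure from this thread):

```
// Spread the received data across the cluster right after ingestion so later
// stages are not confined to the two receiver machines.
val spread = kafkaStream.repartition(300)
```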
I put the same dataset into scala (using spark-shell) and it acts weird. I
cannot do a count on it, the executors seem to hang. The WebUI shows 0/96
in the status bar, shows details about the worker nodes but there is no
progress.
sc.parallelize does finish (takes too long for the data size) in
This is not in the current streaming API.
Queue stream is useful for testing with generated RDDs, but not for actual
data. For actual data stream, the slack time can be implemented by doing
DStream.window on a larger window that take slack time in consideration,
and then the required
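As a sketch of the windowing idea (the durations are placeholders):

```
// Process each batch together with an extra "slack" period of data, by
// windowing over (batch + slack) while still sliding by one batch.
val batch = Seconds(10)   // placeholder batch interval
val slack = Seconds(5)    // placeholder slack time
val windowed = inputStream.window(batch + slack, batch)
```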
Hey Andy,
That's pretty cool!! Is there a GitHub repo where you can share this piece
of code for us to play around with? If we can come up with a simple enough
general pattern, that could be very useful!
TD
On Fri, Jul 11, 2014 at 4:12 PM, andy petrella andy.petre...@gmail.com
wrote:
A while ago, I
Hi Tathagata,
Do you mean that the data is not shuffled until the reduce stage? That
means groupBy still only uses 2 machines?
I think I used repartition(300) after I read the data from Kafka into
DStream. It seems that it did not guarantee that the map or reduce stages
will be run on 300
Congrats to the Spark community !
On Friday, July 11, 2014, Patrick Wendell pwend...@gmail.com wrote:
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core
Thanks Michael
Missed that point, as well as the integration of SQL within the Scala shell
(by setting the SQLContext). Looking forward to feature parity in future
releases (Shark -> Spark SQL).
Cheers.
From: mich...@databricks.com
Date: Thu, 10 Jul 2014 16:20:20 -0700
Subject: Re: EC2
I tried my test case with Spark 1.0.1 and saw the same result (27 pairs
become 25 pairs after zip).
Could someone please check it?
Regards,
xj
On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng men...@gmail.com wrote:
This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master.