Spark Streaming
https://spark.apache.org/docs/latest/streaming-programming-guide.html is
the best fit for this use case. Basically, you create a streaming context
pointing to that directory, and you can also set the streaming interval (in
your case it's 5 minutes). Spark Streaming will only process the
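A minimal sketch of that setup, assuming the files land in a single HDFS
directory (the path and sparkConf are placeholders):

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sparkConf, Minutes(5))   // 5-minute batch interval
// textFileStream monitors the directory and picks up only files created after start
val lines = ssc.textFileStream("hdfs:///path/to/incoming")
lines.foreachRDD(rdd => rdd.take(10).foreach(println))  // process each batch here
ssc.start()
ssc.awaitTermination()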
Hello,
I am trying to run a job on two workers. I have a cluster of 3
computers where one is the master and the other two are workers. I am able
to successfully register the separate physical machines as workers in the
cluster. When I run a job with a single worker connected, it runs
I'm trying a Naive Bayes classifier for the Higgs Boson challenge on Kaggle:
http://www.kaggle.com/c/higgs-boson
Here's the source code I'm working on:
https://github.com/dnprock/SparkHiggBoson/blob/master/src/main/scala/KaggleHiggBosonLabel.scala
Training data looks like this:
Hi all,
I'm a total newbie on Spark, so my question may be a dumb one.
I tried Spark to compute values, on this side all works perfectly (and it's
fast :) ).
At the end of the process, I have an RDD with Key (String) / Values (Array
of String); from this I want to get only one entry, like this:
What is the ratio of examples labeled `s` to those labeled `b`? Also,
Naive Bayes doesn't work on negative feature values. It assumes term
frequencies as the input. We should throw an exception on negative
feature values. -Xiangrui
On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do phu...@vida.io wrote:
You can use the function lookup() to accomplish this too; it may be a
bit faster.
It will never be efficient like a database lookup since this is
implemented by scanning through all of the data. There is no index or
anything.
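A sketch of the lookup() call on an RDD shaped like the one described, with
toy data standing in for the real input:

val rdd = sc.parallelize(Seq(("a", Array("x", "y")), ("b", Array("z"))))
// lookup() returns all values for the key; without a partitioner it scans every partition
val values: Seq[Array[String]] = rdd.lookup("a")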
On Tue, Aug 19, 2014 at 8:43 AM, Emmanuel Castanier
Hello,
On the master's web UI, even though there are two workers shown, there
is only one executor. There is an executor for machine1 but no executor for
machine2. Hence if only machine1 is added as a worker the program runs, but
if only machine2 is added, it fails with the same error 'Master
Thanks for your answer.
In my case that's sad, because we have only 60 entries in the final RDD; I was
thinking it would be fast to get the needed one.
On 19 August 2014 at 09:58, Sean Owen so...@cloudera.com wrote:
You can use the function lookup() to accomplish this too; it may be a
bit
sc.textFile already returns just one RDD for all of your files. The
sc.union is unnecessary, although I don't know if it's adding any
overhead. The data is certainly processed in parallel and how it is
parallelized depends on where the data is -- how many InputSplits
Hadoop produces for them.
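For illustration, a sketch of reading many files as one RDD (paths hypothetical):

// a glob or a comma-separated list both produce a single RDD, no union needed
val all = sc.textFile("hdfs:///logs/2014-08-*")
val some = sc.textFile("hdfs:///logs/a.log,hdfs:///logs/b.log")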
If
In that case, why not collectAsMap() and have the whole result as a
simple Map in memory? Then lookups are trivial. RDDs aren't
distributed maps.
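A sketch of that approach, assuming the RDD of String keys and Array[String]
values described above (the key is hypothetical):

// pull the whole (small) result to the driver once; lookups become local map accesses
val localMap: scala.collection.Map[String, Array[String]] = rdd.collectAsMap()
val entry = localMap("someKey")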
On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier
emmanuel.castan...@gmail.com wrote:
Thanks for your answer.
In my case, that’s sad cause we have
Thanks Shivaram,
This was the issue. Now I have installed Rscript on all the nodes in the Spark
cluster and it works both from a script as well as from the R prompt.
Thanks
Stuti Awasthi
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Tuesday, August 19, 2014 1:17 PM
To: Stuti
Hi Amit,
I think the type of the data contained in your RDD needs to be a known case
class, and not abstract, for createSchemaRDD. This makes sense when you
consider that it needs to know about the fields in the object to create the schema.
I had the same issue when I used an abstract base class for a
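A minimal sketch of the working shape against the Spark 1.0 SQL API (names
hypothetical):

case class Record(id: Int, value: String)  // concrete case class, not an abstract type

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion RDD[case class] -> SchemaRDD

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
rdd.registerAsTable("records")     // works because the fields can be reflected
sqlContext.sql("SELECT id FROM records").collect()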
Hi guys,
I want to create two RDD[(K, V)] objects and then collocate partitions with
the same K on one node.
When the same partitioner for two RDDs is used, partitions with the same K
end up being on different nodes.
Here is a small example that illustrates this:
// Let's say I have 10
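A sketch of the setup being described, assuming a HashPartitioner (numbers
hypothetical):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(10)
val a = sc.parallelize(1 to 1000).map(k => (k, "a")).partitionBy(part)
val b = sc.parallelize(1 to 1000).map(k => (k, "b")).partitionBy(part)
// a and b are co-partitioned: partition i of each holds the same keys, so a
// join avoids a shuffle -- but Spark does not guarantee that the two
// partition i's are placed on the same node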
sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group by
app_id
Error Log
14/08/19 17:58:26 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 158.6 KB, free 294.7 MB)
Exception in thread "main" java.lang.RuntimeException: [1.36] failure:
Hi all,
I'm doing some testing on a small dataset (HadoopRDD, 2GB, ~10M records), with
a cluster of 3 nodes.
Simple calculations like count take approximately 5s when using the default
value of executor.memory (512MB). When I scale this up to 2GB, several Tasks
take 1m or more (while most
Looks like 1 worker is doing the job. Can you repartition the RDD? Also,
what is the number of cores that you allocated? Things like this you can
easily identify by looking at the workers' web UI (default: worker:8081).
Thanks
Best Regards
On Tue, Aug 19, 2014 at 6:35 PM, Laird, Benjamin
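As a sketch, the repartition suggestion above looks like this (the partition
count is a placeholder; a common rule of thumb is 2-3x the total number of cores):

// spread the data over more partitions so every worker gets tasks
val repartitioned = rdd.repartition(24)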
Given a fixed amount of memory allocated to your workers, more memory per
executor means fewer executors can execute in parallel. This means it takes
longer to finish all of the tasks. Set it high enough, and your executors can
find no worker with enough memory, and so they all are stuck waiting for
Thanks Akhil and Sean.
All three workers are doing the work and tasks stall simultaneously on all
three. I think Sean hit on my issue. I've been under the impression that each
application has one executor process per worker machine (not per core per
machine). Is that incorrect? If an executor
Unfortunately, after some research I found it's just a side effect of how a
closure containing a var works in Scala:
http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined
The closure keeps referring to the broadcast wrapper var as a pointer,
/CLUSTER1/MAIN/tmp/spark-1.0.1-bin-mapr3.tgz'
to
'/tmp/mesos/slaves/20140815-101817-3334820618-5050-32618-0/frameworks/20140819-101702-3334820618-5050-16778-0001/executors/20140815-101817-3334820618-5050-32618-0/runs/e56fffbe-942d-4b15-a798-a00401387927'
cp: cannot stat
`/usr/local/mesos-0.18.1
Hi,
Is there any way to view the history of application statistics in the master UI after
restarting the master server? I have all the logs in /tmp/spark-events/, but when I
start the history server on this directory it says No Completed Applications
Found. Maybe I could copy these logs to the dir used by the master server, but
Thanks for the quick and clear response! I now have a better understanding
of what is going on regarding the driver and worker nodes which will help me
greatly.
Hi,
We have ~1TB of data to process, but our cluster doesn't have
sufficient memory for such a data set (we have a 5-10 machine cluster).
Is it possible to process 1TB of data using ON DISK options in Spark?
If yes, where can I read about the configuration for ON DISK execution?
Thanks
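One relevant knob, as a sketch: Spark can persist data on local disk instead
of in memory (the path is hypothetical):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///big/input")
data.persist(StorageLevel.DISK_ONLY)   // cache partitions on local disk only
// StorageLevel.MEMORY_AND_DISK spills to disk only when memory fills up
data.count()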
Hi,
I am writing some Scala code to normalize a stream of logs using an
input configuration file (multiple regex patterns). To avoid
re-starting the job, I can read in a new config file using fileStream
and then turn the config file to a map. But I am unsure about how to
update a shared map
Hello, I have a relatively simple Python program that works just fine in local
mode (--master local) but produces a strange error when I try to run it via
YARN (--deploy-mode client --master yarn) or just execute the code through
pyspark. Here's the code:
sc = SparkContext(appName="foo")
input =
Could you post the complete stacktrace?
On Tue, Aug 19, 2014 at 10:47 AM, Aaron aaron.doss...@target.com wrote:
Hello, I have a relatively simple Python program that works just fine in
local mode (--master local) but produces a strange error when I try to run
it via YARN (--deploy-mode
We see this all the time as well; I don't believe there is much of a
relationship between the Spark job status and what YARN shows as the
status.
On Mon, Aug 11, 2014 at 3:17 PM, Shay Rojansky r...@roji.org wrote:
Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0)
It seems that Python jobs
Sure thing, this is the stacktrace from pyspark. It's repeated a few times,
but I think this is the unique stuff.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py", line 583, in collect
Sean, would this work --
rdd.mapPartitions { partition => Iterator(partition) }.foreach(
// Some setup code here
// save partition to DB
// Some cleanup code here
)
I tried a pretty simple example ... I can see that the setup and
cleanup are executed on the executor node, once per
Hi,
Using the Spark 1.0.1 EC2 script I launched 35 m3.2xlarge instances. (I was
using the Singapore region.) Some of the instances came without the ephemeral
internal (non-EBS) SSD devices that are supposed to be attached to them.
Some of them have these drives, but not all, and there is no sign
I think you're looking for foreachPartition(). You've kinda hacked it
out of mapPartitions(). Your case has a simple solution, yes. After
saving to the DB, you know you can close the connection, since you
know the use of the connection has definitely just finished. But it's
not a simpler solution
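A minimal sketch of the foreachPartition() shape (DbConnection is a
hypothetical stand-in for a real database client):

rdd.foreachPartition { partition =>
  val conn = DbConnection.open()   // setup, once per partition
  try {
    partition.foreach(record => conn.save(record))
  } finally {
    conn.close()                   // cleanup runs even if saving fails
  }
}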
These three lines of python code cause the error for me:
sc = SparkContext(appName="foo")
input = sc.textFile("hdfs://[valid hdfs path]")
mappedToLines = input.map(lambda myline: myline.split(","))
The file I'm loading is a simple CSV.
I'm a Scala / Spark / GraphX newbie, so may be missing something obvious.
I have a set of edges that I read into a graph. For an iterative
community-detection algorithm, I want to assign each vertex to a community
with the name of the vertex. Intuitively it seems like I should be able to
pull
This script runs fine without your CSV file. Could you download your
CSV file to local disk, and narrow down to the lines which trigger
this issue?
On Tue, Aug 19, 2014 at 12:02 PM, Aaron aaron.doss...@target.com wrote:
These three lines of python code cause the error for me:
sc =
Thank you, but how do you convert the stream to a Parquet file?
Ali
(+user)
On Tue, Aug 19, 2014 at 12:05 PM, spr s...@yarcdata.com wrote:
I want to assign each vertex to a community with the name of the vertex.
As I understand it, you want to set the vertex attributes of a graph to the
corresponding vertex ids. You can do this using Graph#mapVertices [1] as
I think you have to explicitly list the ephemeral disks in the device
map when launching the EC2 instance.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html
On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak
andras.bar...@lynxanalytics.com wrote:
Hi,
Using the
ankurdave wrote
val g = ...
val newG = g.mapVertices((id, attr) => id)
// newG.vertices has type VertexRDD[VertexId], or RDD[(VertexId,
VertexId)]
Yes, that worked perfectly. Thanks much.
One follow-up question. If I just wanted to get those values into a vanilla
variable (not a
I've set up a YARN (Hadoop 2.4.1) cluster with Spark 1.0.1 and I've
been seeing some inconsistencies with out of memory errors
(java.lang.OutOfMemoryError: unable to create new native thread) when
increasing the number of executors for a simple job (wordcount).
The general format of my submission
Hi,
is there a way to group items in an RDD together such that I
can process them using parallelize/map?
Let's say I have data items with keys 1...1000, e.g.
loading RDD = sc.newAPIHadoopFile(...).cache()
Now, I would like them to be processed in chunks of, e.g., tens
groupBy seems to be exactly what you want.
val data = sc.parallelize(1 to 200)
data.groupBy(_ % 10).values.map(...)
This would let you process 10 Iterable[Int] in parallel, each of which
is 20 ints in this example.
It may not make sense to do this in practice, as you'd be shuffling a
lot of
At 2014-08-19 12:47:16 -0700, spr s...@yarcdata.com wrote:
One follow-up question. If I just wanted to get those values into a vanilla
variable (not a VertexRDD or Graph or ...) so I could easily look at them in
the REPL, what would I do? Are the aggregate data structures inside the
Hi Xiangrui,
Training data: 42945 s out of 124659.
Test data: 42722 s out of 125341.
The ratio is very much the same. I tried Decision Tree; it outputs
decimals between 0 and 1. I don't quite understand it yet.
Would feature scaling make it work for Naive Bayes?
Phuoc Do
On Tue, Aug 19, 2014 at 12:51
When trying to use KMeans.train with some large data and 5 worker nodes, it
would fail due to BlockManagers shutting down because of timeout. I was able to
prevent that by adding
spark.storage.blockManagerSlaveTimeoutMs 300
to the spark-defaults.conf.
However, with 1 Million feature vectors,
I have a simple spark job that seems to hang when saving to hdfs. When
looking at the spark web ui, the job reached 97 of 100 tasks completed. I
need some help determining why the job appears to hang. The job hangs on
the saveAsTextFile() call.
Reviving this thread hoping I might be able to get an exact snippet for the
correct way to do this in Scala. I had a solution for OpenCV that I thought
was correct, but half the time the library was not loaded by the time it was
needed.
Keep in mind that I am completely new at Scala, so you're going
Hi Rafeeq,
I think the following part triggered the bug
https://issues.apache.org/jira/browse/SPARK-2908.
[{"href":null,"rel":"me"}]
It has been fixed. Can you try spark master and see if the error get
resolved?
Thanks,
Yin
On Mon, Aug 11, 2014 at 3:53 AM, rafeeq s rafeeq.ec...@gmail.com wrote:
Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking
this issue.
On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo ce...@zephyrhealthinc.com
wrote:
Thanks, Zhan, for the follow-up.
But do you know how I am supposed to set that table name on the jobConf? I
don't have
don't have
Hi,
The SQLParser used by SQLContext is pretty limited. Instead, can you try
HiveContext?
Thanks,
Yin
On Tue, Aug 19, 2014 at 7:57 AM, wan...@testbird.com wan...@testbird.com
wrote:
sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group
by app_id
Error Log
14/08/19
Hi All,
I have the following code, and if the DStream is empty, Spark Streaming writes
empty files to HDFS. How can I prevent it?
val ssc = new StreamingContext(sparkConf, Minutes(1))
val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
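One way to sketch the guard, skipping batches that contain no data (the
output path is hypothetical):

dStream.foreachRDD { (rdd, time) =>
  // take(1) is a cheap emptiness check; write only non-empty batches
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile("hdfs:///out/batch-" + time.milliseconds)
  }
}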
Thanks! Yeah, it may be related to that. I'll check out that pull request
that was sent and hopefully that fixes the issue. I'll let you know, after
fighting with this issue yesterday I had decided to just leave it on the
side and return to it after, so it may take me a while to get back to you.
Update: it hangs even when not writing to HDFS. I changed the code to avoid
saveAsTextFile() and instead do a foreachPartition and log the results.
This time it hangs at 96/100 tasks, but it still hangs.
I changed the saveAsTextFile to:
stringIntegerJavaPairRDD.foreachPartition(p -> {
Not sure if this is helpful or not, but in one executor stderr log, I found
this:
14/08/19 20:17:04 INFO CacheManager: Partition rdd_5_14 not found, computing
it
14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/08/19
Is there more documentation on using spark-submit with Yarn? Trying to
launch a simple job does not seem to work.
My run command is as follows:
/opt/cloudera/parcels/CDH/bin/spark-submit \
--master yarn \
--deploy-mode client \
--executor-memory 10g \
--driver-memory 10g \
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja aahuj...@gmail.com wrote:
/opt/cloudera/parcels/CDH/bin/spark-submit \
--master yarn \
--deploy-mode client \
This should be enough.
But when I view the job's 4040 page (the Spark UI), there is a single executor (just
the driver node) and I see
Yes, the application is overwriting it. I need to pass it as an argument to
the application, otherwise it will be set to local.
Thanks for the quick reply! Also, yes, the appTrackingUrl is now set
properly as well; before, it just said unassigned.
Thanks!
Arun
On Tue, Aug 19, 2014 at 5:47 PM,
Thanks a lot. Yes, this mapPartitions seems a better way of dealing with this
problem, as for groupBy() I need to collect() data before applying
parallelize(), which is expensive.
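A sketch of the mapPartitions() variant, which forms the chunks inside each
partition with no shuffle and no collect() (process is hypothetical):

// grouped(10) turns each partition's iterator into an iterator of 10-element chunks
val chunks = rdd.mapPartitions(iter => iter.grouped(10))
val results = chunks.map(chunk => process(chunk))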
It is definitely a bug; sqlContext.parquetFile should take both a dir and a
single file as parameter.
This if-check for isDir makes no sense after this commit:
https://github.com/apache/spark/pull/1370/files#r14967550
I opened a ticket for this issue:
https://issues.apache.org/jira/browse/SPARK-3138
The --master should override any other ways of setting the Spark
master.
Ah yes, actually you can set spark.master directly in your application
through SparkConf. Thanks Marcelo.
2014-08-19 14:47 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja
Without the sc.union, my program crashes with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Master removed our application: FAILED at
Scheduling delay is the time required to assign a task to an available
resource.
If you're seeing large scheduler delays, this likely means that other
jobs/tasks are using up all of the resources.
Here's some more info on how to set up Fair Scheduling versus the default
FIFO scheduler:
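As a sketch, the switch itself is a single property (the pool name is
hypothetical):

val conf = new org.apache.spark.SparkConf()
  .setAppName("app")
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO
val sc = new org.apache.spark.SparkContext(conf)
// optionally route this thread's jobs to a named pool defined in fairscheduler.xml
sc.setLocalProperty("spark.scheduler.pool", "myPool")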
Hi folks,
We've posted the first Tachyon meetup, which will be on August 25th and is
hosted by Yahoo! (Limited Space):
http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you there!
Best,
Haoyuan
--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Fantastic!
Sent while mobile. Pls excuse typos etc.
On Aug 19, 2014 4:09 PM, Haoyuan Li haoyuan...@gmail.com wrote:
Hi folks,
We've posted the first Tachyon meetup, which will be on August 25th and is
hosted by Yahoo! (Limited Space):
http://www.meetup.com/Tachyon/events/200387252/ . Hope
this would be awesome. did a jira get created for this? I searched, but
didn't find one.
thanks!
-chris
On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:
Thanks a lot Xiangrui. This will help.
On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com
Hi All,
Is there any example of MLlib decision tree handling categorical variables? My
dataset includes a few categorical variables (20 out of 100 features), so I was
interested in knowing how I can use the current version of the decision tree
implementation to handle this situation. I looked at the
Hey all,
Other than reading the source (not a bad idea in and of itself;
something I will get to soon), I was hoping to find some high-level
implementation documentation. Can anyone point me to such a document(s)?
Thank you in advance.
-Kenny
--
:SIG:!0x1066BA71A5F56C58!:
Hi Matt,
I checked in the YARN code and I don't see any references to
yarn.resourcemanager.address. Have you made sure that your YARN client
configuration on the node you're launching from contains the right configs?
-Sandy
On Mon, Aug 18, 2014 at 4:07 PM, Matt Narrell matt.narr...@gmail.com
Perhaps creating Fair Scheduler pools might help? There's no way to pin
certain nodes to a pool, but you can specify minShare (CPUs). Not sure
if that would help, but worth looking into.
On Tue, Jul 8, 2014 at 7:37 PM, haopu hw...@qilinsoft.com wrote:
In a standalone cluster, is there way
The categorical features must be encoded into indices starting from 0:
0, 1, ..., numCategories - 1. Then you can provide the
categoricalFeatureInfo map to specify which columns contain
categorical features and the number of categories in each. Joseph is
updating the user guide. But if you want to
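A sketch of the call, assuming feature 0 has 3 categories and feature 5 has 4
(the counts, trainingData, and the Spark 1.1 trainClassifier signature are
assumptions):

import org.apache.spark.mllib.tree.DecisionTree

// keys are feature indices, values are the number of categories for that feature
val categoricalFeaturesInfo = Map(0 -> 3, 5 -> 4)
val model = DecisionTree.trainClassifier(trainingData, numClasses = 2,
  categoricalFeaturesInfo, impurity = "gini", maxDepth = 5, maxBins = 32)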
No. Please create one but it won't be able to catch the v1.1 train. -Xiangrui
On Tue, Aug 19, 2014 at 4:22 PM, Chris Fregly ch...@fregly.com wrote:
this would be awesome. did a jira get created for this? I searched, but
didn't find one.
thanks!
-chris
On Tue, Jul 8, 2014 at 1:30 PM,
ö_ö You should send this message to the HBase user list, not the Spark user list...
But I can give you some personal advice about this: keep column families as
few as possible!
At the least, using some prefix of the column qualifier could also be an idea, but
read performance may be worse for your use case like
There are only 5 worker nodes, so please try to reduce the number of
partitions to the number of available CPU cores. 1000 partitions is
too many, because the driver needs to collect the task result from
each partition. -Xiangrui
On Tue, Aug 19, 2014 at 1:41 PM, durin m...@simon-schaefer.net
The ratio should be okay. Could you try to pre-process the data and
map -999.0 to 0 before calling NaiveBayes? Btw, I added a check to
ensure nonnegative feature values:
https://github.com/apache/spark/pull/2038
-Xiangrui
On Tue, Aug 19, 2014 at 1:39 PM, Phuoc Do phu...@vida.io wrote:
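A sketch of that pre-processing step on an RDD[LabeledPoint] (the -999.0
sentinel comes from the Higgs data; variable names hypothetical):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val cleaned = data.map { lp =>
  // replace the -999.0 "missing" marker with 0.0 so every feature is nonnegative
  val fixed = lp.features.toArray.map(v => if (v == -999.0) 0.0 else v)
  LabeledPoint(lp.label, Vectors.dense(fixed))
}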
Hi
bq. does not do well with anything above two or three column families
Current hbase releases, such as 0.98.x, would do better than the above.
5 column families should be accommodated.
Cheers
On Tue, Aug 19, 2014 at 3:06 PM, Wei Liu wei@stellarloyalty.com wrote:
We are doing schema
Chutium, thanks for your advice. I will check out your links.
I sent the email to the wrong email address! Sorry for the spam.
Wei
On Tue, Aug 19, 2014 at 4:49 PM, chutium teng@gmail.com wrote:
ö_ö you should send this message to hbase user list, not spark user
list...
but i can
There is no collect_list in Hive 0.12.
Try this after this ticket is done:
https://issues.apache.org/jira/browse/SPARK-2706
I am also looking forward to this.
Hi,
On Tue, Aug 19, 2014 at 7:01 PM, Patrick McGloin mcgloin.patr...@gmail.com
wrote:
I think the type of the data contained in your RDD needs to be a known
case class and not abstract for createSchemaRDD. This makes sense when
you think it needs to know about the fields in the object to
I replaced -999.0 with 0. Predictions still have the same label. Maybe the negative
features really mess it up.
On Tue, Aug 19, 2014 at 4:51 PM, Xiangrui Meng men...@gmail.com wrote:
The ratio should be okay. Could you try to pre-process the data and
map -999.0 to 0 before calling NaiveBayes? Btw, I
That might not be enough. Reflection is used to determine what the
fields are, thus your class might actually need to have members
corresponding to the fields in the table.
I heard that a more generic method of inputting stuff is coming.
On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer
Hi,
From the documentation, I think only the model fitting part is implemented. What
about the various hypothesis tests and performance indexes used to evaluate the
model fit?
Regards,
Xiaobo Gu
Hi Evan, Patrick and Tobias,
So, it worked for what I needed it to do. I followed Yana's suggestion of
using a parameterized type [T <: Product : ClassTag : TypeTag].
More concretely, I was trying to make the query process a bit more fluent
-- some pseudocode, but with correct types:
val
Hi,
Any further info on this?
Do you think it would be useful if we had an in-memory buffer implemented
that stores the content of the new RDD? In case the buffer reaches a
configured threshold, the contents of the buffer are spilled to the local disk.
This saves us from an OutOfMemory error.
I'm trying to use Spark to process some data using some native functions
I've integrated using JNI, and I pass around a lot of memory I've allocated
inside these functions. I'm not very familiar with the JVM, so I have a
couple of questions.
(1) Performance seemed terrible until I LD_PRELOAD'ed