Hello,
I would like to use Spark Streaming over a REST API to get information over
time and with different parameters in the REST query.
I was thinking of using Apache Kafka, but I don't have any experience with it
and I would like some advice.
Thanks.
Best regards,
we = Sigmoid
back-pressuring mechanism = stopping the receiver from receiving more
messages when it is about to exhaust the worker memory. Here's a similar
kind of proposal, https://issues.apache.org/jira/browse/SPARK-7398, if you
haven't seen it already.
Thanks
Best Regards
On Mon, May 18, 2015 at
Tried almost all the options, but it did not work. So I ended up creating
a new IAM user, and the keys of this user are working fine. I am not getting
the Forbidden (403) exception now, but my program seems to run
indefinitely: it does not throw any exception, it just keeps running.
My first thought would be to create 10 RDDs and run your word count on each
of them. I think the Spark scheduler will resolve the dependencies in parallel
and launch 10 jobs.
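A rough Scala sketch of that idea (the input paths and app name are made up for
illustration); a parallel collection on the driver submits the ten word counts
concurrently:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelWordCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-word-counts"))

    // Hypothetical input paths; replace with the real files.
    val inputs = (1 to 10).map(i => s"/data/input-$i.txt")

    // .par makes this a parallel collection, so each foreach body runs in its
    // own thread and the Spark scheduler receives 10 concurrent jobs.
    inputs.par.foreach { path =>
      val counts = sc.textFile(path)
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1L))
        .reduceByKey(_ + _)
      counts.saveAsTextFile(path + ".counts")
    }

    sc.stop()
  }
}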
Best
Ayan
On 18 May 2015 23:41, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote:
Hi,
Consider I have a tab delimited text
Hi,
I am looking for a way to pass configuration parameters to a Spark job.
In general I have a quite simple PySpark job.
def process_model(k, vc):
    # do something
    pass

sc = SparkContext(appName="TAD")
lines = sc.textFile(input_job_files)
result =
Hi Team,
My dataset has the following format:
CELLPHONE,KL_1,KL_2,KL_3,KL_4,KL_5
1120100114,-5.3244062521117e-003,-4.10825709805041e-003,-1.7816995027779e-002,-4.21462029980323e-003,-1.6200555039e-002
i.e., a reader in the first column and the data separated by commas. To load
this data I’m
Who are “we”, what is the mysterious “back-pressuring mechanism”, and is it
part of the Spark distribution (are you talking about an implementation of the
custom feedback loop mentioned in my previous emails below)? I'm asking these
because I can assure you that, at least as of Spark Streaming 1.2.0,
Hi,
I have two streaming RDDs, RDD1 and RDD2, and I want to cogroup them.
The data don't arrive at the same time and sometimes come with some
delay.
Once I have all the data I want to insert it into MongoDB.
For example, imagine that I get:
RDD1 -- T 0
RDD2 -- T 0.5
I do a cogroup between them but I couldn't
Hi,
I would really recommend putting your Flink and Spark dependencies into
different Maven modules.
Having them both in the same project will be very hard, if not impossible:
both projects depend on similar projects with slightly different versions.
I would suggest a Maven module structure
We fix the rate at which the receivers should consume at any given point in
time. We also have a back-pressuring mechanism attached to the receivers, so
the job won't simply crash in the unceremonious way Evo described. Mesos has
some sort of auto-scaling (I read it somewhere); maybe you can look into
Ooow – that is essentially the custom feedback loop mentioned in my previous
emails in generic architecture terms, and what you have done is only one of the
possible implementations, moreover one based on ZooKeeper – there are other possible
designs not using things like ZooKeeper at all and hence
Hi Akhil, the Python wrapper for Spark Job Server did not help me. I actually
need a PySpark code sample which shows how I can call a function from two
threads and execute it simultaneously. Thanks & Regards,
Meethu M
On Thursday, 14 May 2015 12:38 PM, Akhil Das
Hi
So to be clear, do you want to run one operation in multiple threads within
a function, or do you want to run multiple jobs using multiple threads? I am
wondering why the Python threading module can't be used. Or have you already
given it a try?
On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in
Hi all,
Currently, I have written some code to access the Spark master, which was deployed
in standalone mode. I wanted to set a breakpoint in the Spark master, which
was running in a different process. I am wondering whether I need to attach to the
process in IntelliJ, so that when AppClient sends the message to
I'm having issues with submitting a Spark Yarn job in cluster mode when the
cluster filesystem is file:///. It seems that additional resources
(--py-files) are simply being skipped and not being added into the
PYTHONPATH. The same issue may also exist for --jars, --files, etc.
We use a simple NFS
Oh, BTW, it's Spark 1.3.1 on Hadoop 2.4, AMI 3.6.
Sorry for leaving out this information.
I'd appreciate any help!
Ed
2015-05-18 12:53 GMT-04:00 edward cui edwardcu...@gmail.com:
I actually have the same problem, but I am not sure whether it is a Spark
problem or a YARN problem.
I set up a
I have played a bit with the directStream Kafka API. Good work, Cody. These
are my findings; also, can you clarify a few things for me (see below)?
- When auto.offset.reset = smallest and you have 60GB of messages in
Kafka, it takes forever as it reads the whole 60GB at once. largest will
only
Hi, I'm getting this exception after shifting my code from Spark 1.2 to Spark
1.3
15/05/18 18:22:39 WARN TaskSetManager: Lost task 0.0 in stage 1.6 (TID 84,
cloud8-server): FetchFailed(BlockManagerId(1, cloud4-server, 7337),
shuffleId=0, mapId=9, reduceId=1, message=
I am currently using Spark Streaming. During my batch processing I must
groupByKey. Afterwards I call foreachRDD / foreachPartition and write to an
external datastore.
My only concern is whether this is future-proof. I know groupByKey by
default uses the HashPartitioner. I have printed out the
Hi Shay,
Yeah, that seems to be a bug; it doesn't seem to be related to the default
FS nor compareFs either - I can reproduce this with HDFS when copying files
from the local fs too. In yarn-client mode things seem to work.
Could you file a bug to track this? If you don't have a jira account I
Thanks for the heads up mate.
On 18 May 2015 19:08, Evo Eftimov evo.efti...@isecc.com wrote:
Ooow – that is essentially the custom feedback loop mentioned in my
previous emails in generic architecture terms, and what you have done is
only one of the possible implementations, moreover one based on
How about making the range in the for loop parallelised? The driver will then
kick off the word counts independently.
Regards,
Guy Needham | Data Discovery
Virgin Media | Technology and Transformation | Data
Bartley Wood Business Park, Hook, Hampshire RG27 9UP
D 01256 75 3362
I welcome VSRE
On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Xiaohe,
All Spark options must go before the jar or they won't take effect.
-Sandy
On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com
wrote:
Sorry, both of them are assigned tasks
Hi Xiaohe,
All Spark options must go before the jar or they won't take effect.
-Sandy
On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote:
Sorry, both of them are actually assigned tasks.
Aggregated Metrics by Executor
Executor ID | Address | Task Time | Total Tasks | Failed
I actually have the same problem, but I am not sure whether it is a Spark
problem or a YARN problem.
I set up a five-node cluster on AWS EMR and started the YARN daemon on the
master (the node manager is not started on the master by default, and I don't
want to waste any resources since I have to pay).
Why not use Spark Streaming to do the computation, dump the result
into a DB perhaps, and take it from there?
Thanks
Best Regards
On Mon, May 18, 2015 at 7:51 PM, juandasgandaras juandasganda...@gmail.com
wrote:
Hello,
I would like to use spark streaming over a REST api to get
My pleasure, young man. I will even go beyond the so-called heads up and send
you a solution design for a feedback loop that prevents Spark Streaming app clogging
and resource depletion, features machine-learning-based self-tuning, AND
is not ZooKeeper based and hence offers lower latency
PR is opened: https://github.com/apache/spark/pull/6237
On Fri, May 15, 2015 at 17:55, Olivier Girardot ssab...@gmail.com wrote:
yes, please do and send me the link.
@rxin I have trouble building master, but the code is done...
On Fri, May 15, 2015 at 01:27, Haopu Wang hw...@qilinsoft.com
On Fri, May 15, 2015 at 5:09 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Now, we noticed that we get java heap OOM exceptions on the output tracker
when we have too many tasks. I wonder:
1. where does the map output tracker live? The driver? The master (when
those are not the same)?
2.
Most likely, you never call sc.stop().
Note that in 1.4, this will happen for you automatically in a shutdown
hook, taken care of by https://issues.apache.org/jira/browse/SPARK-3090
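For completeness, a tiny hedged sketch of the pattern (app name illustrative):
stop the context explicitly when the work is done, rather than relying on the
1.4 shutdown hook:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example"))
try {
  sc.parallelize(1 to 100).count()  // ... your jobs ...
} finally {
  sc.stop()  // releases the context's cluster resources explicitly
}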
On Wed, May 13, 2015 at 8:04 AM, Yifan LI iamyifa...@gmail.com wrote:
Hi,
I have some applications
SparkContext can be used in multiple threads (Spark streaming works
with multiple threads), for example:
import threading
import time

def show(x):
    time.sleep(1)
    print x

def job():
    sc.parallelize(range(100)).foreach(show)

threading.Thread(target=job).start()
On Mon, May 18,
You can use sc.hadoopFile (or any of the variants) to do what you want.
They even let you reuse your existing HadoopInputFormats. You should be
able to mimic your old use with MR just fine. sc.textFile is just a
convenience method which sits on top.
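A hedged Scala sketch of what that can look like, reusing the old-style
TextInputFormat (swap in your own InputFormat and key/value types; the path is
illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hadoopFile-example"))

// The same machinery sc.textFile uses underneath: an RDD of (key, value)
// pairs produced by a "mapred" InputFormat.
val kv = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/input")

// Drop the byte-offset keys to get textFile-like lines back.
val lines = kv.map { case (_, text) => text.toString }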
imran
On Fri, May 8, 2015 at 12:03 PM, tog
Hi Justin,
It sounds like you're on the right track. The best way to write a custom
Evaluator will probably be to modify an existing Evaluator as you
described. It's best if you don't remove the other code, which handles
parameter set/get and schema validation.
Joseph
On Sun, May 17, 2015 at
I am trying to print a basic twitter stream and receiving the following
error:
15/05/18 22:03:14 INFO Executor: Fetching
http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with
timestamp 1432000973058
15/05/18 22:03:14 INFO Utils: Fetching
Yeah, I read that page before, but it does not mention that the options should
come before the application jar. Actually, if I put the --class option
before the application jar, I will get ClassNotFoundException.
Anyway, thanks again Sandy.
On Tue, May 19, 2015 at 11:06 AM, Sandy Ryza
I think I found the answer -
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html
Do I have no way of running this in Windows locally?
On Mon, May 18, 2015 at 10:44 PM, Justin Pihony justin.pih...@gmail.com
wrote:
Hi,
Thanks for the response, but I could not see the fillna function in the DataFrame class.
Is it available in some specific version of Spark SQL? This is what I have in
my pom.xml:
<dependency>
  <groupId>org.apache.spark</groupId>
I'm not 100% sure that is causing a problem, though. The stream still
starts, but is giving blank output. I checked the environment variables in
the ui and it is running local[*], so there should be no bottleneck there.
On Mon, May 18, 2015 at 10:08 PM, Justin Pihony justin.pih...@gmail.com
Hi Sandy,
Thanks for your information. Yes, spark-submit --master yarn
--num-executors 5 --executor-cores 4
target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp is
working awesomely. Is there any documentation pointing to this?
Thanks,
Xiaohe
On Tue, May 19, 2015 at 12:07 AM,
Awesome!
It's documented here:
https://spark.apache.org/docs/latest/submitting-applications.html
-Sandy
On Mon, May 18, 2015 at 8:03 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi Sandy,
Thanks for your information. Yes, spark-submit --master yarn
--num-executors 5 --executor-cores 4
Hi,
Just figured out that if I want to perform a graceful shutdown of Spark
Streaming 1.4 (from master), Runtime.getRuntime().addShutdownHook no
longer works. In Spark 1.4 there is Utils.addShutdownHook defined for
Spark Core, which gets called anyway, which leads to graceful shutdown
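For reference, a hedged sketch of requesting the graceful stop from application
code instead of a JVM shutdown hook (stream setup elided, names illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("graceful-stop"), Seconds(10))
// ... define DStreams and output operations, then ssc.start() ...

// When an external stop signal arrives: finish in-flight batches first,
// then stop the underlying SparkContext as well.
ssc.stop(stopSparkContext = true, stopGracefully = true)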
depends what you mean by output data. Do you mean:
* the data that is sent back to the driver? that is result size
* the shuffle output? that is in Shuffle Write Metrics
* the data written to a hadoop output format? that is in Output Metrics
On Thu, May 14, 2015 at 2:22 PM, yanwei
I'm not super familiar with this part of the code, but from taking a quick
look:
a) the code creates a MultivariateOnlineSummarizer, which stores 7 doubles
per feature (mean, max, min, etc. etc.)
b) The limit is on the result size from *all* tasks, not from one task.
You start with 3072 tasks
c)
In PySpark, functions/closures are serialized together with the global
values they use.
For example:
global_param = 111

def my_map(x):
    return x + global_param

rdd.map(my_map)
- Davies
On Mon, May 18, 2015 at 7:26 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi,
I am looking for a way to
Looks like this exception is after many more failures have occurred. It is
already on attempt 6 for stage 7 -- I'd try to find out why attempt 0
failed.
This particular exception is probably a result of corruption that can
happen when stages are retried, that I'm working on addressing in
Neither of those two. Instead, the shuffle data is cleaned up when the stages
it comes from get GC'ed by the JVM; that is, when you are no longer
holding any references to anything which points to the old stages, and
there is an appropriate GC event.
The data is not cleaned up right after the
Hi Suman,
For maxIterations, are you using the DenseKMeans.scala example code? (I'm
guessing yes since you mention the command line.) If so, then you should
be able to specify maxIterations via an extra parameter like
--numIterations 50 (note the example uses numIterations in the current
master
Rather than updating the broadcast variable, can't you simply create a
new one? When the old one can be GC'ed in your program, it will also get
GC'ed from Spark's cache (and all executors).
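A rough Scala sketch of that suggestion; loadReferenceData and the socket
source are placeholders made up for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("rebroadcast"), Seconds(10))

def loadReferenceData(): Map[String, Int] = Map("example" -> 1)  // placeholder

// Hold the *current* broadcast in a var; build a fresh one when the data changes.
var reference = ssc.sparkContext.broadcast(loadReferenceData())

val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  // Re-broadcast instead of mutating; the old value becomes eligible for
  // cleanup once nothing references it any more.
  reference = ssc.sparkContext.broadcast(loadReferenceData())
  val current = reference  // capture a stable reference for the closure
  rdd.foreach(line => println(s"$line -> ${current.value.getOrElse(line, 0)}"))
}

ssc.start()
ssc.awaitTermination()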
I think this will make your code *slightly* more complicated, as you need
to add in another layer of
Hi all,
I am reading the docs of the receiver-based Kafka consumer. The last parameter
of KafkaUtils.createStream is the per-topic number of Kafka partitions to
consume. My question is, does the number of partitions per topic in this
parameter need to match the number of partitions in Kafka?
For
Hi Bill,
You don't need to match the number of threads to the number of partitions in
a specific topic. For example, if you have 3 partitions in topic1 but you
only set 2 threads, ideally one thread will receive 2 partitions and the other
thread the remaining partition; it depends on the scheduling
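A hedged Scala sketch of that setup with the receiver-based API; the ZooKeeper
quorum, group id and topic name are made up for illustration, and the map value
is the number of consumer threads rather than a required match of Kafka's
partition count:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-receiver"), Seconds(5))

// topic1 might have 3 Kafka partitions, yet we ask for only 2 consumer
// threads; Kafka's high-level consumer spreads the partitions over them.
val stream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",       // ZooKeeper quorum
  "my-consumer-group",  // consumer group id
  Map("topic1" -> 2))   // topic -> number of consumer threads

stream.map(_._2).print()  // values only

ssc.start()
ssc.awaitTermination()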
Hi,
can you take a look at the logs and see what the first error you are
getting is? Its possible that the file doesn't exist when that error is
produced, but it shows up later -- I've seen similar things happen, but
only after there have already been some errors. But, if you see that in
the
Reducing the number of instances won't help in this case. We use the
driver to collect partial gradients. Even with tree aggregation, it
still puts a heavy workload on the driver with 20M features. Please try
to reduce the number of partitions before training. We are working on
a more scalable
AFAIK, there are two places where you can specify the driver memory.
One is via spark-submit --driver-memory and the other is via
spark.driver.memory in spark-defaults.conf. Please try these
approaches and see whether they work or not. You can find detailed
instructions at
LogisticRegressionWithSGD doesn't support multi-class. Please use
LogisticRegressionWithLBFGS instead. -Xiangrui
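For reference, a hedged Scala sketch of the multi-class variant (the input path
and class count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("multiclass-lr"))

// Labels are expected to be 0, 1, ..., k-1 for k classes.
val training = MLUtils.loadLibSVMFile(sc, "data/sample_multiclass_data.txt")

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(3)   // multinomial logistic regression over 3 classes
  .run(training)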
On Mon, Apr 27, 2015 at 12:37 PM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
With the Python APIs, the available arguments I got (using inspect module)
are the following:
Hi,
I am using spark-sql to read a CSV file and write it as a Parquet file. I am
building the schema using the following code.
String schemaString = "a b c";
List<StructField> fields = new ArrayList<StructField>();
MetadataBuilder mb = new MetadataBuilder();
Thanks, Akhil. So what do folks typically do to increase/contract the capacity?
Do you plug in some cluster auto-scaling solution to make this elastic?
Does Spark have any hooks for instrumenting auto-scaling?
In other words, how do you avoid overwhelming the receivers in a scenario where
your
On 16 May 2015, at 04:39, Anton Brazhnyk
anton.brazh...@genesys.commailto:anton.brazh...@genesys.com wrote:
For me it wouldn't help, I guess, because those newer classes would still be
loaded by a different classloader.
What did work for me with 1.3.1 – removing those classes from Spark's jar
Hi, I'm trying to use broadcast variables in my Spark Streaming program.
val conf = new
SparkConf().setMaster(SPARK_MASTER).setAppName(APPLICATION_NAME)
val ssc = new StreamingContext(conf, Seconds(1))
val LIMIT = ssc.sparkContext.broadcast(5L)
println(LIMIT.value) // this prints 5
val
Hi, I think you can't supply an initial set of centroids to KMeans.
Thanks & Regards,
Meethu M
On Friday, 15 May 2015 12:37 AM, Suman Somasundar
suman.somasun...@oracle.com wrote:
You can use spark.streaming.receiver.maxRate (default: not set):
Maximum rate (number of records per second) at which each receiver will receive
data. Effectively, each stream will consume at most this number of records per
second. Setting this configuration to 0 or a negative number will put no
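A small hedged sketch of setting it programmatically (the value is only an
example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("rate-limited-streaming")
  // Each receiver hands Spark at most 10,000 records per second,
  // however fast the source can produce them.
  .set("spark.streaming.receiver.maxRate", "10000")

val ssc = new StreamingContext(conf, Seconds(2))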
Hi,
Give the dataFrame.fillna function a try to fill up the missing column values.
Best
Ayan
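If you are on the Scala/Java DataFrame API (Spark 1.3.1+), the same
functionality sits under df.na rather than a fillna method; a hedged sketch
with made-up paths and column names:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
val df = sqlContext.parquetFile("/tmp/input.parquet")  // illustrative source

// Fill missing values before writing out: empty string for the listed
// string columns, 0.0 for any remaining numeric columns.
val cleaned = df
  .na.fill("", Seq("a", "b", "c"))
  .na.fill(0.0)

cleaned.saveAsParquetFile("/tmp/output.parquet")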
On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan
ananda.muru...@honeywell.com wrote:
Hi,
I am using spark-sql to read a CSV file and write it as parquet file. I am
building the schema
And if you want to genuinely “reduce the latency” (still within the boundaries
of the micro-batch) THEN you need to design and finely tune the Parallel
Programming / Execution Model of your application. The objective/metric here is:
a) Consume all data within your selected micro-batch