Hi,
This doesn't seem to be fixed yet.
I filed an issue in jira: https://issues.apache.org/jira/browse/SPARK-5505
Greg
Hello Sachin,
While Akhil's solution is correct, it is not sufficient for your use case.
RDD.foreach (which Akhil is using) will run on the workers, but you are
creating the Producer object on the driver. This will not work: a producer
created on the driver cannot be used from the workers/executors.
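A common workaround (a sketch of mine, not from this thread; the producer API, broker, and topic names are assumptions) is to create the producer inside foreachPartition, so that it is instantiated on the executor that uses it:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

rdd.foreachPartition { records =>
  // The producer is created here, on the executor, never on the driver.
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  records.foreach { r =>
    producer.send(new ProducerRecord[String, String]("my-topic", r.toString)) // hypothetical topic
  }
  producer.close()
}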
Snapshot builds are not published. Unless you build and install snapshots
locally (like with mvn install), they won't be found.
On Feb 2, 2015 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I'm trying to use the master version of Spark. I build and install it with
$ mvn clean
This is an issue that is hard to resolve without rearchitecting the whole
Kafka Receiver. There are some workarounds worth looking into.
http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E
On Mon, Feb 2, 2015
Or you can use this Low Level Kafka Consumer for Spark:
https://github.com/dibbhatt/kafka-spark-consumer
This is now part of http://spark-packages.org/ and has been running successfully
for the past few months in Pearson's production environment. Being a low-level
consumer, it does not have this re-balancing
Hi Team,
Does Spark support impersonation?
For example, when Spark runs on YARN/Hive/HBase/etc., which user is used by
default? The user which starts the Spark job?
Any suggestions related to impersonation?
--
Thanks,
www.openkb.info
(Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
That's what I did.
On Mon, Feb 2, 2015 at 11:28 PM, Sean Owen so...@cloudera.com wrote:
Snapshot builds are not published. Unless you build and install snapshots
locally (like with mvn install), they won't be found.
On Feb 2, 2015 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
Hi,
I know two options, one for spark-submit and one for spark-shell, but
how do I set it for programs running inside Eclipse?
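Concretely, I assume the programmatic route looks something like this sketch (the app name and master value are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Setting the master in code means no --master flag is needed when
// the program is launched straight from the IDE.
val conf = new SparkConf()
  .setAppName("my-eclipse-app")
  .setMaster("local[*]")
val sc = new SparkContext(conf)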
Regards,
Hi All,
I am trying to run the Kafka word count program;
please find the link for it below:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
I have set the Spark master with setMaster("local[*]")
and I have
Hello,
I'm using Apache Spark Streaming 1.2.0 and trying to define a file filter
for file names when creating an InputDStream
https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/streaming/dstream/InputDStream.html
by invoking the fileStream
You can also do something like
rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
  while (iter.hasNext) iter.next()
})
On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving
There is a file, $SPARK_HOME/conf/spark-env.sh, which comes readily configured
with the MASTER variable. So if you start pyspark or spark-shell from the EC2
login machine, you will connect to the Spark master.
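For reference, the line in question looks something like this (the host is a placeholder):

# $SPARK_HOME/conf/spark-env.sh
export MASTER=spark://<master-host>:7077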
On 29 Jan 2015, at 01:11, Mohit Singh mohit1...@gmail.com wrote:
Hi,
Probably a
That may be the cause of your issue. Take a look at the tuning guide[1] and
maybe also profile your application. See if you can reuse your objects.
1. http://spark.apache.org/docs/latest/tuning.html
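For instance, one of the first suggestions in that guide (my illustration, not something from this thread) is switching to Kryo serialization:

import org.apache.spark.SparkConf

// Kryo is generally faster and more compact than Java serialization.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")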
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
Hi Sean,
Kafka Producer is working fine.
This is related to Spark.
How can I configure Spark so that it will make sure to remember the count
from the beginning?
If my log.text file has:
spark
apache
kafka
spark
My Spark program gives the correct output as:
spark 2
apache 1
kafka 1
but when I
First I would check your code to see how you are pushing records into the
topic. Is it reading the whole file each time and resending all of it?
Then see if you are using the same consumer.id on the Spark side. Otherwise
you are not reading from the same offset when restarting Spark but instead
I got the same problem; maybe the Java serializer is unstable.
Thanks, Sonal.
But it seems to be an error that happened when “cleaning broadcast”?
BTW, what is the timeout of “[30 seconds]”? Can I increase it?
Best,
Yifan LI
On 02 Feb 2015, at 11:12, Sonal Goyal sonalgoy...@gmail.com wrote:
That may be the cause of your issue. Take a look at the
You can use updateStateByKey() to perform the above operation.
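A minimal sketch in Scala (assuming a DStream of (word, count) pairs named wordCounts and a checkpoint directory of my choosing; both names are hypothetical):

ssc.checkpoint("/tmp/checkpoint") // required by updateStateByKey; path is an example

// Fold each batch's counts for a word into its running total.
val updateFunc = (newValues: Seq[Int], running: Option[Int]) =>
  Some(newValues.sum + running.getOrElse(0))

val runningCounts = wordCounts.updateStateByKey[Int](updateFunc)
runningCounts.print()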
On Mon, Feb 2, 2015 at 4:29 PM, Jadhav Shweta jadhav.shw...@tcs.com wrote:
Hi Sean,
Kafka Producer is working fine.
This is related to Spark.
How can I configure Spark so that it will make sure to remember the count from
the
I think this broadcast cleaning (memory block removal?) timeout exception was
caused by:
15/02/02 11:48:49 ERROR TaskSchedulerImpl: Lost executor 13 on
small18-tap1.common.lip6.fr: remote Akka client disassociated
15/02/02 11:48:49 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent
Yes, jobs run as the user that launched them.
If you want to run jobs on a secure cluster then use YARN. Spark standalone
does not support secure Hadoop.
On Mon, Feb 2, 2015 at 5:37 PM, Jim Green openkbi...@gmail.com wrote:
Hi Team,
Does Spark support impersonation?
For example, when Spark
Alright, I found the issue. I wasn't setting the fs.s3.buffer.dir
configuration. Here is the final Spark conf snippet that works:
spark.hadoop.fs.s3n.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
spark.hadoop.fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
spark.hadoop.fs.s3bfs.impl:
On 01/29/2015 08:31 PM, Ankur Dave wrote:
Thanks for the reminder. I just created a PR:
https://github.com/apache/spark/pull/4273
Ankur
Hello,
Thanks for the patch. I applied it on Pregel.scala (in Spark 1.2.0 sources) and
rebuilt
Spark. During execution, at the 25th iteration of Pregel,
Greetings!
SPARK-1476 says that there is a 2G limit for blocks. Is this the same as a 2G
limit for partitions (or approximately so)?
What I had been attempting to do is the following:
1) Start with a moderately large data set (currently about 100GB, but growing).
2) Create about 1,000 files
Hi All,
Is there a way to disable the Spark UI? What I really need is to stop the
startup of the Jetty server.
--
Thanks regards,
Nirmal
Senior Software Engineer- Platform Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
I think you can configure Hadoop/Hive to do impersonation. There is no
difference between a secure and an insecure Hadoop cluster when using kinit.
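For example, Hadoop-side impersonation is controlled by the proxy-user settings in core-site.xml; a sketch, assuming a service user named "spark" (hypothetical):

<!-- core-site.xml: allow the "spark" user to impersonate other users -->
<property>
  <name>hadoop.proxyuser.spark.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.spark.groups</name>
  <value>*</value>
</property>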
Thanks.
Zhan Zhang
On Feb 2, 2015, at 9:32 PM, Koert Kuipers ko...@tresata.com wrote:
Yes, jobs run as the user that launched
Hi Team,
I just spent some time over the past two weeks on Scala and tried all the
Scala-on-Spark functions in the Spark Programming Guide
(http://spark.apache.org/docs/1.2.0/programming-guide.html).
If you need example code for the Scala-on-Spark functions, I created this cheat
sheet
We're planning to use this as well (Dibyendu's
https://github.com/dibbhatt/kafka-spark-consumer ). Dibyendu, thanks for
the efforts. So far it's working nicely. I think there is merit in making it
the default Kafka receiver for Spark Streaming.
-neelesh
On Mon, Feb 2, 2015 at 5:25 PM, Dibyendu
The limit is on blocks, not partitions. Partitions have many blocks.
It sounds like you are creating very large values in memory, but I'm
not sure given your description. You will run into problems if a
single object is more than 2GB, of course. More of the stack trace
might show what is mapping
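One common mitigation (my suggestion, not from this thread) is simply to use more, smaller partitions so that no single serialized block approaches 2GB:

// Hypothetical: spread a ~100GB data set across enough partitions
// that each one (and hence each block) stays well under 2GB.
val repartitioned = bigRdd.repartition(4000)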
Thanks, Neelesh. Glad to know this Low Level Consumer is working for you.
Dibyendu
On Tue, Feb 3, 2015 at 8:06 AM, Neelesh neele...@gmail.com wrote:
We're planning to use this as well (Dibyendu's
https://github.com/dibbhatt/kafka-spark-consumer ). Dibyendu, thanks for
the efforts. So far it's
Mine was not really a moving average problem. It was more like partitioning
on some keys, sorting (on different keys), and then running a sliding
window through the partition. I reverted to MapReduce for that (I
needed secondary sort, which is not very mature in Spark right now).
But, as
You can utilize the following method:
https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream%28java.lang.String,%20scala.Function1,%20boolean,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag%29
It has a parameter:
Hi,
Given the current open issue
https://issues.apache.org/jira/browse/SPARK-5508, I cannot use HiveQL to
insert SchemaRDD data into a table if one of the columns is an Array of
Struct.
Using the Spark API, is it possible to insert a SchemaRDD into an existing
and *partitioned* table?
the method
Thanks, Zhan! Was this introduced in Spark 1.2, or is it available in
Spark 1.1?
On Tue, Feb 3, 2015 at 11:52 AM, Zhan Zhang zzh...@hortonworks.com wrote:
You can set spark.ui.enabled to false to disable the UI.
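For example (a sketch; the app name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Disabling the UI also stops the embedded Jetty server from starting.
val conf = new SparkConf()
  .setAppName("no-ui-app")
  .set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)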
Thanks.
Zhan Zhang
On Feb 2, 2015, at 8:06 PM, Nirmal Fernando
Hi,
I added the checkpoint directory and am now using updateStateByKey():
import com.google.common.base.Optional;
Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
        @Override public Optional<Integer>
I am not sure what the LOADING status means, followed by RUNNING. In the
application UI, I see:
Executor Summary
ExecutorID  Worker                                                         Cores  Memory  State    Logs
1           worker-20150202144112-hadoop-w-1.c.fi-mdd-poc.internal-38874   16     83971   LOADING  stdout stderr
0
Yes
On Monday, February 2, 2015, Mark Hamstra m...@clearstorydata.com wrote:
LOADING is just the state in which new Executors are created but before
they have everything they need and are fully registered to transition to
state RUNNING and begin doing actual work:
Hi all,
I'm trying to use the master version of Spark. I build and install it with
$ mvn clean install
I manage to use it with the following configuration in my build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0-SNAPSHOT" % "provided",
  "org.apache.spark" %%
It seems like some sort of listener/UI error! I say this because I see the
status in the executor web UI as LOADING, but in the application UI, for the
same executor, the status is RUNNING! I have also seen the reverse behavior,
where the application UI indicates a particular executor as LOADING, but
the
Here you go:
JavaDStream<String> textStream =
    ssc.textFileStream("/home/akhld/sigmoid/");
textStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        // TODO Auto-generated method stub
        rdd.foreach(new
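For comparison, the same pattern is more compact in Scala (a sketch; the print action is my own example):

val textStream = ssc.textFileStream("/home/akhld/sigmoid/")
textStream.foreachRDD { rdd =>
  rdd.foreach(record => println(record)) // runs on the executors
}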
Yes, if the Master is unable to register the Executor and transition it to
RUNNING, then the Executor will stay in LOADING state, so this can be
caused by problems in the Master or the Master-Executor communication.
On Mon, Feb 2, 2015 at 9:24 AM, Tushar Sharma tushars...@gmail.com wrote:
Yes
Curious. I guess the first question is whether we've got some sort of
listener/UI error, so that the UI is not accurately reflecting the
Executor's actual state, or whether the LOADING Executor really is
spending a considerable length of time in this "I'm in the process of being
created, but not
Hi Emre,
This is how you do that in Scala:
val lines = ssc.fileStream[LongWritable, Text,
TextInputFormat]("/home/akhld/sigmoid", (t: Path) => true, true)
In Java you can do something like:
jssc.ssc().<LongWritable, Text,
SequenceFileInputFormat>fileStream("/home/akhld/sigmoid", new
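A non-trivial filter goes where the (t: Path) => true placeholder is; for example (my illustration), accepting only .log files:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val logs = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/home/akhld/sigmoid",
  (p: Path) => p.getName.endsWith(".log"), // only pick up .log files
  true) // newFilesOnly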
Yes it would. You can create a key and then partition on it (say, with a
HashPartitioner), and then the join would be faster, as all the similar keys
will go to one partition.
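A sketch of that idea (the RDD names and partition count are mine):

import org.apache.spark.HashPartitioner

// Co-partition both RDDs so that rows with the same key land in the
// same partition; the join then avoids shuffling both sides again.
val part = new HashPartitioner(200)
val left = leftRdd.partitionBy(part).cache()
val right = rightRdd.partitionBy(part)
val joined = left.join(right)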
Thanks
Best Regards
On Sun, Feb 1, 2015 at 5:13 PM, Sunita Arvind sunitarv...@gmail.com wrote:
Hi All
We are joining large tables
Hi all,
I have some questions about the future development of Spark's standalone
resource scheduler. We've heard some users have the requirements to have
multi-tenant support in standalone mode, like multi-user management, resource
management and isolation, whitelist of users. Seems current
Hey Jerry,
I think standalone mode will still add more features over time, but
the goal isn't really for it to become equivalent to what Mesos/YARN
are today. Or at least, I doubt Spark Standalone will ever attempt to
manage _other_ frameworks outside of Spark and become a general
purpose
Hi Patrick,
Thanks a lot for your detailed explanation. For now we have these requirements:
whitelisting the application submitter, per-user resource (CPU, memory) quotas,
and resource allocation in Spark standalone mode. These are quite specific
requirements for production use; generally, these problem