Does anyone have an answer?
How does the state distribution happen among multiple nodes?
I have seen that in a "mapWithState"-based workflow the streaming job simply
hangs when the node containing all the state dies because of OOM.
..Manas
Yes.
In my case, my StateSpec had a small number of partitions. I increased
numPartitions and the problem went away. (Details of why the problem was
happening in the first place are elided.)
TL;DR
StateSpec takes a "numPartitions" which can be set to a high enough number.
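For illustration, a minimal sketch of the fix (the state function, the name pairDstream, and the value 200 are made-up examples, not from this thread):

import org.apache.spark.streaming.{State, StateSpec}

// keeps a running count per key; purely illustrative
def updateCount(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, sum))
}

// numPartitions spreads the state store over more partitions (and hence nodes)
val spec = StateSpec.function(updateCount _).numPartitions(200)
val counts = pairDstream.mapWithState(spec)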
Actually, each element of mapWithState has a timeout component. You can write
a function to handle your timeout.
You can match it with your batch size and do fun stuff when the batch ends.
People do session management with the same approach.
When activity is registered the session is ...
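A rough sketch of that timeout hook (the names and the 30-second value are hypothetical); note that when a key is timing out you can read its state but not update it:

import org.apache.spark.streaming.{Seconds, State, StateSpec}

def trackSession(user: String, event: Option[String], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    // no activity within the timeout: emit/close the session here
    Some((user, state.get()))
  } else {
    state.update(System.currentTimeMillis()) // record the latest activity time
    None
  }
}

val spec = StateSpec.function(trackSession _).timeout(Seconds(30))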
StateSpec has a method numPartitions to set the initial number of partitions.
That should do the trick.
...Manas
Check out Spark Packages (https://spark-packages.org/) and you will find a few
awesome and a lot of super-awesome projects.
It appears that the Spark version against which your program is compiled is
different from the Spark version you are running your code against.
..Manas
If you are using Spark 1.6 onwards there is a better solution for you.
It is called mapWithState.
mapWithState takes a state function and an initial RDD.
1) When you start your program for the first time, or the version changes and the
new code can't use the old checkpoint, the initialRDD comes in handy.
2) For ...
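A sketch of point 1 (the seed data, and updateCount from the earlier sketch, are placeholders):

// seed the state so a fresh start does not begin from zero
val initialRDD = sc.parallelize(Seq(("key1", 10), ("key2", 5)))
val spec = StateSpec.function(updateCount _).initialState(initialRDD)
val stateful = pairDstream.mapWithState(spec)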
Yes, there is.
You can use the default DBCP or your own preferred connection pool manager.
Then when you ask for a connection you get one from the pool.
Take a look at this
https://github.com/manasdebashiskar/kafka-exactly-once
It is forked from Cody's repo.
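A generic sketch of the pool pattern (MyPool and writeRecord are hypothetical helpers, not from that repo): take one connection per partition rather than one per record.

rdd.foreachPartition { records =>
  val conn = MyPool.borrow()                      // hypothetical: hand out a pooled connection
  try {
    records.foreach(r => writeRecord(conn, r))    // hypothetical write helper
  } finally {
    MyPool.giveBack(conn)                         // always return the connection to the pool
  }
}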
..Manas
You have not mentioned which task is not serializable.
The stack trace is usually a good thing to include while asking this question.
Usually Spark will tell you which class it is not able to serialize.
If it is one of your own classes, then try making it serializable, or mark the
offending field transient so that it only gets ...
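For illustration only (the classes and fields are invented), the two usual fixes look like this:

// Option 1: make your own class serializable
class Enricher(table: Map[String, String]) extends Serializable {
  def enrich(k: String): String = table.getOrElse(k, "unknown")
}

// Option 2: mark the non-serializable member @transient and rebuild it lazily on the executor
class Writer(url: String) extends Serializable {
  @transient lazy val client = new java.net.URL(url).openConnection() // placeholder resource
}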
There is an sc.defaultParallelism value that I use to dynamically
maintain elasticity in my application. Depending upon your scenario this
might be enough.
You are on the right track.
The only thing you will have to take care of is the case where two of your
partitions try to access the same connection at the same time.
Your input is skewed with respect to the default hash partitioner that is used.
One option is to use a custom partitioner that can redistribute the data
evenly among your executors.
I think you will see the same behaviour when you use more executors. It is
just that the data skew appears to be ...
Hi,
I have a streaming application that takes data from a Kafka topic and uses
mapWithState.
After a couple of hours of smooth running, I see a problem
that seems to have stalled my application.
The batches seem to be stuck after the following error popped up.
Has anyone ...
It can be easily done using an RDD.
rdd.zipWithIndex.partitionBy(yourCustomPartitioner) should give you your
items.
Here yourCustomPartitioner will know how to pick sample items from each
partition.
If you want to stick to DataFrames you can always repartition the data after
you apply the limit.
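One way to read that suggestion as code (SamplePartitioner and the partition count are made up): key each element by its index and let the partitioner route indices.

import org.apache.spark.Partitioner

class SamplePartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = (key.asInstanceOf[Long] % numPartitions).toInt
}

val keyedByIndex = rdd.zipWithIndex().map { case (item, idx) => (idx, item) }
val routed = keyedByIndex.partitionBy(new SamplePartitioner(8))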
Have you tried persisting sourceFrame with MEMORY_AND_DISK?
Maybe you can cache updatedRDD, which gets used in the next two lines.
Are you sure you are paying the performance penalty because of shuffling
only?
Yes, group by is a killer. How much time does your code spend in GC?
Can't tell if your ...
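For the first two suggestions, a quick sketch (sourceFrame and updatedRDD are the asker's own names):

import org.apache.spark.storage.StorageLevel

sourceFrame.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk instead of recomputing
updatedRDD.cache()                                 // reused by the next two actions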
Is that the only kind of error you are getting?
Is it possible something else dies and gets buried in other messages?
Try repairing HDFS (fsck etc.) to find out whether the blocks are intact.
A few things to check:
1) Do you have too many small files?
2) Is your system complaining about too many inodes, etc.?
3) ...
Are you sure you are using Kryo serialization?
You are getting a Java serialization error.
Are you setting up your SparkContext with Kryo serialization enabled?
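A minimal sketch of wiring Kryo in (the app name and registered class are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord])) // MyRecord is a placeholder class
val sc = new SparkContext(conf)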
Have you taken a look at trackStateByKey in Spark Streaming (coming in Spark
1.6)?
You can provide the Spark UI port while creating your context; spark.ui.port
can be set to a different port.
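For example (the port number is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().set("spark.ui.port", "4050")
val sc = new SparkContext(conf)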
..Manas
The partition count you pass there is a suggestion.
If your value is bigger than what Spark calculates (based on what you
stated), your value will be used.
But repartition is a forced shuffle operation (it gives you exactly the requested
number of partitions).
You might have noticed that repartition caused a bit of ...
We use the Ooyala job server. It is great. It has a great set of APIs to cancel
jobs, create ad-hoc or persistent contexts, etc.
It has great support for remote deploy and tests too, which helps with faster
coding.
The current version is missing a job progress bar, but I could not find the
same in the hidden ...
We use Graphite monitoring.
Currently we miss having email notifications for an alert. Not sure whether
Ganglia has the same caveat.
..Manas
Use zipWithIndex to achieve the same behavior.
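A small sketch of the idea (rdd is whatever you are numbering):

// zipWithIndex assigns a stable 0-based sequence number to every element
val numbered = rdd.zipWithIndex().map { case (row, seq) => (seq, row) }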
How often do you checkpoint?
Yes.
A text file is schema-less. Spark does not know what to skip, so it will read
everything.
Parquet, as you have stated, is capable of taking advantage of predicate
pushdown.
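For example (the path and column names are invented), only the referenced columns and the matching row groups are read:

val df = sqlContext.read.parquet("/data/events.parquet")
df.select("userId", "ts").filter(df("ts") > 1000L).show()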
..Manas
You can use the evil groupByKey and use a conventional method to compare
each row against the others within that iterable.
If your similarity function produces n-1 results for n inputs, then you can use
a flatMap to do all that work on the worker side.
Spark also has a cartesian product that might help in ...
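A rough sketch of the groupByKey route (data and similarity are hypothetical names):

// data: RDD[(K, Row)]; compare every pair of rows sharing a key, on the workers
val scored = data.groupByKey().flatMap { case (key, rows) =>
  val list = rows.toList
  for {
    i <- list.indices
    j <- (i + 1) until list.size
  } yield (key, similarity(list(i), list(j)))
}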
Not sure what your requirement is there, but if you have a 2-second streaming
batch, you can create a 4-second stream out of it; the other way around is not
possible.
Basically, you can create one stream out of another stream.
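For example (assuming a 2-second batch interval and a stream named twoSecondStream):

import org.apache.spark.streaming.Seconds

// a 4-second view built on top of the 2-second stream
val fourSecondStream = twoSecondStream.window(Seconds(4), Seconds(4))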
..Manas
My apologies for making this problem sound bigger than it actually is.
After many more coffee breaks I discovered that scalikejdbc
ConnectionPool.singleton(NOT_HOST_BUT_JDBC_URL, user, password) takes a URL
and not a host (at least for version 2.3+).
Hence it throws a very legitimate-looking ...
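In other words, something like this works (host, database and credentials are placeholders):

import scalikejdbc.ConnectionPool

Class.forName("org.postgresql.Driver")
ConnectionPool.singleton("jdbc:postgresql://dbhost:5432/mydb", "dbuser", "dbpass")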
Try `X-P-S-T` (back ticks).
..Manas
I don't think it is possible that way.
Spark Streaming is a mini-batch processing system.
If processing the contents of two batches is your objective, what you can do is:
1) keep a cache (or two) that represents the previous batch(es).
2) every new batch replaces the old cache, shifting it back by one time slot.
3) you ...
Spark has the capability to report to Ganglia, Graphite, or JMX.
If none of those works for you, you can register your own Spark listener
that does your bidding.
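A small sketch of such a listener (what you do with the event is up to you):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // forward whatever you need (stage, duration, metrics) to your own sink
    println(s"task finished in stage ${taskEnd.stageId}")
  }
})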
..Manas
You can check the executor thread dump to see what your "stuck" executors are
doing.
If you are running some monitoring tool, you can check whether there is
heavy IO going on (or network usage during that time).
A little more info on what you are trying to do would help.
..Manas
sortByKey is available for paired RDDs.
So if your RDD can be implicitly transformed to a paired RDD,
you can use sortByKey from then on.
This magic is available implicitly if you import
org.apache.spark.SparkContext._
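For example:

import org.apache.spark.SparkContext._ // brings in the pair-RDD implicits (Spark 1.x)

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
val sorted = pairs.sortByKey() // ascending by default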
What information do you get when you try running with the -X flag?
A few notes before you start building:
1) Use the latest Maven.
2) Set JAVA_HOME, just in case.
An example would be
JAVA_HOME=/usr/lib/jvm/java-8-oracle ./make-distribution.sh --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0
Usually while processing a DStream one uses foreachRDD.
foreachRDD gives you an RDD to work with, which has an isEmpty method that you
can use.
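For example (the output path is made up):

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // only do work when the micro-batch actually contains data
    rdd.saveAsTextFile(s"/out/batch-${System.currentTimeMillis}")
  }
}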
..Manas
The Spark UI has a great REST API set.
http://spark.apache.org/docs/latest/monitoring.html
If you know your application ID, the rest should be easy.
..Manas
You can use a custom partitioner if your need is specific in any way.
If you care about ordering, you can zipWithIndex your RDD and decide
based on the sequence of the message.
The following partitioner should work for you (the original snippet was cut off;
the body below is a reconstruction assuming integer keys in the range [0, elements)):

import org.apache.spark.Partitioner

class ExactPartitioner[V](partitions: Int, elements: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key.asInstanceOf[Int] * partitions / elements
}
Depends on your need.
Have you looked at Elasticsearch, Accumulo, or Cassandra?
If post-processing of your data is not your motive and you just want to
retrieve the data later, Greenplum (based on PostgreSQL) can be an
alternative.
In short, there are many NoSQL stores out there, each having ...
When you enable checkpointing, your offsets get written into the checkpoint. If
your program dies or shuts down and is later restarted, the Kafka direct stream
API knows where to start by looking at those offsets in the checkpoint.
This is as easy as it gets.
However, if you are planning to re-use the same checkpoint ...
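The usual recovery pattern looks roughly like this (paths, batch interval, and the stream setup are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  ssc.checkpoint("hdfs:///checkpoints/myapp")
  // build the Kafka direct stream and the rest of the job here
  ssc
}

// reuse the checkpoint if it exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/myapp", createContext _)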
If you look at your Spark UI -> Environment, you can see which paths are
pointing to /tmp.
Typically the Java temp folder is also mapped to /tmp, which can be overridden
with a Java option (-Djava.io.tmpdir).
Spark logs go in the /var/run/spark/work/... folder, but I think you already
know that.
..Manas
There are steps to build Spark using Scala 2.11 in the Spark docs.
The first step is
./dev/change-scala-version.sh 2.11, which changes the Scala version to 2.11.
I have not tried compiling Spark with Akka 2.4.0.
..Manas
Try file:///path.
An example would be file://localhost/tmp/text.txt
or file://192.168.1.25/tmp/text.txt.
...Manas
Hi All,
Has anyone tried using a user-defined database API for Postgres on Spark
1.5.0 onwards?
I have a build that uses
Spark = 1.5.1
ScalikeJDBC = 2.3+
Postgres driver = postgresql-9.3-1102-jdbc41.jar
The Spark SQL API to write a DataFrame to Postgres works.
But writing a Spark RDD to Postgres using ...
Hi,
I have a Spark application that hangs on doing just one task (the rest of the
200-300 tasks complete in reasonable time).
I can see in the thread dump which function gets stuck; however, I don't
have a clue as to what value is causing that behaviour.
Also, logging the inputs before the function is ...
Hi All,
I saw some help online about forcing avro-mapred to hadoop2 using
classifiers.
Now my configuration is thus:
val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
However, I still get java.lang.IncompatibleClassChangeError. I think I am
not building Spark ...
Hi Experts,
I have recently installed HDP 2.2 (depends on Hadoop 2.6).
My Spark 1.2 is built with the hadoop-2.3 profile:
(mvn -Pyarn -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -Phadoop-2.3
-Phive -DskipTests clean package)
My program has the following dependencies:
val avro = ...
The saveAsSequenceFile method works on an RDD; your object csv is a String.
If you are using spark-shell you can enter your object's name to see its datatype.
Some prefer Eclipse (and its code completion) to make their lives easier.
..Manas
-
Manas Kar
If you want to take out apple and orange you might want to try
dataRDD.filter(_._2 != apple).filter(_._2 != orange) and so on.
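A single filter with a Set reads a bit closer to SQL's IN (the values are placeholders):

val excluded = Set("apple", "orange")
val kept = dataRDD.filter { case (_, fruit) => !excluded.contains(fruit) }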
...Manas
-
Manas Kar
Can you use countApprox as a condition to check for a non-empty RDD?
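A sketch of that idea (the timeout and confidence values are arbitrary); rdd.take(1).nonEmpty is a simpler exact alternative:

val bound = rdd.countApprox(timeout = 500L, confidence = 0.90).initialValue
if (bound.low > 0) {
  // confident the batch is non-empty; go ahead and save it
}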
..Manas
-
Manas Kar
It is wonderful to see some ideas.
Now the questions:
1) What is a track segment?
Ans) It is the line that connects two adjacent points when all points are
arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3).
Then the segments are (p1, p2), (p2, p3) when the time ordering is (t1 ...
Have you checked reactive monitoring (https://github.com/eigengo/monitor) or
Kamon monitoring (https://github.com/kamon-io/Kamon)?
Instrumenting needs absolutely no code change; all you do is weaving.
In our environment we use Graphite to get the statsd (you can also get
dtrace) events and display ...
Correction to my question. (5) should read:
5) Save the tuple RDD (created at step 3) to HDFS using saveAsTextFile.
Can someone please guide me in the right direction?
Thanks in advance,
Manas
-
Manas Kar