Re: How State of mapWithState is distributed on nodes

2017-01-17 Thread manasdebashiskar
Does anyone have an answer? How does state distribution happen among multiple nodes? I have seen that in a "mapWithState" based workflow the streaming job simply hangs when the node containing all the state dies because of OOM. ..Manas

Re: mapwithstate Hangs with Error cleaning broadcast

2016-11-02 Thread manasdebashiskar
Yes. In my case my StateSpec had a small number of partitions. I increased numPartitions and the problem went away. (Details of why the problem was happening in the first place are elided.) TL;DR: StateSpec takes a "numPartitions" which can be set to a high enough number.
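For reference, a minimal sketch of raising the partition count on a StateSpec; the mapping function and keyedDStream below are placeholders, not the original poster's code:

    import org.apache.spark.streaming._

    // Hypothetical mapping function: keeps a running sum per key.
    def mappingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    // numPartitions controls how many partitions the state RDD is spread over,
    // so the state is not concentrated on a few executors.
    val spec = StateSpec.function(mappingFunc _).numPartitions(200)
    val stateStream = keyedDStream.mapWithState(spec)   // keyedDStream: DStream[(String, Int)]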

Re: Can mapWithState state func be called every batchInterval?

2016-10-13 Thread manasdebashiskar
Actually, each element of mapWithState has a timeout component. You can write a function to "treat" your timeout. You can match it with your batch size and do fun stuff when the batch ends. People do session management with the same approach. When activity is registered, the session is ...
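A rough sketch of handling the timeout inside the state function; the session representation and the five-minute timeout are assumptions for illustration:

    import org.apache.spark.streaming._

    // Hypothetical session tracker: refresh on activity, emit a record when the key times out.
    def trackSession(key: String, event: Option[String], state: State[Long]): Option[(String, String)] = {
      if (state.isTimingOut()) {
        // No activity within the timeout window: treat the session as closed.
        Some((key, "session expired"))
      } else {
        state.update(System.currentTimeMillis())   // record the latest activity time
        None
      }
    }

    val spec = StateSpec.function(trackSession _).timeout(Minutes(5))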

Re: Re-partitioning mapwithstateDstream

2016-10-13 Thread manasdebashiskar
StateSpec has a method numPartitions to set the initial number of partitions. That should do the trick. ...Manas

Re: Open source Spark based projects

2016-09-23 Thread manasdebashiskar
Check out Spark Packages, https://spark-packages.org/, and you will find a few awesome and a lot of super awesome projects.

Re: spark stream on yarn oom

2016-09-22 Thread manasdebashiskar
It appears that the Spark version against which your program is compiled is different from the Spark version you are running your code against. ..Manas

Re: Recovered state for updateStateByKey and incremental streams processing

2016-09-17 Thread manasdebashiskar
If you are using Spark 1.6 onwards there is a better solution for you. It is called mapWithState. mapWithState takes a state function and an initial RDD. 1) When you start your program for the first time, or the version changes and the new code can't use the checkpoint, the initial RDD comes in handy. 2) For ...
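A minimal sketch of seeding mapWithState with an initial RDD; the seed data, mapping function and keyedDStream are placeholders:

    import org.apache.spark.streaming._

    // Hypothetical seed state, e.g. loaded from HDFS on a fresh start instead of from a checkpoint.
    val initialRDD = sc.parallelize(Seq(("user-1", 10), ("user-2", 3)))

    def mappingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    val spec = StateSpec.function(mappingFunc _).initialState(initialRDD)
    val stateStream = keyedDStream.mapWithState(spec)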

Re: Spark and DB connection pool

2016-03-25 Thread manasdebashiskar
Yes, there is. You can use the default DBCP or your own preferred connection pool manager. Then when you ask for a connection you get one from the pool. Take a look at https://github.com/manasdebashiskar/kafka-exactly-once; it is forked from Cody's repo. ..Manas
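One common shape for this is a lazily initialized pool object used from foreachPartition; a sketch assuming the commons-dbcp2 dependency, with the JDBC URL, credentials and table all placeholders:

    // Created once per executor JVM the first time it is referenced.
    object ConnectionPool {
      lazy val dataSource = {
        val ds = new org.apache.commons.dbcp2.BasicDataSource()
        ds.setUrl("jdbc:postgresql://db-host:5432/mydb")   // placeholder URL
        ds.setUsername("user")
        ds.setPassword("password")
        ds
      }
    }

    rdd.foreachPartition { records =>
      val conn = ConnectionPool.dataSource.getConnection()   // one connection per partition
      try {
        val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")   // placeholder table
        records.foreach { r => stmt.setString(1, r.toString); stmt.executeUpdate() }
        stmt.close()
      } finally {
        conn.close()   // returns the connection to the pool rather than physically closing it
      }
    }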

Re: Serialization issue with Spark

2016-03-25 Thread manasdebashiskar
You have not mentioned which task is not serializable. Including the stack trace is usually a good idea when asking this question. Usually Spark will tell you which class it is not able to serialize. If it is one of your own classes, try making it serializable, or make the field transient so that it only gets ...
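Two common fixes, sketched with hypothetical classes (the names and the parsing logic are illustrative only):

    // Option 1: make your own class serializable so it can be shipped inside the closure.
    class Enricher extends Serializable {
      def enrich(s: String): String = s.toUpperCase
    }

    // Option 2: mark a non-serializable member @transient lazy so it is re-created
    // on each executor instead of being serialized with the closure.
    class LineParser extends Serializable {
      @transient lazy val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
      def parse(line: String) = fmt.parse(line.take(10))
    }

    val enricher = new Enricher
    val parser   = new LineParser
    rdd.map(line => (parser.parse(line), enricher.enrich(line)))   // rdd: RDD[String]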

Re: Best way to determine # of workers

2016-03-25 Thread manasdebashiskar
There is an sc.defaultParallelism parameter that I use to dynamically maintain elasticity in my application. Depending upon your scenario this might be enough.

Re: Create one DB connection per executor

2016-03-25 Thread manasdebashiskar
You are on the right track. The only thing you will have to take care of is when two of your partitions try to access the same connection at the same time.

Re: Spark work distribution among execs

2016-03-15 Thread manasdebashiskar
Your input is skewed with respect to the default hash partitioner that is used. Your option is to use a custom partitioner that can redistribute the data evenly among your executors. I think you will see the same behaviour when you use more executors. It is just that the data skew appears to be ...

mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manasdebashiskar
Hi, I have a streaming application that takes data from a Kafka topic and uses mapWithState. After a couple of hours of smooth running I see a problem that seems to have stalled my application. The batch seems to be stuck after the following error popped up. Has anyone ...

Re: Problem using limit clause in spark sql

2015-12-25 Thread manasdebashiskar
It can easily be done using an RDD. rdd.zipWithIndex.partitionBy(yourCustomPartitioner) should give you your items. Here yourCustomPartitioner will know how to pick sample items from each partition. If you want to stick to DataFrames you can always repartition the data after you apply the limit.
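A minimal sketch of the DataFrame variant; the table name and partition count are placeholders:

    // Take the first N rows, then spread the (small) result back over more partitions
    // so downstream work is not bottlenecked on a single partition.
    val limited = sqlContext.sql("SELECT * FROM events LIMIT 1000")   // hypothetical table
    val spread  = limited.repartition(8)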

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread manasdebashiskar
Have you tried persisting sourceFrame with MEMORY_AND_DISK? Maybe you can cache updatedRDD, which gets used in the next two lines. Are you sure you are paying the performance penalty because of shuffling only? Yes, group by is a killer. How much time does your code spend in GC? Can't tell if your ...

Re: Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread manasdebashiskar
Is that the only kind of error you are getting? Is it possible something else dies and gets buried in other messages? Try repairing HDFS (fsck etc.) to find out whether blocks are intact. A few things to check: 1) whether you have too many small files; 2) whether your system is complaining about running out of inodes etc.; 3) ...

Re: Kryo Serialization in Spark

2015-12-10 Thread manasdebashiskar
Are you sure you are using Kryo serialization? You are getting a Java serialization error. Are you setting up your SparkContext with Kryo serialization enabled?
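For reference, a minimal sketch of enabling Kryo on the SparkConf; the registered class is a stand-in for one of your own:

    import org.apache.spark.{SparkConf, SparkContext}

    case class MyEvent(id: Long, payload: String)   // hypothetical application class

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes is optional but avoids Kryo writing full class names per object.
      .registerKryoClasses(Array(classOf[MyEvent]))

    val sc = new SparkContext(conf)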

Re: State management in spark-streaming

2015-12-10 Thread manasdebashiskar
Have you taken a look at trackStateByKey in Spark Streaming (coming in Spark 1.6)?

Re: "Address already in use" after many streams on Kafka

2015-12-10 Thread manasdebashiskar
You can provide the Spark UI port while creating your context; spark.ui.port can be set to a different port. ..Manas
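Two ways to set it, sketched below; the port number is arbitrary:

    // In code, before the context is created:
    val conf = new org.apache.spark.SparkConf().set("spark.ui.port", "4050")

    // Or at submit time:
    // spark-submit --conf spark.ui.port=4050 ...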

Re: How to control number of parquet files generated when using partitionBy

2015-12-10 Thread manasdebashiskar
partitionBy is a suggestion. If your value is bigger than what Spark calculates (based on what you stated), your value will be used. But repartition is a forced shuffle ("give me exactly the required number of partitions") operation. You might have noticed that repartition caused a bit of ...
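A sketch of the usual way to cap the number of output files; the partition count, column name and output path are placeholders:

    // Force the data down to N partitions before writing, so each output directory
    // gets at most N parquet files instead of one per input partition.
    df.repartition(4)
      .write
      .partitionBy("event_date")          // hypothetical partitioning column
      .parquet("hdfs:///out/events")      // placeholder path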

Re: Spark job submission REST API

2015-12-10 Thread manasdebashiskar
We use the Ooyala job server. It is great. It has a great set of APIs to cancel jobs, create ad hoc or persistent contexts, etc. It also has great support for remote deploys and tests, which helps faster coding. The current version is missing a job progress bar, but I could not find the same in the hidden ...

Re: Comparisons between Ganglia and Graphite for monitoring the Streaming Cluster?

2015-12-10 Thread manasdebashiskar
We use Graphite monitoring. Currently we miss having email notifications for alerts. Not sure whether Ganglia has the same caveat. ..Manas

Re: Spark sql random number or sequence numbers ?

2015-12-10 Thread manasdebashiskar
Use zipWithIndex to achieve the same behavior.
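A minimal sketch of attaching sequence numbers with zipWithIndex; the input data is a placeholder:

    val data = sc.parallelize(Seq("a", "b", "c"))
    // zipWithIndex assigns a stable 0-based Long index across all partitions.
    val numbered = data.zipWithIndex()           // RDD[(String, Long)]
    numbered.collect().foreach(println)          // (a,0) (b,1) (c,2)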

Re: Spark Streaming Shuffle to Disk

2015-12-10 Thread manasdebashiskar
How often do you checkpoint?

Re: Does Spark SQL have to scan all the columns of a table in text format?

2015-12-10 Thread manasdebashiskar
Yes. A text file is schema-less; Spark does not know what to skip, so it will read everything. Parquet, as you have stated, is capable of taking advantage of predicate pushdown. ..Manas

Re: DataFrame: Compare each row to every other row?

2015-12-10 Thread manasdebashiskar
You can use the evil "group by key" and a conventional method to compare each row within that iterable. If your similarity function yields n-1 results for n inputs, then you can use a flatMap to do all that work on the worker side. Spark also has a cartesian product that might help in ...
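A rough sketch of the cartesian approach; the data and the similarity metric are placeholders:

    val rows = sc.parallelize(Seq(1.0, 2.0, 3.5))
    // Compare every row to every other row; drop self-pairs.
    val pairs = rows.cartesian(rows).filter { case (a, b) => a != b }
    val similarities = pairs.map { case (a, b) => ((a, b), math.abs(a - b)) }   // placeholder metric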

Re: How to change StreamingContext batch duration after loading from checkpoint

2015-12-10 Thread manasdebashiskar
Not sure what your requirement is there, but if you have a 2-second streaming batch, you can create a 4-second stream out of it; the other way around is not possible. Basically you can create one stream out of another stream. ..Manas
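One way to express this, assuming the underlying context uses a 2-second batch interval and twoSecondStream is the original DStream:

    import org.apache.spark.streaming.Seconds

    // A 4-second "stream" built on top of a 2-second batch: each window covers two batches.
    val fourSecondStream = twoSecondStream.window(Seconds(4), Seconds(4))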

Re: Possible bug in Spark 1.5.0 onwards while loading Postgres JDBC driver

2015-12-06 Thread manasdebashiskar
My apologies for making this problem sound bigger than it actually is. After many more coffee breaks I discovered that scalikejdbc ConnectionPool.singleton(NOT_HOST_BUT_JDBC_URL, user, password) takes a URL and not a host (at least for version 2.3+). Hence it throws a very legitimate-looking ...
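For reference, a minimal sketch of the working call; the database, host and credentials are placeholders:

    import scalikejdbc._

    Class.forName("org.postgresql.Driver")
    // The first argument must be a full JDBC URL, not just a hostname.
    ConnectionPool.singleton("jdbc:postgresql://db-host:5432/mydb", "user", "password")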

Re: Spark SQL doesn't support column names that contain '-','$'...

2015-12-05 Thread manasdebashiskar
Try `X-P-S-T` (backtick). ..Manas
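A minimal sketch of backtick-quoting such a column in Spark SQL; the table name is a placeholder:

    // Backticks let Spark SQL parse column names containing '-', '$', etc.
    val result = sqlContext.sql("SELECT `X-P-S-T` FROM my_table")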

Re: how to spark streaming application start working on next batch before completing on previous batch .

2015-12-05 Thread manasdebashiskar
I don't think it is possible that way. Spark Streaming is a mini-batch processing system. If processing the contents of two batches is your objective, what you can do is: 1) keep a cache (or two) that represents the previous batch(es); 2) every new batch replaces the old cache, shifting it one time slot back; 3) you ...

Re: Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-05 Thread manasdebashiskar
Spark has the capability to report to Ganglia, Graphite or JMX. If none of that works for you, you can register your own extra SparkListener that does your bidding. ..Manas
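A rough sketch of such a listener; the alerting call (println here) is a placeholder for whatever notification mechanism you use:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Hypothetical listener that reacts whenever a stage fails.
    class FailureAlertListener extends SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        stage.stageInfo.failureReason.foreach { reason =>
          println(s"ALERT: stage ${stage.stageInfo.stageId} failed: $reason")   // replace with your alerting
        }
      }
    }

    sc.addSparkListener(new FailureAlertListener)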

Re: Spark 1.5.2 getting stuck when reading from HDFS in YARN client mode

2015-12-05 Thread manasdebashiskar
You can check the executor thread dump to see what your "stuck" executors are doing. If you are running some monitoring tool, you can check whether heavy I/O (or network usage) is going on during that time. A little more info on what you are trying to do would help. ..manas

Re: Fwd: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-05 Thread manasdebashiskar
sortByKey is available for paired RDDs. So if your RDD can be implicitly transformed to a paired RDD, you can do sortByKey from then on. This magic is implicitly available if you import org.apache.spark.SparkContext._
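A minimal sketch of the implicit conversion in action; the data is a placeholder:

    import org.apache.spark.SparkContext._   // brings in the pair-RDD implicits (pre-1.3 style)

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    // sortByKey only exists once the RDD is viewed as an RDD of (key, value) pairs.
    val sorted = pairs.sortByKey()
    sorted.collect().foreach(println)   // (a,1) (b,2) (c,3)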

Re: maven built the spark-1.5.2 source documents, but error happened

2015-12-05 Thread manasdebashiskar
What information do you get when you run with the -X flag? A few notes before you start building: 1) use the latest Maven; 2) set JAVA_HOME just in case. An example would be: JAVA_HOME=/usr/lib/jvm/java-8-oracle ./make-distribution.sh --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0

Re: how to judge a DStream is empty or not after filter operation, so that return a boolean reault

2015-12-05 Thread manasdebashiskar
Usually, while processing a DStream, one uses foreachRDD. foreachRDD hands you an RDD, which has a method isEmpty that you can use. ..Manas
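A minimal sketch; the filter predicate and output path are placeholders:

    val filtered = dstream.filter(_.nonEmpty)   // hypothetical filter on a DStream[String]
    filtered.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Only act when this batch actually produced data.
        rdd.saveAsTextFile(s"hdfs:///out/batch-${System.currentTimeMillis}")   // placeholder path
      }
    }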

Re: Obtaining Job Id for query submitted via Spark Thrift Server

2015-12-05 Thread manasdebashiskar
The Spark UI has a great REST API set: http://spark.apache.org/docs/latest/monitoring.html. If you know your application ID, the rest should be easy. ..Manas

Re: partition RDD of images

2015-12-05 Thread manasdebashiskar
You can use a custom partitioner if your need is specific in any way. If you care about ordering, you can zipWithIndex your RDD and decide based on the sequence of the message. The following partitioner should work for you (the original message is truncated here; a completed sketch follows below): class ExactPartitioner[V](partitions: Int, elements: Int) ...
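A minimal completion of such a partitioner, as a sketch; it assumes the records are keyed by their 0-based zipWithIndex position, and imagesRDD, the counts and the path are placeholders:

    import org.apache.spark.Partitioner

    // Spreads `elements` sequentially indexed keys evenly over `partitions` partitions.
    class ExactPartitioner[V](partitions: Int, elements: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Long]
        // Contiguous ranges of indices land in the same partition, preserving order.
        (k * partitions / elements).toInt
      }
    }

    // Usage: index the records, make the index the key, then partition exactly.
    val indexed = imagesRDD.zipWithIndex().map(_.swap)          // RDD[(Long, Image)]
    val placed  = indexed.partitionBy(new ExactPartitioner(8, 1000))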

Re: Experiences about NoSQL databases with Spark

2015-12-05 Thread manasdebashiskar
Depends on your need. Have you looked at Elasticsearch, Accumulo or Cassandra? If post-processing your data is not your motive and you just want to retrieve the data later, Greenplum (based on PostgreSQL) can be an alternative. In short, there are many NoSQL stores out there, each having ...

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-12-05 Thread manasdebashiskar
When you enable checkpointing, your offsets get written to the checkpoint. If your program dies or shuts down and is later restarted, the Kafka direct stream API knows where to start by looking at those offsets in the checkpoint. This is as easy as it gets. However, if you are planning to reuse the same checkpoint ...

Re: tmp directory

2015-12-05 Thread manasdebashiskar
If you look at your Spark UI -> Environment you can see which paths point to /tmp. Typically the Java temp folder is also mapped to /tmp, which can be overridden by a Java option. Spark logs go in the /var/run/spark/work/... folder, but I think you already know that. ..Manas

Re: Scala 2.11 and Akka 2.4.0

2015-12-05 Thread manasdebashiskar
There are steps to build Spark using Scala 2.11 in the Spark docs. The first step is ./dev/change-scala-version.sh 2.11, which changes the Scala version to 2.11. I have not tried compiling Spark with Akka 2.4.0. ..Manas

Re: Spark checkpointing - restrict checkpointing to local file system?

2015-12-05 Thread manasdebashiskar
Try file:///path. Examples would be file://localhost/tmp/text.txt or file://192.168.1.25/tmp/text.txt. ...manas

Possible bug in Spark 1.5.0 onwards while loading Postgres JDBC driver

2015-12-05 Thread manasdebashiskar
Hi all, has anyone tried using a user-defined database API for Postgres on Spark 1.5.0 onwards? I have a build that uses Spark 1.5.1, ScalikeJDBC 2.3+, and the Postgres driver postgresql-9.3-1102-jdbc41.jar. The Spark SQL API to write a DataFrame to Postgres works, but writing a Spark RDD to Postgres using ...

How to debug a hung spark application

2015-02-28 Thread manasdebashiskar
Hi, I have a Spark application that hangs on just one task (the remaining 200-300 tasks complete in reasonable time). I can see in the thread dump which function gets stuck; however, I don't have a clue as to what value is causing that behaviour. Also, logging the inputs before the function is ...

Re: Spark 1.2 + Avro does not work in HDP2.2

2014-12-16 Thread manasdebashiskar
Hi all, I saw some help online about forcing avro-mapred to hadoop2 using classifiers. Now my configuration is thus: val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2". However I still get java.lang.IncompatibleClassChangeError. I think I am not building Spark ...

Spark 1.2 + Avro does not work in HDP2.2

2014-12-12 Thread manasdebashiskar
Hi experts, I have recently installed HDP 2.2 (depends on Hadoop 2.6). My Spark 1.2 is built with the hadoop-2.3 profile (mvn -Pyarn -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -Phadoop-2.3 -Phive -DskipTests clean package). My program has the following dependencies: val avro = ...

Re: Error outputing to CSV file

2014-12-10 Thread manasdebashiskar
The saveAsSequenceFile method works on an RDD; your object csv is a String. If you are using spark-shell you can inspect your object to know its datatype. Some prefer Eclipse (and its code completion) to make their lives easier. ..Manas

Re: equivalent to sql in

2014-12-10 Thread manasdebashiskar
If you want to take out apple and orange you might want to try dataRDD.filter(_._2 != "apple").filter(_._2 != "orange") and so on. ...Manas

Re: Saving Data only if Dstream is not empty

2014-12-10 Thread manasdebashiskar
Can you do a countApprox as a condition to check for a non-empty RDD? ..Manas
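A rough sketch of that check inside foreachRDD; the timeout and output path are placeholders:

    dstream.foreachRDD { rdd =>
      // countApprox returns a PartialResult; initialValue gives the current estimate
      // after the timeout (here 100 ms) even if counting has not finished.
      val approx = rdd.countApprox(timeout = 100L).initialValue.mean
      if (approx > 0) {
        rdd.saveAsTextFile("hdfs:///out/nonempty-batch")   // placeholder path
      }
    }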

Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread manasdebashiskar
It is wonderful to see some ideas. Now the questions: 1) What is a track segment? Ans) It is the line that connects two adjacent points when all points are arranged by time. Say a vehicle moves (t1, p1) - (t2, p2) - (t3, p3). Then the segments are (p1, p2), (p2, p3) when the time ordering is (t1 ...

Re: Spark Monitoring with Ganglia

2014-10-05 Thread manasdebashiskar
Have you checked reactive monitoring (https://github.com/eigengo/monitor) or Kamon monitoring (https://github.com/kamon-io/Kamon)? Instrumenting needs absolutely no code change; all you do is weaving. In our environment we use Graphite to get the statsd (you can also get dtrace) events and display ...

Re: Null values in Date field only when RDD is saved as File.

2014-10-03 Thread manasdebashiskar
Correction to my question: (5) should read "5) save the tuple RDD (created at step 3) to HDFS using saveAsTextFile". Can someone please guide me in the right direction? Thanks in advance, Manas
