Re: How State of mapWithState is distributed on nodes

2017-01-17 Thread manasdebashiskar
Does anyone have an answer?
How does state distribution happen among multiple nodes?
I have seen that in a "mapWithState"-based workflow the streaming job simply
hangs when the node containing all the state dies because of an OOM.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-State-of-mapWithState-is-distributed-on-nodes-tp27593p28313.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: mapwithstate Hangs with Error cleaning broadcast

2016-11-02 Thread manasdebashiskar
Yes, 
 In my case, my StateSpec had a small partition count. I increased the
numPartitions and the problem went away. (The details of why the problem was
happening in the first place are elided.)

 TL;DR
 StateSpec takes a "numPartitions", which can be set to a high enough number.
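
For example, a minimal sketch in Scala (the state function, the key/value types and the partition count of 200 are placeholders, not taken from the original job):

import org.apache.spark.streaming.{State, StateSpec}

def trackingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
  state.update(sum)                 // keep a running total per key
  (key, sum)
}

// Spread the state over more partitions so no single executor holds all of it.
val spec = StateSpec.function(trackingFunc _).numPartitions(200)
// val stateStream = keyedDStream.mapWithState(spec)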



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500p27994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can mapWithState state func be called every batchInterval?

2016-10-13 Thread manasdebashiskar
Actually, each element of mapWithState has a timeout component. You can write
a function to "treat" your timeout.

You can match it with your batch size and do fun stuff when the batch ends.

People do session management with the same approach:
when activity is registered the session is refreshed, and the session is
deleted (one way to "treat" it) when the timeout happens.
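
A hedged sketch of what that can look like in Scala (the key/value types, the 30-minute timeout and the returned strings are illustrative only):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

def sessionFunc(user: String, event: Option[String], state: State[Long]): (String, String) = {
  if (state.isTimingOut()) {
    // No activity within the timeout window: "treat" the expired session here.
    (user, "session-expired")
  } else {
    state.update(System.currentTimeMillis())   // activity seen: refresh the session
    (user, "session-active")
  }
}

val spec = StateSpec.function(sessionFunc _).timeout(Minutes(30))
// val sessions = eventsByUser.mapWithState(spec)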

..Manas






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-mapWithState-state-func-be-called-every-batchInterval-tp27877p27898.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Re-partitioning mapwithstateDstream

2016-10-13 Thread manasdebashiskar
StateSpec has a method numPartitions to set the initial number of partitions.

That should do the trick.

...Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-partitioning-mapwithstateDstream-tp27880p27899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Open source Spark based projects

2016-09-23 Thread manasdebashiskar
Check out Spark Packages (https://spark-packages.org/) and you will find a few
awesome and a lot of super-awesome projects.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778p27788.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark stream on yarn oom

2016-09-22 Thread manasdebashiskar
It appears that the Spark version against which your program is compiled is
different from the Spark version you are running your code against.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-stream-on-yarn-oom-tp27766p27782.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Recovered state for updateStateByKey and incremental streams processing

2016-09-17 Thread manasdebashiskar
If you are using Spark 1.6 onwards there is a better solution for you.
It is called mapWithState.

mapWithState takes a state function and an initial RDD.

1) When you start your program for the first time, or the version changes and the new
code can't use the checkpoint, the initial RDD comes in handy.
2) On every other occasion (i.e. program restart after failure, or a
regular stop/start of the same version) the checkpoint works for you.

Also, mapWithState is easier to reason about than updateStateByKey and is
optimized to handle a larger amount of data for the same amount of memory.

I use the same mechanism in production to great success.
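
A minimal sketch of both pieces in Scala (the key/value types and the seed data are placeholders):

import org.apache.spark.streaming.{State, StateSpec}

def updateFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val total = state.getOption().getOrElse(0) + value.getOrElse(0)
  state.update(total)
  (key, total)
}

// Seed the state for a cold start when the old checkpoint cannot be reused:
// val initialRDD = sc.parallelize(Seq(("key1", 10), ("key2", 5)))
// val spec = StateSpec.function(updateFunc _).initialState(initialRDD)
// val counts = keyedStream.mapWithState(spec)   // normal restarts recover from the checkpoint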



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Recovered-state-for-updateStateByKey-and-incremental-streams-processing-tp9p27747.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark and DB connection pool

2016-03-25 Thread manasdebashiskar
Yes, there is.
You can use the default DBCP or your own preferred connection pool manager.
Then when you ask for a connection you get one from the pool.

Take a look at this:
https://github.com/manasdebashiskar/kafka-exactly-once
It is forked from Cody's repo.
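
A hedged sketch of the pattern with Commons DBCP2 (the JDBC URL and credentials are placeholders; this is not code from the linked repo):

import java.sql.Connection
import org.apache.commons.dbcp2.BasicDataSource

object DbPool {
  // One pool per executor JVM, created lazily on first use.
  lazy val dataSource: BasicDataSource = {
    val ds = new BasicDataSource()
    ds.setUrl("jdbc:postgresql://db-host/test")
    ds.setUsername("user")
    ds.setPassword("password")
    ds
  }
  def borrow(): Connection = dataSource.getConnection
}

// rdd.foreachPartition { rows =>
//   val conn = DbPool.borrow()
//   try rows.foreach(row => ())      // write each row using conn
//   finally conn.close()             // close() hands the connection back to the pool
// }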

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-DB-connection-pool-tp26577p26596.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Serialization issue with Spark

2016-03-25 Thread manasdebashiskar
You have not mentioned which task is not serializable.
Including the stack trace is usually a good idea when asking this question.

Usually Spark will tell you which class it is not able to serialize.
If it is one of your own classes then try making it serializable, or make the
field transient so that it only gets created on the executor.
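
For example, a hedged sketch of the transient/lazy pattern (the JDBC URL and table are placeholders; java.sql.Connection is a typical non-serializable member):

class DbWriter(url: String) extends Serializable {
  // Connections cannot be serialized, so each executor builds its own lazily.
  @transient lazy val conn: java.sql.Connection =
    java.sql.DriverManager.getConnection(url)

  def write(record: String): Unit = {
    val stmt = conn.createStatement()
    try stmt.execute(s"INSERT INTO events(msg) VALUES ('$record')")
    finally stmt.close()
  }
}

// val writer = new DbWriter("jdbc:postgresql://db-host/test")  // constructed on the driver
// rdd.foreach(writer.write(_))   // ships fine; the connection itself is created on the executor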

...Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Serialization-issue-with-Spark-tp26565p26595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best way to determine # of workers

2016-03-25 Thread manasdebashiskar
There is an sc.defaultParallelism value that I use to dynamically
maintain elasticity in my application. Depending upon your scenario this
might be enough.
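
For example (a sketch; the multiplier of 2 is an arbitrary choice, and sc is the usual SparkContext, e.g. in spark-shell):

val target = sc.defaultParallelism * 2          // scale work to the cluster's current capacity
val rebalanced = sc.parallelize(1 to 100000).repartition(target)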



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-determine-of-workers-tp26586p26594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Create one DB connection per executor

2016-03-25 Thread manasdebashiskar
You are on the right track.
The only thing you will have to take care of is when two of your partitions try
to access the same connection at the same time.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Create-one-DB-connection-per-executor-tp26588p26593.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark work distribution among execs

2016-03-15 Thread manasdebashiskar
Your input is skewed in terms of the default hash partitioner that is used.
One option is to use a custom partitioner that can redistribute the data
evenly among your executors.

I think you will see the same behaviour when you use more executors; it is
just that the data skew appears to be less. To prove the same, use an even
bigger input for your job.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tp26502p26506.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manasdebashiskar
Hi, 
 I have a streaming application that takes data from a Kafka topic and uses
mapWithState.
 After a couple of hours of smooth running of the application I see a problem
that seems to have stalled my application.
The batch seems to be stuck after the following error popped up.
Has anyone seen this error or does anyone know what causes it?
14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
cleaning broadcast 7456
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
seconds]. This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
at
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
at
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
at
org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[120 seconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problem using limit clause in spark sql

2015-12-25 Thread manasdebashiskar
It can easily be done using an RDD.

rdd.zipWithIndex.partitionBy(YourCustomPartitioner) should give you your
items.
Here YourCustomPartitioner will know how to pick sample items from each
partition.

If you want to stick to DataFrames you can always repartition the data after
you apply the limit.
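
A hedged sketch of the RDD route (the partition count and per-partition sample size are arbitrary; note the swap so the index becomes the key):

import org.apache.spark.Partitioner

class IndexRangePartitioner(partitions: Int, totalElements: Long) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = {
    val idx = key.asInstanceOf[Long]
    math.min(((idx * partitions) / totalElements).toInt, partitions - 1)
  }
}

// val indexed = rdd.zipWithIndex().map(_.swap)                      // (index, item)
// val sampled = indexed
//   .partitionBy(new IndexRangePartitioner(8, rdd.count()))
//   .mapPartitions(_.take(10), preservesPartitioning = true)        // 10 items per partition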

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-using-limit-clause-in-spark-sql-tp25789p25797.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to control number of parquet files generated when using partitionBy

2015-12-10 Thread manasdebashiskar
partitionBy is a suggestive field:
if your value is bigger than what Spark calculates (based on the obvious you
stated), your value will be used.
But repartition is a forced shuffle (it gives you exactly the required number of
partitions).
You might have noticed that repartition caused a bit of a delay (due to the
shuffle).
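
A hedged sketch, assuming an existing DataFrame df with an event_date column (the path and the number 8 are placeholders):

// Force the shuffle first so each partition directory ends up with at most 8 files.
val compacted = df.repartition(8)
compacted.write.partitionBy("event_date").parquet("/data/events_parquet")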

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-number-of-parquet-files-generated-when-using-partitionBy-tp25436p25685.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does Spark SQL have to scan all the columns of a table in text format?

2015-12-10 Thread manasdebashiskar
Yes, 
 a text file is schema-less. Spark does not know what to skip, so it will read
everything.
 Parquet, as you have stated, is capable of taking advantage of predicate
pushdown.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-SQL-have-to-scan-all-the-columns-of-a-table-in-text-format-tp25505p25684.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: "Address already in use" after many streams on Kafka

2015-12-10 Thread manasdebashiskar
You can provide the Spark UI port when launching your context: spark.ui.port
can be set to a different port.
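
For example (4050 is just any free port; 4040 is the default):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-stream-2")
  .set("spark.ui.port", "4050")
// or at submit time: spark-submit --conf spark.ui.port=4050 ...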

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Address-already-in-use-after-many-streams-on-Kafka-tp25545p25683.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming Shuffle to Disk

2015-12-10 Thread manasdebashiskar
how often do you checkpoint?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Shuffle-to-Disk-tp25567p25682.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: State management in spark-streaming

2015-12-10 Thread manasdebashiskar
Have you taken a look at trackStateByKey in Spark Streaming (coming in Spark
1.6)?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/State-management-in-spark-streaming-tp25608p25681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to change StreamingContext batch duration after loading from checkpoint

2015-12-10 Thread manasdebashiskar
Not sure what your requirement is there, but if you have a 2-second streaming
batch, you can create a 4-second stream out of it, but the other way is not
possible.
Basically you can create one stream out of another stream.
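
For example, a sketch in Scala (the source is a placeholder; the base batch interval is 2 seconds):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("rebatch"), Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)
// A 4-second "batch" derived from the 2-second base interval:
val fourSecondStream = lines.window(Seconds(4), Seconds(4))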

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-change-StreamingContext-batch-duration-after-loading-from-checkpoint-tp25624p25680.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark sql random number or sequence numbers ?

2015-12-10 Thread manasdebashiskar
Use zipWithIndex to achieve the same behavior.
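
For example, a minimal sketch (run e.g. in spark-shell, where sc already exists):

// Pair every element with a stable, increasing sequence number:
val numbered = sc.parallelize(Seq("a", "b", "c")).zipWithIndex()
// numbered contains ("a",0), ("b",1), ("c",2)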



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-sql-random-number-or-sequence-numbers-tp25623p25679.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kryo Serialization in Spark

2015-12-10 Thread manasdebashiskar
Are you sure you are using Kryo serialization?
You are getting a Java serialization error.
Are you setting up your SparkContext with Kryo serialization enabled?
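
For example (MyRecord stands in for your own classes):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Int, value: String)

val conf = new SparkConf()
  .setAppName("kryo-check")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)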




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-Serialization-in-Spark-tp25628p25678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Comparisons between Ganglia and Graphite for monitoring the Streaming Cluster?

2015-12-10 Thread manasdebashiskar
We use Graphite monitoring.
Currently we miss having e-mail notifications for an alert. Not sure whether Ganglia
has the same caveat.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Comparisons-between-Ganglia-and-Graphite-for-monitoring-the-Streaming-Cluster-tp25635p25677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: DataFrame: Compare each row to every other row?

2015-12-10 Thread manasdebashiskar
You can use the evil "group by key" and then use a conventional method to compare
each row against the others within that iterable.
If your similarity function yields an iterable of n-1 results for n inputs then you
can use a flatMap to do all that work on the worker side.
Spark also has a cartesian product that might help in your case, though for
500M records it won't be performant.

..Manas





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-Compare-each-row-to-every-other-row-tp25639p25676.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread manasdebashiskar
Is that the only kind of error you are getting?
Is it possible something else dies and gets buried in the other messages?
Try repairing HDFS (fsck etc.) to find out whether the blocks are intact.

A few things to check:
1) Do you have too many small files?
2) Is your system complaining about too many inodes etc.?
3) Try a smaller set while increasing the data set size to make sure it is a
data-volume-related problem.
4) If you have monitoring turned on, see what your driver and worker machines'
CPU and disk I/O look like.
5) Have you tried increasing driver memory? (More partitions means the driver
needs more memory to keep the metadata.)

..Manas





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-Get-Timeout-error-and-FileNotFoundException-when-shuffling-large-files-tp25662p25675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark job submission REST API

2015-12-10 Thread manasdebashiskar
We use the Ooyala job server. It is great. It has a great set of APIs to cancel
jobs, create ad hoc or persistent contexts, etc.
It has great support for remote deploy and tests too, which helps faster
coding.

The current version is missing a job progress bar, but I could not find the
same in the hidden Spark APIs either.

In any case I think the job server is better than the hidden APIs because it is
not hidden.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-submission-REST-API-tp25670p25674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread manasdebashiskar
Have you tried persisting sourceFrame with MEMORY_AND_DISK?
Maybe you can cache updatedRDD, which gets used in the next two lines.

Are you sure you are paying the performance penalty because of shuffling
only?
Yes, group-by is a killer. How much time does your code spend in GC?

I can't tell whether your group-by is actually unavoidable, but there are times when
the data is temporal and operations need just one element before or after;
zipWithIndex and reduce may be used to avoid the group-by call.
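
For example (sourceFrame and updatedRDD are the names from the thread; just a sketch):

// import org.apache.spark.storage.StorageLevel
// sourceFrame.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk rather than recompute
// updatedRDD.cache()                                   // reused by the next two steps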


..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-this-Spark-1-5-2-code-fast-and-shuffle-less-data-tp25671p25673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Possible bug in Spark 1.5.0 onwards while loading Postgres JDBC driver

2015-12-06 Thread manasdebashiskar
My apologies for making this problem sound bigger than it actually is.
After many more coffee breaks I discovered that scalikejdbc's
ConnectionPool.singleton(NOT_HOST_BUT_JDBC_URL, user, password) takes a URL
and not a host (at least for version 2.3+).

Hence it throws a very legitimate-looking error.
The URL takes the structure (for Postgres)
"jdbc:postgresql://10.224.36.151/test"

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-in-Spark-1-5-0-onwards-while-loading-Postgres-JDBC-driver-tp25579p25610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Obtaining Job Id for query submitted via Spark Thrift Server

2015-12-05 Thread manasdebashiskar
You can set your job group and job description programmatically.
I don't know how the interaction with the Thrift server works, but you can always
report back the current job ID via an extra Spark listener.
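
A hedged sketch of both pieces (the group id and description are placeholders):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

class JobIdReporter extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
      .getOrElse("<no group>")
    println(s"job ${jobStart.jobId} started for group $group")   // report back however suits you
  }
}

// sc.addSparkListener(new JobIdReporter())
// sc.setJobGroup("thrift-query-42", "query submitted via the Thrift server")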

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Obtaining-Job-Id-for-query-submitted-via-Spark-Thrift-Server-tp25523p25598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-12-05 Thread manasdebashiskar
When you enable checkpointing your offsets get written as part of the checkpoint
data. If your program dies or shuts down and is later restarted, the
createDirectStream API knows where to start by reading those offsets back from
the checkpoint.

This is as easy as it gets.
However, if you are planning to reuse the same checkpoint folder among
different Spark versions, that is currently not supported.
In that case you might want to write the offsets and topic to your
favourite database. Assuming that DB is highly available, you can later retrieve
the previously processed offsets and start from there.

Take a look at the blog post by Cody (the guy who wrote the Kafka direct stream):
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
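
A minimal sketch of the checkpoint-recovery pattern with the direct stream (Spark 1.x Kafka 0.8 API; the brokers, topic and paths are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-direct-checkpointed")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/kafka-direct-checkpointed")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("my-topic"))
  stream.map(_._2).count().print()     // stand-in for the real pipeline
  ssc
}

// On a restart this rebuilds the context, offsets included, from the checkpoint:
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/kafka-direct-checkpointed", createContext _)
ssc.start()
ssc.awaitTermination()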



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-KafkaUtils-createDirectStream-how-to-start-streming-from-checkpoints-tp25461p25597.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark checkpointing - restrict checkpointing to local file system?

2015-12-05 Thread manasdebashiskar
Try file:///path
an example would be file://localhost/tmp/text.txt
or file://192.168.1.25/tmp/text.txt

...manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpointing-restrict-checkpointing-to-local-file-system-tp25468p25596.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: tmp directory

2015-12-05 Thread manasdebashiskar
If you look at your Spark UI -> Environment tab you can see which paths
point to /tmp.
Typically the Java temp folder is also mapped to /tmp, which can be overridden
with a Java option.

Spark logs go in the /var/run/spark/work/... folder, but I think you already know
that.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/tmp-directory-tp25476p25595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Experiences about NoSQL databases with Spark

2015-12-05 Thread manasdebashiskar
It depends on your need.
Have you looked at Elasticsearch, Accumulo or Cassandra?
If post-processing of your data is not your motive and you just want to
retrieve the data later, Greenplum (based on PostgreSQL) can be an
alternative.

In short, there are many NoSQL stores out there, each with a different project
maturity and feature set.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462p25594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partition RDD of images

2015-12-05 Thread manasdebashiskar
You can use a custom partitioner if your need is specific in any way.
If you care about ordering then you can zipWithIndex your RDD and decide
based on the sequence number of the message.

The following partitioner should work for you (note that Partitioner also
requires numPartitions to be implemented):


import org.apache.spark.Partitioner

class ExactPartitioner[V](partitions: Int, elements: Int) extends Partitioner {

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    // `k` is assumed to go continuously from 0 to elements - 1.
    k * partitions / elements
  }
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/partition-RDD-of-images-tp25515p25592.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Obtaining Job Id for query submitted via Spark Thrift Server

2015-12-05 Thread manasdebashiskar
The Spark UI has a great REST API:
http://spark.apache.org/docs/latest/monitoring.html
If you know your application ID, the rest should be easy.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Obtaining-Job-Id-for-query-submitted-via-Spark-Thrift-Server-tp25523p25591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.5.2 getting stuck when reading from HDFS in YARN client mode

2015-12-05 Thread manasdebashiskar
You can check the executor thread dump to see what your "stuck" executors are
doing.
If you are running a monitoring tool then you can check whether there is
heavy I/O going on (or network usage during that time).
A little more info on what you are trying to do would help.

..manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-2-getting-stuck-when-reading-from-HDFS-in-YARN-client-mode-tp25527p25589.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL doesn't support column names that contain '-','$'...

2015-12-05 Thread manasdebashiskar
Try `X-P-S-T` (with backticks).
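
For example (my_table is a placeholder for a registered table):

// sqlContext.sql("SELECT `X-P-S-T` FROM my_table")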

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-doesn-t-support-column-names-that-contain-tp25529p25588.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala 2.11 and Akka 2.4.0

2015-12-05 Thread manasdebashiskar
There are steps to build Spark using Scala 2.11 in the Spark docs.
The first step is
./dev/change-scala-version.sh 2.11, which switches the build to Scala 2.11.

I have not tried compiling Spark with Akka 2.4.0.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-2-11-and-Akka-2-4-0-tp25535p25587.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-05 Thread manasdebashiskar
Spark has the capability to report to Ganglia, Graphite or JMX.
If none of those works for you, you can register your own extra Spark listener
that does your bidding.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Effective-ways-monitor-and-identify-that-a-Streaming-job-has-been-failing-for-the-last-5-minutes-tp25536p25586.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming - controlling Cached table registered in memory created from each RDD of a windowed stream

2015-12-05 Thread manasdebashiskar
Ans 1) It is the same table name.
Ans 2) I think you mean persist(memory) = cache, and unpersist.
If it is your program that is caching a DataFrame, you unpersist it manually.
I think that if your cached data structure is not being utilized then the new
cache will evict the old one.
But if memory is your concern you can use persist(memory & disk), where your
data will spill to disk.

..Manas




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-controlling-Cached-table-registered-in-memory-created-from-each-RDD-of-a-windowed-stm-tp25547p25585.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to judge a DStream is empty or not after filter operation, so that return a boolean reault

2015-12-05 Thread manasdebashiskar
Usually while processing a DStream one uses foreachRDD.
foreachRDD lets you deal with an RDD, which has an isEmpty method that you
can use.
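
For example (dstream stands for your filtered DStream):

// dstream.foreachRDD { rdd =>
//   if (!rdd.isEmpty()) {
//     // this batch has data; process or save it here
//   }
// }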

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-judge-a-DStream-is-empty-or-not-after-filter-operation-so-that-return-a-boolean-reault-tp25553p25583.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Fwd: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-05 Thread manasdebashiskar
sortByKey is available on a paired RDD.
So if your RDD can be implicitly transformed to a paired RDD,
you can do sortByKey from then on.

This magic is implicitly available if you import
org.apache.spark.SparkContext._ 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-How-to-get-the-list-of-available-Transformations-and-actions-for-a-RDD-in-Spark-Shell-tp25565p25582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to spark streaming application start working on next batch before completing on previous batch .

2015-12-05 Thread manasdebashiskar
I don't think it is possible that way.
Spark Streaming is a mini-batch processing system.

If processing the contents of two batches is your objective, what you can do is:
1) keep a cache (or two) that represents the previous batch(es),
2) on every new batch, replace the old cache, shifting it one time slot back,
3) process each batch as an RDD in parallel.

...Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-spark-streaming-application-start-working-on-next-batch-before-completing-on-previous-batch-tp25559p25581.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: maven built the spark-1.5.2 source documents, but error happened

2015-12-05 Thread manasdebashiskar
What information do you get when you try running with the -X flag?
A few notes before you start building:
1) Use the latest Maven.
2) Set JAVA_HOME just in case.

An example would be

JAVA_HOME=/usr/lib/jvm/java-8-oracle ./make-distribution.sh --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0 -Dscala-2.10 -DskipTests





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/maven-built-the-spark-1-5-2-source-documents-but-error-happened-tp25578p25580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Possible bug in Spark 1.5.0 onwards while loading Postgres JDBC driver

2015-12-05 Thread manasdebashiskar
Hi All, 
 Has anyone tried using the user-defined database API for Postgres on Spark
1.5.0 onwards?
 I have a build that uses
Spark = 1.5.1
ScalikeJDBC = 2.3+
Postgres driver = postgresql-9.3-1102-jdbc41.jar
The Spark SQL API to write a DataFrame to Postgres works,
but writing a Spark RDD to Postgres using ScalikeJDBC does not.
My code gets past the
Class.forName("org.postgresql.Driver") line, which means the driver is
loaded (also, the Spark SQL API works).

Do I need to load the jar differently so that non-Spark code will see it?
Things I have tried
=
1) I have tried shading the Postgres jar into my assembly without any success.
2) I have tried providing the Postgres jar via the extra classpath to executors.
3) Took a sip of coffee and cola in an alternating fashion.

The stack trace is given below.

java.sql.SQLException: No suitable driver found for 10.224.36.151
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at
org.apache.commons.dbcp2.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:77)
at
org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:256)
at
org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:868)
at
org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
at
org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
at
org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:134)
at
scalikejdbc.Commons2ConnectionPool.borrow(Commons2ConnectionPool.scala:41)
at scalikejdbc.DB$.localTx(DB.scala:257)
at com.exactearth.lvi.db.TODB$$anonfun$insertTODB$1.apply(TODB.scala:71)
at com.exactearth.lvi.db.TODB$$anonfun$insertTODB$1.apply(TODB.scala:69)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-in-Spark-1-5-0-onwards-while-loading-Postgres-JDBC-driver-tp25579.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



PairRDD serialization exception

2015-03-11 Thread manasdebashiskar
(This is a repost. Maybe a simpler subject will fetch more attention among
experts.)

Hi,
 I have a CDH 5.3.2 (Spark 1.2) cluster.
 I am getting a "local class incompatible" exception for my Spark
application during an action.
All my classes are case classes (to the best of my knowledge).

I appreciate any help.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 3.0 (TID 346, datanode02):
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
class incompatible:stream classdesc serialVersionUID = 8789839749593513237,
local class serialVersionUID = -4145741279224749316
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Thanks
Manas




-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PairRDD-serialization-exception-tp21999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to debug a hung spark application

2015-02-28 Thread manasdebashiskar
Hi,
 I have a Spark application that hangs on just one task (the remaining 200-300
tasks get completed in reasonable time).
I can see in the thread dump which function gets stuck, however I don't
have a clue as to what value is causing that behaviour.
Also, logging the inputs before the function is executed does not help, as
the actual message gets buried in the logs.

How does one go about debugging such a case?
Also, is there a way I can wrap my function inside some sort of timer-based
environment so that if it took too long I would get a stack trace or something
similar?
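
One way to do the timer wrapping, as a hedged sketch (the 30-second limit and expensiveFunc are placeholders; note the timed-out computation keeps running in the background):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

def withTimeout[A, B](f: A => B, limit: FiniteDuration)(input: A): B =
  try {
    Await.result(Future(f(input)), limit)
  } catch {
    case e: java.util.concurrent.TimeoutException =>
      // Surface the offending input so it stands out in the executor log.
      System.err.println(s"Timed out after $limit on input: $input")
      throw e
  }

// rdd.map(withTimeout(expensiveFunc, 30.seconds))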

Thanks




-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-debug-a-hung-spark-application-tp21859.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.2 + Avro does not work in HDP2.2

2014-12-16 Thread manasdebashiskar
Hi All, 
 I saw some help online about forcing avro-mapred to hadoop2 using
classifiers.

 Now my configuration is this:
 val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier
"hadoop2"

However, I still get java.lang.IncompatibleClassChangeError. I think I am
not building Spark correctly. Clearly the following step is missing
something Avro-related.

/( mvn -Pyarn -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -Phadoop-2.3
-Phive -DskipTests clean package)/


*Can someone please help me build Spark 1.2 for either CDH5.2 or HDP2.2 +
Hive + Avro*

Thanks





-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-Avro-does-not-work-in-HDP2-2-tp20667p20721.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.2 + Avro does not work in HDP2.2

2014-12-12 Thread manasdebashiskar
Hi Experts, 
 I have recently installed HDP2.2 (it depends on Hadoop 2.6).
 My Spark 1.2 is built with the hadoop-2.3 profile.
/( mvn -Pyarn -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -Phadoop-2.3
-Phive -DskipTests clean package)/

 My program has the following dependencies:
/val avro= "org.apache.avro" % "avro-mapred" %"1.7.7"
val spark   = "org.apache.spark" % "spark-core_2.10" % "1.2.0" %
"provided"/

My program to read Avro files fails with the following error. What am I
doing wrong?

Thanks
Manas

java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-Avro-does-not-work-in-HDP2-2-tp20667.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Saving Data only if Dstream is not empty

2014-12-10 Thread manasdebashiskar
Can you do a countApprox as a condition to check for a non-empty RDD?

..Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Data-only-if-Dstream-is-not-empty-tp20587p20617.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: equivalent to sql in

2014-12-10 Thread manasdebashiskar
If you want to take out "apple" and "orange" you might want to try 
dataRDD.filter(_._2 !="apple").filter(_._2 !="orange") and so on.

...Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599p20616.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error outputing to CSV file

2014-12-10 Thread manasdebashiskar
saveAsSequenceFile is a method on an RDD; your object csv is a String.
If you are using spark-shell you can type your object name to see its datatype.
Some prefer Eclipse (and its code completion) to make their lives easier.

..Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-outputing-to-CSV-file-tp20583p20615.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread manasdebashiskar
It is wonderful to see some ideas.
Now for the questions:
1) What is a track segment?
 Ans) It is the line that contains two adjacent points when all points are
arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3).
Then the segments are (p1, p2), (p2, p3) when the time ordering is (t1 < t2
< t3).
2) What is the lag function?
Ans) Sean's link explains it.

A little bit more about my requirement:
 What I need to calculate is a density map of vehicles in a certain area.
Because of a user-specific requirement I can't use just points; I will
have to use segments.
 I already have a gridRDD containing 1 km polygons for the whole world.
My approach is:
1) create a tracksegmentRDD of (vehicle, segment)
2) do a cartesian of tracksegmentRDD and gridRDD and for each row check whether
the segment intersects the polygon. If it does then count it as 1.
3) group the result above by vehicle (probably reduceByKey(_ + _)) to get
the density map.

I am checking an issue
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html
which seems to have some potential. I will give it a try.

..Manas

On Wed, Oct 15, 2014 at 2:55 AM, sowen [via Apache Spark User List] <
ml-node+s1001560n16471...@n3.nabble.com> wrote:

> You say you reduceByKey but are you really collecting all the tuples
> for a vehicle in a collection, like what groupByKey does already? Yes,
> if one vehicle has a huge amount of data that could fail.
>
> Otherwise perhaps you are simply not increasing memory from the default.
>
> Maybe you can consider using something like vehicle and *day* as a
> key. This would make you process each day of data separately, but if
> that's fine for you, might drastically cut down the data associated to
> a single key.
>
> Spark Streaming has a windowing function, and there is a window
> function for an entire RDD, but I am not sure if there is support for
> a 'window by key' anywhere. You can perhaps get your direct approach
> of collecting events working with some of the changes above.
>
> Otherwise I think you have to roll your own to some extent, creating
> the overlapping buckets of data, which will mean mapping the data to
> several copies of itself. This might still be quite feasible depending
> on how big a lag you are thinking of.
>
> PS for the interested, this is what LAG is:
>
> http://www.oracle-base.com/articles/misc/lag-lead-analytic-functions.php#lag
>
> On Wed, Oct 15, 2014 at 1:37 AM, Manas Kar <[hidden email]
> > wrote:
>
> > Hi,
> >  I have an RDD containing Vehicle Number , timestamp, Position.
> >  I want to get the "lag" function equivalent to my RDD to be able to
> create
> > track segment of each Vehicle.
> >
> > Any help?
> >
> > PS: I have tried reduceByKey and then splitting the List of position in
> > tuples. For me it runs out of memory every time because of the volume of
> > data.
> >
> > ...Manas
> >
> > For some reason I have never got any reply to my emails to the user
> group. I
> > am hoping to break that trend this time. :)
>
> -
> To unsubscribe, e-mail: [hidden email]
> 
> For additional commands, e-mail: [hidden email]
> 
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Lag-function-equivalent-in-an-RDD-tp16448p16471.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Lag function equivalent in an RDD, click here
> 
> .
> NAML
> 
>




-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lag-function-equivalent-in-an-RDD-tp16448p16498.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Monitoring with Ganglia

2014-10-05 Thread manasdebashiskar
Have you checked reactive monitoring (https://github.com/eigengo/monitor) or
Kamon monitoring (https://github.com/kamon-io/Kamon)?
Instrumenting needs absolutely no code change; all you do is weaving.
In our environment we use Graphite to collect the statsd (you can also get
dtrace) events and display them.

Hope this helps.

..Manas

On Fri, Oct 3, 2014 at 5:34 PM, TANG Gen [via Apache Spark User List] <
ml-node+s1001560n15705...@n3.nabble.com> wrote:

> Maybe you can follow the instruction in this link
> https://github.com/mesos/spark-ec2/tree/v3/ganglia. For me it works well
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Monitoring-with-Ganglia-tp15538p15705.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>




-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Monitoring-with-Ganglia-tp15538p15763.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Null values in Date field only when RDD is saved as File.

2014-10-03 Thread manasdebashiskar
Correction to my question: (5) should read
5) save the tuple RDD (created at step 3) to HDFS using saveAsTextFile.

Can someone please guide me in the right direction?

Thanks in advance,
Manas

On Fri, Oct 3, 2014 at 11:42 PM, manasdebashiskar [via Apache Spark User
List]  wrote:

> Correction to my question. (5) should read
> 5) save the tuple RDD(created at step 3) to HDFS using SaveAsTextFile.
>
> Can someone please guide me in the right direction?
>
> Thanks in advance
> Manas
> Manas Kar
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Null-values-in-Date-field-only-when-RDD-is-saved-as-File-tp15727p15729.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=bWFuYXNkZWJhc2hpc2thckBnbWFpbC5jb218MXwtMzQ3NzgyNTAy>
> .
> NAML
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>




-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Null-values-in-Date-field-only-when-RDD-is-saved-as-File-tp15727p15730.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Null values in Date field only when RDD is saved as File.

2014-10-03 Thread manasdebashiskar
Correction to my question: (5) should read
5) save the tuple RDD (created at step 3) to HDFS using saveAsTextFile.

Can someone please guide me in the right direction?

Thanks in advance,
Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Null-values-in-Date-field-only-when-RDD-is-saved-as-File-tp15727p15729.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org