Re: MLLib SVMWithSGD is failing for large dataset

2016-07-06 Thread Chitturi Padma
Hi Sarath,

By any chance, have you resolved this issue?

Thanks,
Padma CH

On Tue, Apr 28, 2015 at 11:20 PM, sarath [via Apache Spark User List] <
ml-node+s1001560n22694...@n3.nabble.com> wrote:

>
> I am trying to train a large dataset consisting of 8 million data points
> and 20 million features using SVMWithSGD. But it is failing after running
> for some time. I tried increasing num-partitions, driver-memory,
> executor-memory, driver-max-resultSize. Also I tried by reducing the size
> of dataset from 8 million to 25K (keeping number of features same 20 M).
> But after using the entire 64GB driver memory for 20 to 30 min it failed.
>
> I'm using a cluster of 8 nodes (each with 8 cores and 64G RAM).
> executor-memory - 60G
> driver-memory - 60G
> num-executors - 64
> And other default settings
>
> This is the error log :
>
> 15/04/20 11:51:09 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/04/20 11:51:29 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 15/04/20 11:51:29 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> 15/04/20 11:56:11 WARN TransportChannelHandler: Exception in connection
> from xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> ...
> 15/04/20 11:56:11 ERROR TransportResponseHandler: Still have 7 requests
> outstanding when connection from xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029 is
> closed
> 15/04/20 11:56:11 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> ...
> 15/04/20 11:56:11 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> ...
> 15/04/20 11:56:12 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to
> xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:149)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:290)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
>
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
> at
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
>
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused:
> xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>

Re: Spark work distribution among execs

2016-03-15 Thread Chitturi Padma
By default, Spark on YARN uses 2 executors with one core each. Have you
allocated more executors using the command-line arguments, e.g.
--num-executors 25 --executor-cores x ?
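
For reference, a minimal sketch of the equivalent allocation done
programmatically (the numbers below are placeholders, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("json-to-parquet")
  .set("spark.executor.instances", "25")   // same effect as --num-executors 25 on YARN
  .set("spark.executor.cores", "4")        // same effect as --executor-cores 4
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)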

Also, what do you mean by "the difference between the nodes is huge"?

Regards,
Padma Ch

On Tue, Mar 15, 2016 at 6:57 PM, bkapukaranov [via Apache Spark User List] <
ml-node+s1001560n2650...@n3.nabble.com> wrote:

> Hi,
>
> I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
> I observe a very strange issue.
> I run a simple job that reads about 1TB of json logs from a remote HDFS
> cluster and converts them to parquet, then saves them to the local HDFS of
> the Hadoop cluster.
>
> I run it with 25 executors with sufficient resources. However the strange
> thing is that the job only uses 2 executors to do most of the read work.
>
> For example when I go to the Executors' tab in the Spark UI and look at
> the "Input" column, the difference between the nodes is huge, sometimes 20G
> vs 120G.
>
> 1. What is the cause for this behaviour?
> 2. Any ideas how to achieve a more balanced performance?
>
> Thanks,
> Borislav
>





Re: OOM Exception in my spark streaming application

2016-03-14 Thread Chitturi Padma
Something like below ...

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
at org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:324)
at org.apache.spark.io.SnappyOutputStreamWrapper.close(CompressionCodec.scala:203)
at com.esotericsoftware.kryo.io.Output.close(Output.java:168)
at org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)


On Mon, Mar 14, 2016 at 8:35 PM, adamreith [via Apache Spark User List] <
ml-node+s1001560n26483...@n3.nabble.com> wrote:

> What you mean? I 've pasted the output in the same format used by spark...
>





Re: OOM Exception in my spark streaming application

2016-03-14 Thread Chitturi Padma
Hi,

 Can you please post the stack trace line by line? It is a bit difficult to
read it as one paragraph and make sense of it.

On Mon, Mar 14, 2016 at 3:11 PM, adamreith [via Apache Spark User List] <
ml-node+s1001560n26479...@n3.nabble.com> wrote:

> Hi,
>
> I'm using spark 1.4.1 and i have a simple application that create a
> dstream that read data from kafka and apply a filter transformation on it.
> After more or less a day throw the following exception:
>
> *Exception in thread "dag-scheduler-event-loop"
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
> at
> org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
> at
> org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
> at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
> at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:324)
> at
> org.apache.spark.io.SnappyOutputStreamWrapper.close(CompressionCodec.scala:203)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:168) at
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
> at
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
> at
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1291) at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
> at
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1426)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) [Stage
> 53513:>  (0 + 0) /
> 4]Exception in thread "JobGenerator" java.lang.OutOfMemoryError: GC
> overhead limit exceeded at
> sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41) at
> java.net.URL.openConnection(URL.java:972) at
> java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:237) at
> java.lang.Class.getResourceAsStream(Class.java:2223) at
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:38)
> at
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:98)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at
> org.apache.spark.SparkContext.clean(SparkContext.scala:1893) at
> org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at
> org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at
> org.apache.spark.rdd.RDD.map(RDD.scala:293) at
> org.apache.spark.streaming.dstream.MappedDStream$$anonfun$compute$1.apply(MappedDStream.scala:35)
> at
> org.apache.spark.streaming.dstream.MappedDStream$$anonfun$compute$1.apply(MappedDStream.scala:35)
> at scala.Option.map(Option.scala:145) at
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at scala.Option.orElse(Option.scala:25

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Chitturi Padma
Hi All,

  I am facing the same issue. Taking k from 60 to 120 in increments of 10
(i.e. k = 60, 70, 80, ..., 120), the algorithm takes around 2.5 hours on an
800 MB data set with 38 dimensions.
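
For context, a minimal sketch of the kind of loop I mean, where sc is the
SparkContext and the input path and iteration count are placeholders;
caching the parsed vectors once, before training, is the first thing I am
checking:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///path/to/data")    // placeholder path
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()                                        // avoid re-reading and re-parsing the 800 MB input for every k

for (k <- 60 to 120 by 10) {
  val model = KMeans.train(data, k, 20)           // 20 iterations is a placeholder
  println(s"k=$k cost=${model.computeCost(data)}")
}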

On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] <
ml-node+s1001560n2227...@n3.nabble.com> wrote:

> Hi Jao,
>
> Sorry to pop up this old thread. I am have the same problem like you did.
> I want to know if you have figured out how to improve k-means on Spark.
>
> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
> cluster has 7 executors, each has 8 cores...
>
> If I set k=5000 which is the required value for my task, the job goes on
> forever...
>
>
> Thanks,
> David
>
>





Re: forgetfulness in clustering algorithm

2016-03-12 Thread Chitturi Padma
Hi,

  I am interested in the Streaming k-means algorithm and its forgetfulness
parameter. Can someone please throw some light on this?
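
For what it is worth, in MLlib the forgetfulness knob is exposed on
StreamingKMeans as a decay factor, or equivalently a half-life; a minimal
sketch (k, the dimensionality and the half-life value are placeholders):

import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(5)                       // placeholder number of clusters
  .setHalfLife(10, "batches")    // after 10 batches, older data carries half the weight of fresh data
  .setRandomCenters(38, 0.0)     // placeholder dimensionality and initial center weight

// trainingData is assumed to be a DStream[org.apache.spark.mllib.linalg.Vector]
// model.trainOn(trainingData)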

On Wed, Jul 29, 2015 at 11:23 AM, AmmarYasir [via Apache Spark User List] <
ml-node+s1001560n24050...@n3.nabble.com> wrote:

>
> I read the post regarding clustering " Introducing streaming k-means in
> park 1.2 " at https://databricks.com/blog/.
>
> I was interested in some detail regarding " the forgetfulness" in
> clustering techniques please can
> any one suggested me any material e.g research papers or article that
> would be worthwhile.
>





Re: rdd.collect.foreach() vs rdd.collect.map()

2016-02-24 Thread Chitturi Padma
If you want the processing to happen in parallel, do not collect the RDD
first; collect (like other result-returning actions such as count or first)
computes the result and brings it back to the driver. rdd.map does its
processing in parallel on the executors. Once you have the processed RDD,
save it to the DB from the workers.

rdd.foreach also executes on the workers; in fact, it returns Unit.
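
A minimal sketch of that pattern, where transform() and saveToDb() are
hypothetical stand-ins for your own processing and database write:

val processed = rdd.map(record => transform(record))   // runs in parallel on the executors

processed.foreachPartition { records =>
  // one connection/batch per partition rather than per record; saveToDb is a hypothetical helper
  records.foreach(row => saveToDb(row))
}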



On Wed, Feb 24, 2016 at 11:56 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26325...@n3.nabble.com> wrote:

> @Chitturi-Thanks a lot for replying
>
> 2 followup questions :
>
> 1. what if I am not collecting Rdd, then will Rdd.foreach() and Rdd.map()
> do processing in parallel ?
>
>
> 2. Let's say I have to get the results first and then do something before
> saving them into database. But I want to do that in parallel? How should I
> do it ? I am using Rdd.collect().foreach(), but it is not doing
> processing in parallel.
>
> Regards
> Anurag
>





Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-24 Thread Chitturi Padma
Hi,

 I didn't quite get the point you are making, i.e. "distribute computation
across nodes by restricting parallelism on each node". Do you mean that you
expect only one task to run per node?
Can you please paste the configuration changes you made?

On Wed, Feb 24, 2016 at 11:24 PM, firemonk91 [via Apache Spark User List] <
ml-node+s1001560n26323...@n3.nabble.com> wrote:

> Can you paste the logs as well.
>





Re: rdd.collect.foreach() vs rdd.collect.map()

2016-02-24 Thread Chitturi Padma
rdd.collect() never does any processing on the workers: it brings the
entire RDD back to the driver as an in-memory collection, so anything
chained after it runs on the driver only.
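
To make the distinction concrete (transform is a hypothetical function):

val onDriver  = rdd.collect().map(transform)   // collect() ships every element to the driver; the map is then a local Scala call
val onWorkers = rdd.map(transform)             // stays an RDD transformation, evaluated in parallel on the executors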

On Wed, Feb 24, 2016 at 10:58 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26320...@n3.nabble.com> wrote:

> Hi Everyone
>
> I am new to Scala and Spark.
>
> I want to know
>
> 1. does Rdd.collect().foreach() do processing in parallel?
>
> 2. does Rdd.collect().map() do processing in parallel ?
>
> Thanks in advance.
> Regards
> Anurag
>





Re: Read from kafka after application is restarted

2016-02-23 Thread Chitturi Padma
Hi Vaibhav,

  As you said, from the second link I can see that it fails to cast the
class when reading it back from the checkpoint. Can you try an explicit
cast, like asInstanceOf[T], on the broadcasted value?

From the bug report, it looks like it affects version 1.5. Try a sample
word-count program with the same Spark 1.3 version and see whether you can
reproduce the error. If you can, switch the Spark version to 1.4 or 1.5 and
check whether the issue is still there; if not, it has presumably been
fixed in a later release.

As the direct API was only introduced in Spark 1.3, it is likely to still
have bugs.
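
For reference, the restart path I have in mind is the usual
StreamingContext.getOrCreate pattern; a minimal sketch, where the checkpoint
directory and the body of the factory function are placeholders for your own
setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///path/to/checkpoint"           // placeholder

def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-restart-demo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // build the DStream graph here: Kafka input, transformations, output operations
  ssc
}

// First run: calls the factory. Restart: rebuilds the context and its DStream state from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
ssc.start()
ssc.awaitTermination()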

On Tue, Feb 23, 2016 at 5:09 PM, vaibhavrtk1 [via Apache Spark User List] <
ml-node+s1001560n26304...@n3.nabble.com> wrote:

> Hello
>
> I have tried with Direct API but i am getting this an error, which is
> being tracked here https://issues.apache.org/jira/browse/SPARK-5594
>
> I also tried using Receiver approach with Write Ahead Logs ,then this
> issue comes
> https://issues.apache.org/jira/browse/SPARK-12407
>
> In both cases it seems it is not able to get the broadcasted variable from
> checkpoint directory.
> Attached is the screenshot of errors I faced with both approaches.
>
> What do you guys suggest for solving this issue?
>
>
> *Vaibhav Nagpal*
> 9535433788
> 
>
> On Tue, Feb 23, 2016 at 1:50 PM, Gideon [via Apache Spark User List] <[hidden
> email] > wrote:
>
>> Regarding the spark streaming receiver - can't you just use Kafka direct
>> receivers with checkpoints? So when you restart your application it will
>> read where it last stopped and continue from there
>> Regarding limiting the number of messages - you can do that by setting
>> spark.streaming.receiver.maxRate. Read more about it here
>> 
>>
>
>
> [Attachment: Capture.JPG (222K)]
> 
> [Attachment: Capture1.JPG (169K)]
> 
>
>





Re: Read from kafka after application is restarted

2016-02-22 Thread Chitturi Padma
Hi Vaibhav,

 Please try the Kafka direct API approach. Is that not working for you?
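
A minimal sketch of the direct approach (the broker list and topic name are
placeholders, and ssc is your StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker list
val topics = Set("mytopic")                                       // placeholder topic

// Each record is a (key, message) pair; offsets are tracked by Spark rather than by a receiver.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).filter(_.nonEmpty).print()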

-- Padma Ch


On Tue, Feb 23, 2016 at 12:36 AM, vaibhavrtk1 [via Apache Spark User List] <
ml-node+s1001560n26291...@n3.nabble.com> wrote:

> Hi
>
> I am using kafka with spark streaming 1.3.0 . When the spark application
> is not running kafka is still receiving messages. When i start the
> application those messages which have already been received when spark was
> not running are not processed. I am using a unreliable receiver based
> approach.
>
> What can I do to process earlier messages also, which came while
> application was shut down?
>
> PS: If application was down for a long time can i also limit the max
> number of message consumed in one batch interval?
>
> Regards
> Vaibhav
>





Re: Does Spark satisfy my requirements?

2016-02-22 Thread Chitturi Padma
Hi,

When you say that you want to produce new information, do you mean that you
want to push the processed data to other consumers?
Spark is definitely a strong choice for real-time streaming computations.
Are you looking for near-real-time processing, or strictly real-time
processing?


On Sun, Feb 21, 2016 at 6:49 PM, ebrahim [via Apache Spark User List] <
ml-node+s1001560n26283...@n3.nabble.com> wrote:

> Dear all,
>
> I am involved in a project which aims to gather data of about 400 MB per
> second, process the received data, produce new information, and then
> forward them to other consumers. All this process should take about 200 ms.
>
> My question is that does Spark is a good choice for me considering the
> above mentioned scenario?
>
> Best,
> Ebrahim
>





Re: Job Opportunity in London

2015-03-30 Thread Chitturi Padma
Hi,

I am interested in this opportunity. I am working as a Research Engineer at
Impetus Technologies, Bangalore, India. In fact, we implemented distributed
deep learning on Spark; I will share my CV if you are interested.
Please see the link below:

http://www.signalprocessingsociety.org/newsletter/2015/02/implementing-a-distributed-deep-learning-network-over-spark/




On Mon, Mar 30, 2015 at 3:10 PM, janboehm [via Apache Spark User List] <
ml-node+s1001560n22290...@n3.nabble.com> wrote:

> Dear Spark-Enthusiasts,
> I am looking for a Research Assistant in Cloud-based Machine Learning to
> join my team at University College London. The main role will be to
> implement distributed algorithms for feature extraction, machine learning
> and classification for large-scale geo-spatial data. This post is funded by
> a collaborative EU project of prestigious European institutions. The
> details of the UCL job advert (Ref:1458193) are online at
> http://tinyurl.com/pzhndwm
>
> Kind Regards,
> Jan Boehm
> Senior Lecturer in Photogrammetry & 3D Imaging
> University College London
>
>





Re: spark 1.2 compatibility

2015-01-17 Thread Chitturi Padma
It worked for me: Spark 1.2.0 with Hadoop 2.2.0.

On Sat, Jan 17, 2015 at 9:39 PM, bhavyateja [via Apache Spark User List] <
ml-node+s1001560n21207...@n3.nabble.com> wrote:

> Hi all,
>
> Thanks for your contribution. We have checked and confirmed that HDP 2.1
> YARN don't work with Spark 1.2
>
> On Sat, Jan 17, 2015 at 9:11 AM, bhavya teja potineni <[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=21207&i=0>> wrote:
>
>> Hi
>>
>> Did you try using spark 1.2 on hdp 2.1 YARN
>>
>> Can you please go thru the thread
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Troubleshooting-Spark-tt21189.html
>> and check where I am going wrong. As my word count program is erroring
>> out when using spark 1.2 using YARN but its getting executed using spark
>> 0.9.1
>>
>> On Sat, Jan 17, 2015 at 5:55 AM, Chitturi Padma [via Apache Spark User
>> List] <[hidden email]
>> <http:///user/SendEmail.jtp?type=node&node=21207&i=1>> wrote:
>>
>>> Yes. I built spar 1.2 with apache hadoop 2.2. No compatibility issues.
>>>
>>> On Sat, Jan 17, 2015 at 4:47 AM, bhavyateja [via Apache Spark User List]
>>> <[hidden email] <http:///user/SendEmail.jtp?type=node&node=21202&i=0>>
>>> wrote:
>>>
>>>> Is spark 1.2 is compatibly with HDP 2.1
>>>>





Re: spark 1.2 compatibility

2015-01-17 Thread Chitturi Padma
Yes. I built Spark 1.2 against Apache Hadoop 2.2; no compatibility issues.

On Sat, Jan 17, 2015 at 4:47 AM, bhavyateja [via Apache Spark User List] <
ml-node+s1001560n21197...@n3.nabble.com> wrote:

> Is spark 1.2 is compatibly with HDP 2.1
>





Re: 1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-20 Thread Chitturi Padma
Hi,

I tried with try/catch blocks. In fact, inside mapPartitionsWithIndex a
method is invoked which does the operation. I wrapped the operations inside
that function in a try...catch block, but that was of no use; the error
persists. Even after commenting out all the operations, a simple print
statement inside the method is not executed. The data size is 542 MB, the
HDFS block size is 64 MB (so 9 blocks), and I used a 2-node cluster with
replication factor 2.

When I look at the logs, it seems that it tried to launch tasks on the
other node, but the TaskSetManager encountered a NullPointerException and
the job was aborted. Is this a problem with mapPartitionsWithIndex?

The same operations, when performed with a map transformation, executed
with no issues.
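
For comparison, the general shape of the call I have in mind, with a
hypothetical processLine() function; the closure has to return an Iterator
rather than consume it:

val result = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
  iter.map { line =>
    try {
      processLine(partitionIndex, line)                  // hypothetical per-record processing
    } catch {
      case e: Exception => s"failed: ${e.getMessage}"    // keep the task alive and surface the error in the output
    }
  }
}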


Please let me know if anyone has seen the same problem.

Thanks,
Padma Ch

On Fri, Nov 14, 2014 at 7:42 PM, Akhil [via Apache Spark User List] <
ml-node+s1001560n18936...@n3.nabble.com> wrote:

> It shows nullPointerException, your data could be corrupted? Try putting a
> try catch inside the operation that you are doing, Are you running the
> worker process on the master node also? If not, then only 1 node will be
> doing the processing. If yes, then try setting the level of parallelism and
> number of partitions while creating/transforming the RDD.
>
> Thanks
> Best Regards
>
> On Fri, Nov 14, 2014 at 5:17 PM, Priya Ch <[hidden email]
> > wrote:
>
>> Hi All,
>>
>>   We have set up 2 node cluster (NODE-DSRV05 and NODE-DSRV02) each is
>> having 32gb RAM and 1 TB hard disk capacity and 8 cores of cpu. We have set
>> up hdfs which has 2 TB capacity and the block size is 256 mb   When we try
>> to process 1 gb file on spark, we see the following exception
>>
>> 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 0.0 in
>> stage 0.0 (TID 0, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
>> 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 1.0 in
>> stage 0.0 (TID 1, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
>> 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 2.0 in
>> stage 0.0 (TID 2, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
>> 14/11/14 17:01:43 INFO cluster.SparkDeploySchedulerBackend: Registered
>> executor: 
>> Actor[akka.tcp://sparkExecutor@IMPETUS-DSRV02:41124/user/Executor#539551156]
>> with ID 0
>> 14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
>> manager NODE-DSRV05.impetus.co.in:60432 with 2.1 GB RAM
>> 14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
>> manager NODE-DSRV02:47844 with 2.1 GB RAM
>> 14/11/14 17:01:43 INFO network.ConnectionManager: Accepted connection
>> from [NODE-DSRV05.impetus.co.in/192.168.145.195:51447]
>> 14/11/14 17:01:43 INFO network.SendingConnection: Initiating connection
>> to [NODE-DSRV05.impetus.co.in/192.168.145.195:60432]
>> 14/11/14 17:01:43 INFO network.SendingConnection: Connected to [
>> NODE-DSRV05.impetus.co.in/192.168.145.195:60432], 1 messages pending
>> 14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_1_piece0
>> in memory on NODE-DSRV05.impetus.co.in:60432 (size: 17.1 KB, free: 2.1
>> GB)
>> 14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
>> in memory on NODE-DSRV05.impetus.co.in:60432 (size: 14.1 KB, free: 2.1
>> GB)
>> 14/11/14 17:01:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 0.0 (TID 0, NODE-DSRV05.impetus.co.in): java.lang.NullPointerException:
>> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
>> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
>>
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> org.apache.spark.scheduler.Task.run(Task.scala:54)
>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> java.lang.Thread.run(Thread.java:722)
>> 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.1 in
>> stage 0.0 (TID 3, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
>> 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 1.0 in stage
>> 0.0 (TID 1) on executor NODE-DSRV05.impetus.co.in:
>> java.lang.NullPointerException (null) [duplicate 1]
>> 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.0 in stage
>> 0.0 (TID 2) on executor NODE-DSRV05.impetus.co.in:
>> java.lang.NullPointer

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
As you are using sbt, you need not put the jar under the Maven ~/.m2
repository. Include the jar explicitly with the --driver-class-path option
while submitting your jar to the Spark cluster.

On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User
List]  wrote:

> It's still not working. Keep getting the same error.
>
> I even deleted the commons-math3/* folder containing the jar. And then
> under the directory "org/apache/commons/" made a folder called 'math3'
> and put the commons-math3-3.3.jar in it.
> Still its not working.
>
> I also tried sc.addJar("/path/to/jar") within spark-shell and in my
> project sourcefile
> It still didn't import the jar at both locations.
> More
>
> Any fixes? Please help
>
> On Mon, Nov 17, 2014 at 2:14 PM, Chitturi Padma <[hidden email]
> <http://user/SendEmail.jtp?type=node&node=19073&i=0>> wrote:
>
>> Include the commons-math3/3.3 in class path while submitting jar to spark
>> cluster. Like..
>> spark-submit --driver-class-path maths3.3jar --class MainClass --master
>> spark cluster url appjar
>>
>> On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark
>> User List] <[hidden email]
>> <http://user/SendEmail.jtp?type=node&node=19057&i=0>> wrote:
>>
>>> My sbt file for the project includes this:
>>>
>>> libraryDependencies ++= Seq(
>>> "org.apache.spark"  %% "spark-core"  % "1.1.0",
>>> "org.apache.spark"  %% "spark-mllib" % "1.1.0",
>>> "org.apache.commons" % "commons-math3" % "3.3"
>>> )
>>> =
>>>
>>> Still I am getting this error:
>>>
>>> >java.lang.NoClassDefFoundError:
>>> org/apache/commons/math3/random/RandomGenerator
>>>
>>> =
>>>
>>> The jar at location:
>>> ~/.m2/repository/org/apache/commons/commons-math3/3.3 contains the random
>>> generator class:
>>>
>>>  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
>>> >org/apache/commons/math3/random/RandomGenerator.class
>>> >org/apache/commons/math3/random/UniformRandomGenerator.class
>>> >org/apache/commons/math3/random/SynchronizedRandomGenerator.class
>>> >org/apache/commons/math3/random/AbstractRandomGenerator.class
>>> >org/apache/commons/math3/random/RandomGeneratorFactory$1.class
>>> >org/apache/commons/math3/random/RandomGeneratorFactory.class
>>> >org/apache/commons/math3/random/StableRandomGenerator.class
>>> >org/apache/commons/math3/random/NormalizedRandomGenerator.class
>>> >org/apache/commons/math3/random/JDKRandomGenerator.class
>>> >org/apache/commons/math3/random/GaussianRandomGenerator.class
>>>
>>>
>>> Please help
>>>
>>>

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
Include commons-math3 3.3 on the classpath while submitting the jar to the
Spark cluster, e.g.:
spark-submit --driver-class-path /path/to/commons-math3-3.3.jar \
  --class MainClass --master <spark-master-url> app.jar

On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User
List]  wrote:

> My sbt file for the project includes this:
>
> libraryDependencies ++= Seq(
> "org.apache.spark"  %% "spark-core"  % "1.1.0",
> "org.apache.spark"  %% "spark-mllib" % "1.1.0",
> "org.apache.commons" % "commons-math3" % "3.3"
> )
> =
>
> Still I am getting this error:
>
> >java.lang.NoClassDefFoundError:
> org/apache/commons/math3/random/RandomGenerator
>
> =
>
> The jar at location: ~/.m2/repository/org/apache/commons/commons-math3/3.3
> contains the random generator class:
>
>  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
> >org/apache/commons/math3/random/RandomGenerator.class
> >org/apache/commons/math3/random/UniformRandomGenerator.class
> >org/apache/commons/math3/random/SynchronizedRandomGenerator.class
> >org/apache/commons/math3/random/AbstractRandomGenerator.class
> >org/apache/commons/math3/random/RandomGeneratorFactory$1.class
> >org/apache/commons/math3/random/RandomGeneratorFactory.class
> >org/apache/commons/math3/random/StableRandomGenerator.class
> >org/apache/commons/math3/random/NormalizedRandomGenerator.class
> >org/apache/commons/math3/random/JDKRandomGenerator.class
> >org/apache/commons/math3/random/GaussianRandomGenerator.class
>
>
> Please help
>
>





Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Chitturi Padma
which means the application and worker details are not persisted, and hence
after a failure of the master or a worker the restarted daemons wouldn't
recover that state normally... right?

On Wed, Oct 15, 2014 at 12:17 PM, Prashant Sharma [via Apache Spark User
List]  wrote:

> [Removing dev lists]
>
> You are absolutely correct about that.
>
> Prashant Sharma
>
>
>
> On Tue, Oct 14, 2014 at 5:03 PM, Priya Ch <[hidden email]
> > wrote:
>
>> Hi Spark users/experts,
>>
>> In Spark source code  (Master.scala & Worker.scala), when  registering
>> the worker with master, I see the usage of *persistenceEngine*. When we
>> don't specify spark.deploy.recovery mode explicitly, what is the default
>> value used ? This recovery mode is used to persists and restore the
>> application & worker details.
>>
>>  I see when recovery mode not specified explicitly,
>> *BlackHolePersistenceEngine* being used. Am i right ?
>>
>>
>> Thanks,
>> Padma Ch
>>
>
>
>





Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
I couldn't even see the spark-* folder in the default /tmp directory used
for spark.local.dir.
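
For reference, a minimal sketch of the setup I am describing (paths are
placeholders). Two caveats I am aware of: spark.local.dir has to be set
before the SparkContext is created, and a cluster manager's
SPARK_LOCAL_DIRS / LOCAL_DIRS environment variables override it on the
workers:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("persist-demo")
  .set("spark.local.dir", "/home/padma/sparkdir")   // placeholder; ignored if set after the context exists

val sc = new SparkContext(conf)
val rdd = sc.textFile("hdfs:///path/to/input")      // placeholder input
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()                                         // forces the partitions to be computed and the blocks written under the local dirs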

On Tue, Sep 23, 2014 at 6:01 PM, Priya Ch 
wrote:

> Is it possible to view the persisted RDD blocks ?
>
> If I use YARN, RDD blocks would be persisted to hdfs then will i be able
> to read the hdfs blocks as i could do in hadoop ?
>
> On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List]
>  wrote:
>
>>  Hi,
>>
>>
>>
>> Spark.local.dir is the one used to write map output data and persistent
>> RDD blocks, but the path of  file has been hashed, so you cannot directly
>> find the persistent rdd block files, but definitely it will be in this
>> folders on your worker node.
>>
>>
>>
>> Thanks
>>
>> Jerry
>>
>>
>>
>> *From:* Priya Ch [mailto:[hidden email]
>> ]
>> *Sent:* Tuesday, September 23, 2014 6:31 PM
>> *To:* [hidden email] ;
>> [hidden email] 
>> *Subject:* spark.local.dir and spark.worker.dir not used
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am using spark 1.0.0. In my spark code i m trying to persist an rdd to
>> disk as rrd.persist(DISK_ONLY). But unfortunately couldn't find the
>> location where the rdd has been written to disk. I specified
>> SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than
>> using the default /tmp directory, but still couldnt see anything in worker
>> directory andspark ocal directory.
>>
>>
>>
>> I also tried specifying the local dir and worker dir from the spark code
>> while defining the SparkConf as conf.set("spark.local.dir",
>> "/home/padma/sparkdir") but the directories are not used.
>>
>>
>>
>>
>>
>> In general which directories spark would be using for map output files,
>> intermediate writes and persisting rdd to disk ?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Padma Ch
>>
>>
>
>





Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
Is it possible to view the persisted RDD blocks?

If I use YARN, would the RDD blocks be persisted to HDFS, and would I then
be able to read those HDFS blocks as I can in Hadoop?

On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List] <
ml-node+s1001560n14885...@n3.nabble.com> wrote:

>  Hi,
>
>
>
> Spark.local.dir is the one used to write map output data and persistent
> RDD blocks, but the path of  file has been hashed, so you cannot directly
> find the persistent rdd block files, but definitely it will be in this
> folders on your worker node.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Priya Ch [mailto:[hidden email]
> ]
> *Sent:* Tuesday, September 23, 2014 6:31 PM
> *To:* [hidden email] ;
> [hidden email] 
> *Subject:* spark.local.dir and spark.worker.dir not used
>
>
>
> Hi,
>
>
>
> I am using spark 1.0.0. In my spark code i m trying to persist an rdd to
> disk as rrd.persist(DISK_ONLY). But unfortunately couldn't find the
> location where the rdd has been written to disk. I specified
> SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than
> using the default /tmp directory, but still couldnt see anything in worker
> directory andspark ocal directory.
>
>
>
> I also tried specifying the local dir and worker dir from the spark code
> while defining the SparkConf as conf.set("spark.local.dir",
> "/home/padma/sparkdir") but the directories are not used.
>
>
>
>
>
> In general which directories spark would be using for map output files,
> intermediate writes and persisting rdd to disk ?
>
>
>
>
>
> Thanks,
>
> Padma Ch
>
>





Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Chitturi Padma
Hi,

I have a similar problem. I need matrix operations such as dot product,
cross product, transpose and matrix multiplication to be performed on
Spark. Does Spark have built-in APIs to support these?
I see a matrix factorization implementation in MLlib.
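
For concreteness, a minimal sketch of what I can find in MLlib's
distributed linear algebra so far (sc is the SparkContext, and the small
local matrix is a placeholder): RowMatrix supports multiplication by a
local Matrix, the Gramian, and SVD/PCA, but a general distributed transpose
or distributed-by-distributed multiply does not seem to be exposed directly.

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))                                   // placeholder data
val mat = new RowMatrix(rows)

val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))   // 2x2 identity, values in column-major order
val product = mat.multiply(local)                             // distributed RowMatrix times local Matrix
val gram = mat.computeGramianMatrix()                         // A^T * A, returned as a local Matrix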


On Fri, Aug 8, 2014 at 12:38 PM, yaochunnan [via Apache Spark User List] <
ml-node+s1001560n11765...@n3.nabble.com> wrote:

> I think the eigenvalues and eigenvectors you are talking about is that of
> M^T*M or M*M^T, if we get M=U*s*V^T as SVD. What I want is to get
> eigenvectors and eigenvalues of M itself. Is this my misunderstanding of
> linear algebra or the API?
>
> M^{*} M = V \Sigma^{*} U^{*} U \Sigma V^{*} = V (\Sigma^{*} \Sigma) V^{*}
> M M^{*} = U \Sigma V^{*} V \Sigma^{*} U^{*} = U (\Sigma \Sigma^{*}) U^{*}
>
>
>
> 2014-08-08 11:19 GMT+08:00 x <[hidden email]
> >:
>
> U.rows.toArray.take(1).foreach(println) and 
> V.toArray.take(s.size).foreach(println)
>> are not eigenvectors corresponding to the biggest eigenvalue
>> s.toArray(0)*s.toArray(0)?
>>
>> xj @ Tokyo
>>
>>
>> On Fri, Aug 8, 2014 at 12:07 PM, Chunnan Yao <[hidden email]
>> > wrote:
>>
>>> Hi there, what you've suggested are all meaningful. But to make myself
>>> clearer, my essential problems are:
>>> 1. My matrix is asymmetric, and it is a probabilistic adjacency matrix,
>>> whose entries(a_ij) represents the likelihood that user j will broadcast
>>> the information generated by user i. Apparently, a_ij and a_ji is
>>> different, caus I love you doesn't necessarily mean you love me(What a sad
>>> story~). All entries are real.
>>> 2. I know I can get eigenvalues through SVD. My problem is I can't get
>>> the corresponding eigenvectors, which requires solving equations, and I
>>> also need eigenvectors in my calculation.In my simulation of this paper, I
>>> only need the biggest eigenvalues and corresponding eigenvectors.
>>> The paper posted by Shivaram Venkataraman is also concerned about
>>> symmetric matrix. Could any one help me out?
>>>
>>>
>>> 2014-08-08 9:41 GMT+08:00 x <[hidden email]
>>> >:
>>>
>>>  The SVD computed result already contains descending order of singular
 values, you can get the biggest eigenvalue.

 ---

   val svd = matrix.computeSVD(matrix.numCols().toInt, computeU = true)
   val U: RowMatrix = svd.U
   val s: Vector = svd.s
   val V: Matrix = svd.V

   U.rows.toArray.take(1).foreach(println)

   println(s.toArray(0)*s.toArray(0))

   println(V.toArray.take(s.size).foreach(println))

 ---

 xj @ Tokyo


 On Fri, Aug 8, 2014 at 3:06 AM, Shivaram Venkataraman <[hidden email]
 > wrote:

> If you just want to find the top eigenvalue / eigenvector you can do
> something like the Lanczos method. There is a description of a MapReduce
> based algorithm in Section 4.2 of [1]
>
> [1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
>
>
> On Thu, Aug 7, 2014 at 10:54 AM, Li Pu <[hidden email]
> > wrote:
>
>> @Miles, the latest SVD implementation in mllib is partially
>> distributed. Matrix-vector multiplication is computed among all workers,
>> but the right singular vectors are all stored in the driver. If your
>> symmetric matrix is n x n and you want the first k eigenvalues, you will
>> need to fit n x k doubles in driver's memory. Behind the scene, it calls
>> ARPACK to compute eigen-decomposition of A^T A. You can look into the
>> source code for the details.
>>
>> @Sean, the SVD++ implementation in graphx is not the canonical
>> definition of SVD. It doesn't have the orthogonality that SVD holds. But 
>> we
>> might want to use graphx as the underlying matrix representation for
>> mllib.SVD to address the problem of skewed entry distribution.
>>
>>
>> On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks <[hidden email]
>> > wrote:
>>
>>> Reza Zadeh has contributed the distributed implementation of
>>> (Tall/Skinny) SVD (
>>> http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
>>> which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in 
>>> Spark
>>> 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your
>>> data is sparse (which it often is in social networks), you may have 
>>> better
>>> luck with this.
>>>
>>> I haven't tried the GraphX implementation, but those algorithms are
>>> often well-suited for power-law distributed graphs as you might see in
>>> social networks.
>>>
>>> FWIW, I believe you need to square elements of

Standalone cluster on Windows

2014-07-08 Thread Chitturi Padma
Hi, 

I wanted to set up a standalone cluster on a Windows machine, but
unfortunately the spark-master.cmd file is not available. Can someone
suggest how to proceed, or is spark-master.cmd missing from the Spark 1.0.0
release?


