Re: MLLib SVMWithSGD is failing for large dataset

2016-07-07 Thread Chitturi Padma
Hi Sarath,

By any chance, have you resolved this issue?

Thanks,
Padma CH
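
For context, the call under discussion typically looks like the minimal sketch
below (this is not Sarath's code; the input path, partition count and iteration
count are illustrative, and LibSVM-format sparse features are assumed, which
matters with a 20-million-wide feature space):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.util.MLUtils

  object SVMSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("svm-sketch"))

      // loadLibSVMFile yields an RDD[LabeledPoint] backed by sparse vectors,
      // so only the non-zero features of each point are held in memory.
      val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training.libsvm") // hypothetical path
        .repartition(512)  // illustrative; spreads the points across executors
        .cache()

      val model = SVMWithSGD.train(data, 100) // 100 iterations, illustrative
      println(s"weight vector size: ${model.weights.size}")

      sc.stop()
    }
  }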

On Tue, Apr 28, 2015 at 11:20 PM, sarath [via Apache Spark User List] <
ml-node+s1001560n22694...@n3.nabble.com> wrote:

>
> I am trying to train a model on a large dataset consisting of 8 million data
> points and 20 million features using SVMWithSGD, but it is failing after
> running for some time. I tried increasing num-partitions, driver-memory,
> executor-memory and driver-max-resultSize. I also tried reducing the size of
> the dataset from 8 million to 25K points (keeping the number of features the
> same at 20 M), but after using the entire 64 GB of driver memory for 20 to 30
> minutes it failed.
>
> I'm using a cluster of 8 nodes (each with 8 cores and 64G RAM).
> executor-memory - 60G
> driver-memory - 60G
> num-executors - 64
> And other default settings
>
> This is the error log :
>
> 15/04/20 11:51:09 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/04/20 11:51:29 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 15/04/20 11:51:29 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> 15/04/20 11:56:11 WARN TransportChannelHandler: Exception in connection
> from xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> ...
> 15/04/20 11:56:11 ERROR TransportResponseHandler: Still have 7 requests
> outstanding when connection from xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029 is
> closed
> 15/04/20 11:56:11 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> ...
> 15/04/20 11:56:11 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> ...
> 15/04/20 11:56:12 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to
> xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:149)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:290)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
>
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
> at
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
>
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused:
> xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>

Re: Spark work distribution among execs

2016-03-15 Thread Chitturi Padma
By default, Spark on YARN uses 2 executors with one core each. Have you
allocated more executors using command-line arguments such as
--num-executors 25 --executor-cores <cores per executor>?

What do you mean by "the difference between the nodes is huge"?

Regards,
Padma Ch
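
For what it's worth, a minimal sketch of the equivalent settings expressed as
SparkConf properties (the values are illustrative; on YARN these are more
commonly passed as spark-submit flags):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("json-to-parquet")
    .set("spark.executor.instances", "25") // same effect as --num-executors 25
    .set("spark.executor.cores", "4")      // same effect as --executor-cores 4
    .set("spark.executor.memory", "8g")    // same effect as --executor-memory 8g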

On Tue, Mar 15, 2016 at 6:57 PM, bkapukaranov [via Apache Spark User List] <
ml-node+s1001560n2650...@n3.nabble.com> wrote:

> Hi,
>
> I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
> I observe a very strange issue.
> I run a simple job that reads about 1TB of json logs from a remote HDFS
> cluster and converts them to parquet, then saves them to the local HDFS of
> the Hadoop cluster.
>
> I run it with 25 executors with sufficient resources. However the strange
> thing is that the job only uses 2 executors to do most of the read work.
>
> For example when I go to the Executors' tab in the Spark UI and look at
> the "Input" column, the difference between the nodes is huge, sometimes 20G
> vs 120G.
>
> 1. What is the cause for this behaviour?
> 2. Any ideas how to achieve a more balanced performance?
>
> Thanks,
> Borislav
>

Re: OOM Exception in my spark streaming application

2016-03-14 Thread Chitturi Padma
Something like below:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
  at org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
  at org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
  at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
  at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
  at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:324)
  at org.apache.spark.io.SnappyOutputStreamWrapper.close(CompressionCodec.scala:203)
  at com.esotericsoftware.kryo.io.Output.close(Output.java:168)
  at org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)


On Mon, Mar 14, 2016 at 8:35 PM, adamreith [via Apache Spark User List] <
ml-node+s1001560n26483...@n3.nabble.com> wrote:

> What do you mean? I've pasted the output in the same format used by Spark...
>

Re: OOM Exception in my spark streaming application

2016-03-14 Thread Chitturi Padma
Hi,

 Can you please show the stack trace line by line? It is a bit difficult to
read it as one paragraph and make sense of it.

On Mon, Mar 14, 2016 at 3:11 PM, adamreith [via Apache Spark User List] <
ml-node+s1001560n26479...@n3.nabble.com> wrote:

> Hi,
>
> I'm using Spark 1.4.1 and I have a simple application that creates a DStream
> that reads data from Kafka and applies a filter transformation to it.
> After more or less a day it throws the following exception:
>
> *Exception in thread "dag-scheduler-event-loop"
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
> at
> org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
> at
> org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
> at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
> at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:324)
> at
> org.apache.spark.io.SnappyOutputStreamWrapper.close(CompressionCodec.scala:203)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:168) at
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
> at
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
> at
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1291) at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
> at
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1426)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) [Stage
> 53513:>  (0 + 0) /
> 4]Exception in thread "JobGenerator" java.lang.OutOfMemoryError: GC
> overhead limit exceeded at
> sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41) at
> java.net.URL.openConnection(URL.java:972) at
> java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:237) at
> java.lang.Class.getResourceAsStream(Class.java:2223) at
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:38)
> at
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:98)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at
> org.apache.spark.SparkContext.clean(SparkContext.scala:1893) at
> org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at
> org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at
> org.apache.spark.rdd.RDD.map(RDD.scala:293) at
> org.apache.spark.streaming.dstream.MappedDStream$$anonfun$compute$1.apply(MappedDStream.scala:35)
> at
> org.apache.spark.streaming.dstream.MappedDStream$$anonfun$compute$1.apply(MappedDStream.scala:35)
> at scala.Option.map(Option.scala:145) at
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at 

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Chitturi Padma
Hi All,

  I am facing the same issue. Taking k values from 60 to 120, incrementing by
10 each time (i.e. k takes the values 60, 70, 80, ..., 120), the algorithm
takes around 2.5 hours on an 800 MB data set with 38 dimensions.
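
A rough sketch of the loop being described (the path and parameters are
illustrative, and sc is assumed to be an existing SparkContext, e.g. in
spark-shell); caching the parsed vectors once avoids re-reading the 800 MB
input for every value of k:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val vectors = sc.textFile("hdfs:///path/to/data.csv")                 // hypothetical path
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .cache()

  for (k <- 60 to 120 by 10) {
    val model = KMeans.train(vectors, k, 20)                            // 20 iterations, illustrative
    println(s"k=$k cost=${model.computeCost(vectors)}")
  }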

On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] <
ml-node+s1001560n2227...@n3.nabble.com> wrote:

> Hi Jao,
>
> Sorry to pop up this old thread. I have the same problem you did.
> I want to know if you have figured out how to improve k-means on Spark.
>
> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
> cluster has 7 executors, each has 8 cores...
>
> If I set k=5000 which is the required value for my task, the job goes on
> forever...
>
>
> Thanks,
> David
>
>

Re: forgetfulness in clustering algorithm

2016-03-12 Thread Chitturi Padma
Hi,

  I am interested in the streaming k-means algorithm and its forgetfulness
parameter. Can someone please throw some light on this?
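
For context, a minimal sketch of where the forgetfulness shows up in the API
(StreamingKMeans, available since Spark 1.2; the values are illustrative and
vectorStream is a hypothetical DStream[Vector]):

  import org.apache.spark.mllib.clustering.StreamingKMeans

  val model = new StreamingKMeans()
    .setK(5)
    .setDecayFactor(0.5)            // 1.0 = remember all history, 0.0 = use only the latest batch
    // .setHalfLife(10, "batches")  // alternative: old data's weight halves every 10 batches
    .setRandomCenters(38, 0.0)      // 38 dimensions, zero initial weight

  // model.trainOn(vectorStream)    // update the centres as new batches arrive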

On Wed, Jul 29, 2015 at 11:23 AM, AmmarYasir [via Apache Spark User List] <
ml-node+s1001560n24050...@n3.nabble.com> wrote:

>
> I read the blog post on clustering, "Introducing streaming k-means in
> Spark 1.2", at https://databricks.com/blog/.
>
> I was interested in some detail regarding the "forgetfulness" used in these
> clustering techniques. Can anyone suggest material, e.g. research papers or
> articles, that would be worthwhile?
>

Re: rdd.collect.foreach() vs rdd.collect.map()

2016-02-24 Thread Chitturi Padma
If you want to do the processing in parallel, never use collect() or any other
action such as count() or first(): they compute the result and bring it back
to the driver. rdd.map does the processing in parallel on the workers; once
you have the processed RDD, save it to the DB.

 rdd.foreach also executes on the workers; in fact, it returns Unit.
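
A minimal sketch of the difference, assuming an existing rdd: RDD[String]
(e.g. in spark-shell); saveToDb and process are hypothetical stand-ins:

  // hypothetical helpers: replace with the real processing and DB write
  def process(x: String): String = x.toUpperCase
  def saveToDb(x: String): Unit = println(s"saving: $x")

  // Driver-side: collect() first materialises the whole RDD in driver memory,
  // so the foreach below is a plain local loop, not a distributed Spark job.
  rdd.collect().foreach(x => saveToDb(process(x)))

  // Worker-side: map and foreach run as tasks on the executors, in parallel
  // across partitions; nothing is shipped back to the driver.
  rdd.map(process).foreach(saveToDb)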



On Wed, Feb 24, 2016 at 11:56 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26325...@n3.nabble.com> wrote:

> @Chitturi-Thanks a lot for replying
>
> 2 followup questions :
>
> 1. what if I am not collecting Rdd, then will Rdd.foreach() and Rdd.map()
> do processing in parallel ?
>
>
> 2. Let's say I have to get the results first and then do something before
> saving them into database. But I want to do that in parallel? How should I
> do it ? I am using Rdd.collect().foreach(), but it is not doing
> processing in parallel.
>
> Regards
> Anurag
>

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-24 Thread Chitturi Padma
Hi,

 I didn't get the point you are trying to make, i.e. "distribute computation
across nodes by restricting parallelism on each node". Do you mean you are
expecting only one task to run per node?
Can you please paste the configuration changes you made?

On Wed, Feb 24, 2016 at 11:24 PM, firemonk91 [via Apache Spark User List] <
ml-node+s1001560n26323...@n3.nabble.com> wrote:

> Can you paste the logs as well.
>

Re: rdd.collect.foreach() vs rdd.collect.map()

2016-02-24 Thread Chitturi Padma
rdd.collect() does not do any of the subsequent processing on the workers: it
brings the entire RDD back to the driver as an in-memory collection.

On Wed, Feb 24, 2016 at 10:58 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26320...@n3.nabble.com> wrote:

> Hi Everyone
>
> I am new to Scala and Spark.
>
> I want to know
>
> 1. does Rdd.collect().foreach() do processing in parallel?
>
> 2. does Rdd.collect().map() do processing in parallel ?
>
> Thanks in advance.
> Regards
> Anurag
>

Re: Read from kafka after application is restarted

2016-02-23 Thread Chitturi Padma
Hi Vaibhav,

  As you said, from the second link I can figure out that it is not able to
cast the class when reading from the checkpoint. Can you try an explicit cast,
such as asInstanceOf[T], on the broadcast value?

From the bug, it looks like it affects version 1.5. Try a sample word-count
program with the same Spark 1.3 version and try to reproduce the error. If you
can, change the Spark version to 1.4 or 1.5 and see whether the issue is still
there; if not, it has possibly been fixed in a later version.

As the direct API was only introduced in Spark 1.3, it is likely to still have
bugs.
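
For reference, a minimal sketch of the direct-API-plus-checkpoint pattern being
discussed (Spark 1.3-era API, requires the spark-streaming-kafka artifact;
broker list, topic and checkpoint path are placeholders). Note that broadcast
variables referenced inside the DStream operations are not recovered from a
checkpoint, which matches what the errors in this thread point at:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///tmp/checkpoints")                    // hypothetical path

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.map(_._2).print()                                     // message values only
    ssc
  }

  // On restart, the offsets stored in the checkpoint are used, so messages
  // that arrived while the application was down are picked up from where it
  // left off.
  val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", createContext _)
  ssc.start()
  ssc.awaitTermination()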

On Tue, Feb 23, 2016 at 5:09 PM, vaibhavrtk1 [via Apache Spark User List] <
ml-node+s1001560n26304...@n3.nabble.com> wrote:

> Hello
>
> I have tried the direct API, but I am getting an error which is
> being tracked here: https://issues.apache.org/jira/browse/SPARK-5594
>
> I also tried using Receiver approach with Write Ahead Logs ,then this
> issue comes
> https://issues.apache.org/jira/browse/SPARK-12407
>
> In both cases it seems it is not able to get the broadcasted variable from
> checkpoint directory.
> Attached is the screenshot of errors I faced with both approaches.
>
> What do you guys suggest for solving this issue?
>
>
> *Vaibhav Nagpal*
> 9535433788
> 
>
> On Tue, Feb 23, 2016 at 1:50 PM, Gideon [via Apache Spark User List] <[hidden
> email] > wrote:
>
>> Regarding the spark streaming receiver - can't you just use Kafka direct
>> receivers with checkpoints? So when you restart your application it will
>> read where it last stopped and continue from there
>> Regarding limiting the number of messages - you can do that by setting
>> spark.streaming.receiver.maxRate. Read more about it here
>> 
>>
>
>
> *Capture.JPG* (222K) Download Attachment
> 
> *Capture1.JPG* (169K) Download Attachment
> 
>
>

Re: Read from kafka after application is restarted

2016-02-22 Thread Chitturi Padma
Hi Vaibhav,

 Please try the Kafka direct API approach. Is that not working for you?

-- Padma Ch


On Tue, Feb 23, 2016 at 12:36 AM, vaibhavrtk1 [via Apache Spark User List] <
ml-node+s1001560n26291...@n3.nabble.com> wrote:

> Hi
>
> I am using Kafka with Spark Streaming 1.3.0. When the Spark application
> is not running, Kafka is still receiving messages. When I start the
> application, the messages that were received while Spark was not running
> are not processed. I am using an unreliable receiver-based approach.
>
> What can I do to also process the earlier messages, which came while the
> application was shut down?
>
> PS: If the application was down for a long time, can I also limit the maximum
> number of messages consumed in one batch interval?
>
> Regards
> Vaibhav
>

Re: Does Spark satisfy my requirements?

2016-02-22 Thread Chitturi Padma
Hi,

When you say that you want to produce new information, are you looking to
push the processed data to other consumers?
Spark is definitely a good choice for real-time streaming computations.
Are you looking for near-real-time processing or strictly real-time
processing?
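
(Purely illustrative: Spark Streaming is micro-batch based, so "real time"
here means "one batch every N milliseconds"; whether a 200 ms end-to-end
budget is achievable depends entirely on the workload and the cluster.)

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Milliseconds, StreamingContext}

  val conf = new SparkConf().setAppName("near-real-time-sketch")
  val ssc = new StreamingContext(conf, Milliseconds(200))   // 200 ms batch interval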


On Sun, Feb 21, 2016 at 6:49 PM, ebrahim [via Apache Spark User List] <
ml-node+s1001560n26283...@n3.nabble.com> wrote:

> Dear all,
>
> I am involved in a project which aims to gather about 400 MB of data per
> second, process the received data, produce new information, and then
> forward it to other consumers. This whole process should take about 200 ms.
>
> My question is: is Spark a good choice for me, considering the
> above-mentioned scenario?
>
> Best,
> Ebrahim
>

Re: Job Opportunity in London

2015-03-30 Thread Chitturi Padma
Hi,

I am interested in this opportunity. I am working as a Research Engineer at
Impetus Technologies, Bangalore, India. In fact, we implemented distributed
deep learning on Spark. I will share my CV if you are interested.
Please visit the link below:

http://www.signalprocessingsociety.org/newsletter/2015/02/implementing-a-distributed-deep-learning-network-over-spark/




On Mon, Mar 30, 2015 at 3:10 PM, janboehm [via Apache Spark User List] 
ml-node+s1001560n22290...@n3.nabble.com wrote:

 Dear Spark-Enthusiasts,
 I am looking for a Research Assistant in Cloud-based Machine Learning to
 join my team at University College London. The main role will be to
 implement distributed algorithms for feature extraction, machine learning
 and classification for large-scale geo-spatial data. This post is funded by
 a collaborative EU project of prestigious European institutions. The
 details of the UCL job advert (Ref:1458193) are online at
 http://tinyurl.com/pzhndwm

 Kind Regards,
 Jan Boehm
 Senior Lecturer in Photogrammetry & 3D Imaging
 University College London



Re: spark 1.2 compatibility

2015-01-17 Thread Chitturi Padma
Yes, I built Spark 1.2 with Apache Hadoop 2.2. No compatibility issues.

On Sat, Jan 17, 2015 at 4:47 AM, bhavyateja [via Apache Spark User List] 
ml-node+s1001560n21197...@n3.nabble.com wrote:

 Is Spark 1.2 compatible with HDP 2.1?


Re: spark 1.2 compatibility

2015-01-17 Thread Chitturi Padma
It worked for me: Spark 1.2.0 with Hadoop 2.2.0.

On Sat, Jan 17, 2015 at 9:39 PM, bhavyateja [via Apache Spark User List] 
ml-node+s1001560n21207...@n3.nabble.com wrote:

 Hi all,

 Thanks for your contribution. We have checked and confirmed that HDP 2.1
 YARN doesn't work with Spark 1.2.

 On Sat, Jan 17, 2015 at 9:11 AM, bhavya teja potineni [hidden email]
 http:///user/SendEmail.jtp?type=nodenode=21207i=0 wrote:

 Hi

 Did you try using Spark 1.2 on HDP 2.1 YARN?

 Can you please go through the thread

 http://apache-spark-user-list.1001560.n3.nabble.com/Troubleshooting-Spark-tt21189.html
 and check where I am going wrong? My word-count program errors out when
 using Spark 1.2 on YARN, but it runs fine with Spark
 0.9.1.

 On Sat, Jan 17, 2015 at 5:55 AM, Chitturi Padma [via Apache Spark User
 List] [hidden email]
 http:///user/SendEmail.jtp?type=nodenode=21207i=1 wrote:

 Yes. I built spar 1.2 with apache hadoop 2.2. No compatibility issues.

 On Sat, Jan 17, 2015 at 4:47 AM, bhavyateja [via Apache Spark User List]
 [hidden email] http:///user/SendEmail.jtp?type=nodenode=21202i=0
 wrote:

 Is spark 1.2 is compatibly with HDP 2.1


Re: 1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-20 Thread Chitturi Padma
Hi,

I tried with try/catch blocks. In fact, inside mapPartitionsWithIndex a method
is invoked which does the operation. I put the operations inside that function
in a try...catch block, but that was of no use: the error still persists. Even
when I commented out all the operations, a simple print statement inside the
method was not executed. The data size is 542 MB, the HDFS block size is
64 MB, so it has 9 blocks, and I used a 2-node cluster with replication
factor 2.

When I look at the logs, it seems to me like it tried to launch tasks on the
other node, but the TaskSetManager encountered a NullPointerException and the
job was aborted. Is this a problem with mapPartitionsWithIndex?

The same operations, when performed with a map transformation, executed with
no issues.


Please let me know if anyone else has faced the same problem.

Thanks,
Padma Ch
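
For reference, a minimal sketch of the pattern described above (process is a
stand-in for the real per-record work; everything else is illustrative):

  import org.apache.spark.rdd.RDD

  def process(idx: Int, line: String): String = s"$idx:${line.trim}"   // stand-in for the real work

  def withIndex(rdd: RDD[String]): RDD[String] =
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.flatMap { line =>
        try {
          Some(process(idx, line))
        } catch {
          case e: Exception =>
            // surface the bad record instead of letting the task die
            System.err.println(s"partition $idx: skipping record: ${e.getMessage}")
            None
        }
      }
    }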

On Fri, Nov 14, 2014 at 7:42 PM, Akhil [via Apache Spark User List] 
ml-node+s1001560n18936...@n3.nabble.com wrote:

 It shows a NullPointerException; could your data be corrupted? Try putting a
 try/catch inside the operation that you are doing. Are you running the
 worker process on the master node also? If not, then only 1 node will be
 doing the processing. If yes, then try setting the level of parallelism and
 the number of partitions while creating/transforming the RDD.

 Thanks
 Best Regards

 On Fri, Nov 14, 2014 at 5:17 PM, Priya Ch [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18936i=0 wrote:

 Hi All,

   We have set up 2 node cluster (NODE-DSRV05 and NODE-DSRV02) each is
 having 32gb RAM and 1 TB hard disk capacity and 8 cores of cpu. We have set
 up hdfs which has 2 TB capacity and the block size is 256 mb   When we try
 to process 1 gb file on spark, we see the following exception

 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 0.0 in
 stage 0.0 (TID 0, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 1.0 in
 stage 0.0 (TID 1, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 2.0 in
 stage 0.0 (TID 2, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:43 INFO cluster.SparkDeploySchedulerBackend: Registered
 executor: 
 Actor[akka.tcp://sparkExecutor@IMPETUS-DSRV02:41124/user/Executor#539551156]
 with ID 0
 14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
 manager NODE-DSRV05.impetus.co.in:60432 with 2.1 GB RAM
 14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
 manager NODE-DSRV02:47844 with 2.1 GB RAM
 14/11/14 17:01:43 INFO network.ConnectionManager: Accepted connection
 from [NODE-DSRV05.impetus.co.in/192.168.145.195:51447]
 14/11/14 17:01:43 INFO network.SendingConnection: Initiating connection
 to [NODE-DSRV05.impetus.co.in/192.168.145.195:60432]
 14/11/14 17:01:43 INFO network.SendingConnection: Connected to [
 NODE-DSRV05.impetus.co.in/192.168.145.195:60432], 1 messages pending
 14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_1_piece0
 in memory on NODE-DSRV05.impetus.co.in:60432 (size: 17.1 KB, free: 2.1
 GB)
 14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
 in memory on NODE-DSRV05.impetus.co.in:60432 (size: 14.1 KB, free: 2.1
 GB)
 14/11/14 17:01:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
 0.0 (TID 0, NODE-DSRV05.impetus.co.in): java.lang.NullPointerException:
 org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
 org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)

 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 java.lang.Thread.run(Thread.java:722)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.1 in
 stage 0.0 (TID 3, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 1.0 in stage
 0.0 (TID 1) on executor NODE-DSRV05.impetus.co.in:
 java.lang.NullPointerException (null) [duplicate 1]
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.0 in stage
 0.0 (TID 2) on executor NODE-DSRV05.impetus.co.in:
 java.lang.NullPointerException (null) [duplicate 2]
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.1 in
 stage 0.0 (TID 4, 

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
Include commons-math3 3.3 on the classpath while submitting the jar to the
Spark cluster, e.g.:
spark-submit --driver-class-path <commons-math3-3.3 jar> --class MainClass
--master <spark cluster url> <app jar>
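
(The same idea expressed as configuration properties, with illustrative paths.
Note that driver-side classpath additions generally have to be supplied before
the driver JVM starts, i.e. via spark-submit or spark-defaults.conf, so the
programmatic form below is mainly useful for the executor side:)

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("needs-commons-math3")
    // the jar must already exist at this path on the driver / on every worker node
    .set("spark.driver.extraClassPath",   "/path/to/commons-math3-3.3.jar")
    .set("spark.executor.extraClassPath", "/path/to/commons-math3-3.3.jar")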

On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User
List] ml-node+s1001560n19055...@n3.nabble.com wrote:

 My sbt file for the project includes this:

 libraryDependencies ++= Seq(
   "org.apache.spark"   %% "spark-core"    % "1.1.0",
   "org.apache.spark"   %% "spark-mllib"   % "1.1.0",
   "org.apache.commons"  % "commons-math3" % "3.3"
 )
 =

 Still I am getting this error:

 java.lang.NoClassDefFoundError:
 org/apache/commons/math3/random/RandomGenerator

 =

 The jar at location: ~/.m2/repository/org/apache/commons/commons-math3/3.3
 contains the random generator class:

  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
 org/apache/commons/math3/random/RandomGenerator.class
 org/apache/commons/math3/random/UniformRandomGenerator.class
 org/apache/commons/math3/random/SynchronizedRandomGenerator.class
 org/apache/commons/math3/random/AbstractRandomGenerator.class
 org/apache/commons/math3/random/RandomGeneratorFactory$1.class
 org/apache/commons/math3/random/RandomGeneratorFactory.class
 org/apache/commons/math3/random/StableRandomGenerator.class
 org/apache/commons/math3/random/NormalizedRandomGenerator.class
 org/apache/commons/math3/random/JDKRandomGenerator.class
 org/apache/commons/math3/random/GaussianRandomGenerator.class


 Please help



Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
As you are using sbt, you need not put the jar in the ~/.m2 Maven repository.
Include the jar explicitly using the --driver-class-path option while
submitting the jar to the Spark cluster.

On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User
List] ml-node+s1001560n1907...@n3.nabble.com wrote:

 It's still not working. I keep getting the same error.

 I even deleted the commons-math3/* folder containing the jar, and then
 under the directory org/apache/commons/ made a folder called 'math3'
 and put commons-math3-3.3.jar in it.
 Still it's not working.

 I also tried sc.addJar("/path/to/jar") within spark-shell and in my
 project source file; it still didn't pick up the jar in either place.

 Any fixes? Please help.

 On Mon, Nov 17, 2014 at 2:14 PM, Chitturi Padma [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19073i=0 wrote:

 Include the commons-math3/3.3 in class path while submitting jar to spark
 cluster. Like..
 spark-submit --driver-class-path maths3.3jar --class MainClass --master
 spark cluster url appjar

 On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark
 User List] [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19057i=0 wrote:

 My sbt file for the project includes this:

 libraryDependencies ++= Seq(
   "org.apache.spark"   %% "spark-core"    % "1.1.0",
   "org.apache.spark"   %% "spark-mllib"   % "1.1.0",
   "org.apache.commons"  % "commons-math3" % "3.3"
 )
 =

 Still I am getting this error:

 java.lang.NoClassDefFoundError:
 org/apache/commons/math3/random/RandomGenerator

 =

 The jar at location:
 ~/.m2/repository/org/apache/commons/commons-math3/3.3 contains the random
 generator class:

  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
 org/apache/commons/math3/random/RandomGenerator.class
 org/apache/commons/math3/random/UniformRandomGenerator.class
 org/apache/commons/math3/random/SynchronizedRandomGenerator.class
 org/apache/commons/math3/random/AbstractRandomGenerator.class
 org/apache/commons/math3/random/RandomGeneratorFactory$1.class
 org/apache/commons/math3/random/RandomGeneratorFactory.class
 org/apache/commons/math3/random/StableRandomGenerator.class
 org/apache/commons/math3/random/NormalizedRandomGenerator.class
 org/apache/commons/math3/random/JDKRandomGenerator.class
 org/apache/commons/math3/random/GaussianRandomGenerator.class


 Please help



Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Chitturi Padma
Which means the details are not persisted, and hence after any failure of the
workers or the master, the daemons wouldn't come back up with their previous
state... right?

On Wed, Oct 15, 2014 at 12:17 PM, Prashant Sharma [via Apache Spark User
List] ml-node+s1001560n16468...@n3.nabble.com wrote:

 [Removing dev lists]

 You are absolutely correct about that.

 Prashant Sharma



 On Tue, Oct 14, 2014 at 5:03 PM, Priya Ch [hidden email]
 http://user/SendEmail.jtp?type=nodenode=16468i=0 wrote:

 Hi Spark users/experts,

 In the Spark source code (Master.scala and Worker.scala), when registering
 the worker with the master, I see the usage of persistenceEngine. When we
 don't specify spark.deploy.recoveryMode explicitly, what is the default
 value used? This recovery mode is used to persist and restore the
 application and worker details.

  I see that when the recovery mode is not specified explicitly,
 BlackHolePersistenceEngine is being used. Am I right?


 Thanks,
 Padma Ch





Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
Is it possible to view the persisted RDD blocks?

If I use YARN, would the RDD blocks be persisted to HDFS, and would I then be
able to read those HDFS blocks the way I can in Hadoop?
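
For reference, a minimal sketch of the scenario under discussion (paths are
illustrative): blocks persisted with DISK_ONLY end up under spark.local.dir on
each worker, in hashed block files rather than user-readable files, and
nothing is written until an action runs, because persist is lazy.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("disk-persist-sketch")
    .set("spark.local.dir", "/home/padma/sparkdir")   // must exist on every worker

  val sc = new SparkContext(conf)
  val rdd = sc.textFile("hdfs:///path/to/input").persist(StorageLevel.DISK_ONLY)
  rdd.count()   // persist is lazy; an action has to run before any blocks hit disk
  sc.stop()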

On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List] 
ml-node+s1001560n14885...@n3.nabble.com wrote:

  Hi,



  spark.local.dir is the one used to write map output data and persisted RDD
 blocks, but the file paths are hashed, so you cannot directly find the
 persisted RDD block files; they will definitely be in those folders on your
 worker nodes.



 Thanks

 Jerry



 *From:* Priya Ch [mailto:[hidden email]
 http://user/SendEmail.jtp?type=nodenode=14885i=0]
 *Sent:* Tuesday, September 23, 2014 6:31 PM
 *To:* [hidden email] http://user/SendEmail.jtp?type=nodenode=14885i=1;
 [hidden email] http://user/SendEmail.jtp?type=nodenode=14885i=2
 *Subject:* spark.local.dir and spark.worker.dir not used



 Hi,



  I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to
 disk with rdd.persist(DISK_ONLY), but unfortunately I couldn't find the
 location where the RDD was written to disk. I pointed
 SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than
 the default /tmp directory, but still couldn't see anything in the worker
 directory or the Spark local directory.



 I also tried specifying the local dir and worker dir from the Spark code
 while defining the SparkConf, as conf.set("spark.local.dir",
 "/home/padma/sparkdir"), but the directories are not used.





 In general which directories spark would be using for map output files,
 intermediate writes and persisting rdd to disk ?





 Thanks,

 Padma Ch



Re: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Chitturi Padma
I couldn't even see the spark-<id> folder in the default /tmp directory used
as spark.local.dir.

On Tue, Sep 23, 2014 at 6:01 PM, Priya Ch learnings.chitt...@gmail.com
wrote:

 Is it possible to view the persisted RDD blocks ?

 If I use YARN, RDD blocks would be persisted to hdfs then will i be able
 to read the hdfs blocks as i could do in hadoop ?

 On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List]
 ml-node+s1001560n14885...@n3.nabble.com wrote:

  Hi,



 Spark.local.dir is the one used to write map output data and persistent
 RDD blocks, but the path of  file has been hashed, so you cannot directly
 find the persistent rdd block files, but definitely it will be in this
 folders on your worker node.



 Thanks

 Jerry



 *From:* Priya Ch [mailto:[hidden email]
 http://user/SendEmail.jtp?type=nodenode=14885i=0]
 *Sent:* Tuesday, September 23, 2014 6:31 PM
 *To:* [hidden email] http://user/SendEmail.jtp?type=nodenode=14885i=1;
 [hidden email] http://user/SendEmail.jtp?type=nodenode=14885i=2
 *Subject:* spark.local.dir and spark.worker.dir not used



 Hi,



 I am using spark 1.0.0. In my spark code i m trying to persist an rdd to
 disk as rrd.persist(DISK_ONLY). But unfortunately couldn't find the
 location where the rdd has been written to disk. I specified
 SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than
 using the default /tmp directory, but still couldnt see anything in worker
 directory andspark ocal directory.



 I also tried specifying the local dir and worker dir from the spark code
 while defining the SparkConf as conf.set(spark.local.dir,
 /home/padma/sparkdir) but the directories are not used.





 In general which directories spark would be using for map output files,
 intermediate writes and persisting rdd to disk ?





 Thanks,

 Padma Ch



Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Chitturi Padma
Hi,

I have a similar problem. I need matrix operations such as dot product,
cross product, transpose, and matrix multiplication to be performed on Spark.
Does Spark have a built-in API to support these?
I see a matrix factorization implementation in MLlib.
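
For what it's worth, a rough sketch of what MLlib's linalg package does offer
out of the box (local Matrices/Vectors plus the distributed RowMatrix); sc is
assumed to be an existing SparkContext and the values are illustrative:

  import org.apache.spark.mllib.linalg.{Matrices, Vectors}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0),
    Vectors.dense(3.0, 4.0)))
  val distributed = new RowMatrix(rows)

  // distributed-by-local matrix multiplication
  val product = distributed.multiply(Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)))

  // Gram matrix (A^T A) and SVD are also built in
  val gram = distributed.computeGramianMatrix()
  val svd  = distributed.computeSVD(2, computeU = true)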


On Fri, Aug 8, 2014 at 12:38 PM, yaochunnan [via Apache Spark User List] 
ml-node+s1001560n11765...@n3.nabble.com wrote:

 I think the eigenvalues and eigenvectors you are talking about is that of
 M^T*M or M*M^T, if we get M=U*s*V^T as SVD. What I want is to get
 eigenvectors and eigenvalues of M itself. Is this my misunderstanding of
 linear algebra or the API?

> M* M = V s* U* U s V* = V (s* s) V*
> M M* = U s V* V s* U* = U (s s*) U*



 2014-08-08 11:19 GMT+08:00 x [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=0:

 U.rows.toArray.take(1).foreach(println) and 
 V.toArray.take(s.size).foreach(println)
 are not eigenvectors corresponding to the biggest eigenvalue
 s.toArray(0)*s.toArray(0)?

 xj @ Tokyo


 On Fri, Aug 8, 2014 at 12:07 PM, Chunnan Yao [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=1 wrote:

 Hi there, what you've suggested are all meaningful. But to make myself
 clearer, my essential problems are:
 1. My matrix is asymmetric, and it is a probabilistic adjacency matrix,
 whose entries(a_ij) represents the likelihood that user j will broadcast
 the information generated by user i. Apparently, a_ij and a_ji is
 different, caus I love you doesn't necessarily mean you love me(What a sad
 story~). All entries are real.
 2. I know I can get eigenvalues through SVD. My problem is I can't get
 the corresponding eigenvectors, which requires solving equations, and I
 also need eigenvectors in my calculation.In my simulation of this paper, I
 only need the biggest eigenvalues and corresponding eigenvectors.
 The paper posted by Shivaram Venkataraman is also concerned about
 symmetric matrix. Could any one help me out?


 2014-08-08 9:41 GMT+08:00 x [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=2:

  The SVD computed result already contains descending order of singular
 values, you can get the biggest eigenvalue.

 ---

   val svd = matrix.computeSVD(matrix.numCols().toInt, computeU = true)
   val U: RowMatrix = svd.U
   val s: Vector = svd.s
   val V: Matrix = svd.V

   U.rows.toArray.take(1).foreach(println)

   println(s.toArray(0)*s.toArray(0))

   println(V.toArray.take(s.size).foreach(println))

 ---

 xj @ Tokyo


 On Fri, Aug 8, 2014 at 3:06 AM, Shivaram Venkataraman [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=3 wrote:

 If you just want to find the top eigenvalue / eigenvector you can do
 something like the Lanczos method. There is a description of a MapReduce
 based algorithm in Section 4.2 of [1]

 [1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf


 On Thu, Aug 7, 2014 at 10:54 AM, Li Pu [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=4 wrote:

 @Miles, the latest SVD implementation in mllib is partially
 distributed. Matrix-vector multiplication is computed among all workers,
 but the right singular vectors are all stored in the driver. If your
 symmetric matrix is n x n and you want the first k eigenvalues, you will
 need to fit n x k doubles in driver's memory. Behind the scene, it calls
 ARPACK to compute eigen-decomposition of A^T A. You can look into the
 source code for the details.

 @Sean, the SVD++ implementation in graphx is not the canonical
 definition of SVD. It doesn't have the orthogonality that SVD holds. But 
 we
 might want to use graphx as the underlying matrix representation for
 mllib.SVD to address the problem of skewed entry distribution.


 On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=5 wrote:

 Reza Zadeh has contributed the distributed implementation of
 (Tall/Skinny) SVD (
 http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
 which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in 
 Spark
 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your
 data is sparse (which it often is in social networks), you may have 
 better
 luck with this.

 I haven't tried the GraphX implementation, but those algorithms are
 often well-suited for power-law distributed graphs as you might see in
 social networks.

 FWIW, I believe you need to square elements of the sigma matrix from
 the SVD to get the eigenvalues.




 On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen [hidden email]
 http://user/SendEmail.jtp?type=nodenode=11765i=6 wrote:

 (-incubator, +user)

 If your matrix is symmetric (and real I presume), and if my linear
 algebra isn't too rusty, then its SVD is its eigendecomposition. The
 SingularValueDecomposition object you get back has U and V, both of
 which have columns that are the eigenvectors.

 There are a few SVDs in the 

Standalone cluster on Windows

2014-07-09 Thread Chitturi Padma
Hi, 

I wanted to set up a standalone cluster on a Windows machine, but
unfortunately the spark-master.cmd file is not available. Can someone suggest
how to proceed, or is spark-master.cmd missing from the Spark 1.0.0 release?


