Re: Master options Cluster/Client discrepancies.

2016-03-30 Thread Akhil Das
Have a look at
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

Thanks
Best Regards
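
In short, paraphrasing the page linked above (adjust the Hadoop profile to match your cluster), building Spark 1.6 against Scala 2.11 looks roughly like:

./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package

If spark-shell on the cluster still reports Scala 2.10.5, the binaries deployed there were most likely not built this way.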

On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

>
> Hi All,
>
> I have written a spark program on my dev box ,
>IDE:Intellij
>scala version:2.11.7
>spark version:1.6.1
>
> run fine from IDE, by providing proper input and output paths including
>  master.
>
> But when i try to deploy the code in my cluster made of below,
>
>Spark version:1.6.1
> built from source pkg using scala 2.11
> But when i try spark-shell on cluster i get scala version to be
> 2.10.5
>  hadoop yarn cluster 2.6.0
>
> and with additional options,
>
> --executor-memory
> --total-executor-cores
> --deploy-mode cluster/client
> --master yarn
>
> i get Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at com.movoto.SparkPost$.main(SparkPost.scala:36)
> at com.movoto.SparkPost.main(SparkPost.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> I understand this to be a Scala version issue, as I have faced this before.
>
> Is there something that I have to change and try to get the same
> program running on the cluster?
>
> Regards,
> Satyajit.
>
>


Re: aggregateByKey on PairRDD

2016-03-30 Thread Akhil Das
Isn't it what tempRDD.groupByKey does?

Thanks
Best Regards
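
For example, a minimal sketch in spark-shell, assuming the (brand, (product, key)) pairs from the question below:

val tempRDD = sc.parallelize(Seq(
  ("amazon", ("book1", "tech")),
  ("eBay",   ("book1", "tech")),
  ("barns",  ("book",  "tech")),
  ("amazon", ("book2", "tech"))))

// groupByKey gives RDD[(String, Iterable[(String, String)])]
val grouped = tempRDD.groupByKey()

// The same result with aggregateByKey, building a List per brand:
val aggregated = tempRDD.aggregateByKey(List.empty[(String, String)])(
  (acc, v) => v :: acc,   // fold a value into the per-partition list
  (a, b) => a ::: b)      // merge lists across partitions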

On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh 
wrote:

> Hi All,
>
> I have an RDD having the data in  the following form :
>
> tempRDD: RDD[(String, (String, String))]
>
> (brand , (product, key))
>
> ("amazon",("book1","tech"))
>
> ("eBay",("book1","tech"))
>
> ("barns",("book","tech"))
>
> ("amazon",("book2","tech"))
>
>
> I would like to group the data by Brand and would like to get the result
> set in the following format :
>
> resultSetRDD : RDD[(String, List[(String, String)])]
>
> I tried using aggregateByKey but am kind of not getting how to achieve
> this. Or is there any other way to achieve this?
>
> val resultSetRDD  = tempRDD.aggregateByKey("")({case (aggr , value) =>
> aggr + String.valueOf(value) + ","}, (aggr1, aggr2) => aggr1 + aggr2)
>
> resultSetRDD = (amazon,("book1","tech"),("book2","tech"))
>
> Thanks,
>
> Suniti
>


Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Akhil Das
Looks like the winutils.exe is missing from the environment, See
https://issues.apache.org/jira/browse/SPARK-2356

Thanks
Best Regards
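
As a possible workaround (an assumption on my part, not verified on your setup): place winutils.exe under some directory such as C:\hadoop\bin and point Hadoop at it before creating the context, e.g. in a standalone app:

import org.apache.spark.{SparkConf, SparkContext}

System.setProperty("hadoop.home.dir", "C:\\hadoop")   // hypothetical path containing bin\winutils.exe
val sc = new SparkContext(new SparkConf().setAppName("csv-test").setMaster("local[*]"))

In spark-shell you would instead set the HADOOP_HOME environment variable to that directory before launching.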

On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman  wrote:

> Hi,
>
> I am using the Spark 1.6.0 prebuilt-for-Hadoop-2.6.0 version on my Windows machine.
>
> I was trying to use the Databricks CSV format to read a CSV file. I used the
> below command.
>
> [image: Inline image 1]
>
> I got null pointer exception. Any help would be greatly appreciated.
>
> [image: Inline image 2]
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Akhil Das
You can achieve this with the normal RDD way. Have one extra stage in the
pipeline where you will properly standardize all the values (like replacing
doc with doctor) for all the columns before the join.

Thanks
Best Regards
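
For instance, a rough sketch assuming both tables are available as pair RDDs keyed by title, with a hand-built synonym map (the data and mapping here are made up):

val table1 = sc.parallelize(Seq(("doctor", "row1"), ("nurse", "row2")))
val table2 = sc.parallelize(Seq(("doc", "rowA"), ("nurse", "rowB")))

val synonyms = Map("doc" -> "doctor", "dr" -> "doctor")   // illustrative mapping
def standardize(title: String): String = {
  val t = title.trim.toLowerCase
  synonyms.getOrElse(t, t)
}

val joined = table1.map { case (t, r) => (standardize(t), r) }
  .join(table2.map { case (t, r) => (standardize(t), r) })
// then group/aggregate on the standardized title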

On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh 
wrote:

> Hi All,
>
> I have two tables with same schema but different data. I have to join the
> tables based on one column and then do a group by the same column name.
>
> Now the data in that column in the two tables might or might not exactly match.
> (Ex - the column name is "title". Table1.title = "doctor" and Table2.title
> = "doc"; doctor and doc are actually the same title.)
>
> From a performance point of view, where I have data volume in TB, I am not
> sure if I can achieve this using a SQL statement. What would be the best
> approach to solving this problem? Should I look at the MLlib APIs?
>
> Spark Gurus any pointers?
>
> Thanks,
> Suniti
>
>
>


Re: How do we run that PR auto-close script again?

2016-02-22 Thread Akhil Das
This?
http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-close-of-PR-s-td15862.html

Thanks
Best Regards

On Mon, Feb 22, 2016 at 2:47 PM, Sean Owen  wrote:

> I know Patrick told us at some point, but I can't find the email or
> wiki that describes how to run the script that auto-closes PRs with
> "do you mind closing this PR". Does anyone know? I think it's been a
> long time since it was run.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using distinct count in over clause

2016-01-27 Thread Akhil Das
Does it support over? I couldn't find it in the documentation
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

Thanks
Best Regards
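
As a point of comparison (not a windowed equivalent of the query below), a plain grouped distinct count does work through the DataFrame API, e.g. in spark-shell:

import org.apache.spark.sql.functions.countDistinct

val df = sqlContext.createDataFrame(Seq(("a1", "b1"), ("a1", "b2"), ("a2", "b1"))).toDF("a", "b")
df.groupBy("a").agg(countDistinct("b")).show()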

On Fri, Jan 22, 2016 at 2:31 PM, 汪洋  wrote:

> I think it cannot be right.
>
> On Jan 22, 2016, at 4:53 PM, 汪洋 wrote:
>
> Hi,
>
> Do we support distinct count in the over clause in spark sql?
>
> I ran a sql like this:
>
> select a, count(distinct b) over ( order by a rows between unbounded
> preceding and current row) from table limit 10
>
> Currently, it returns an error saying: expression 'a' is neither present in
> the group by, nor is it an aggregate function. Add to group by or wrap in
> first() if you don't care which value you get.;
>
> Yang
>
>
>


Re: Generate Amplab queries set

2016-01-27 Thread Akhil Das
Have a look at the TPC-H queries; I found this repository with the queries.
https://github.com/ssavvides/tpch-spark

Thanks
Best Regards

On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa 
wrote:

> Hi,
> I have downloaded the Amplab benchmark dataset from
> s3n://big-data-benchmark/pavlo/text/tiny, but I don't know how to generate
> a
> set of random mixed queries of different types like scan,aggregate and
> join.
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Generate-Amplab-queries-set-tp16071.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: security testing on spark ?

2015-12-18 Thread Akhil Das
If the port 7077 is open for public on your cluster, that's all you need to
take over the cluster. You can read a bit about it here
https://www.sigmoid.com/securing-apache-spark-cluster/

You can also look at this small exploit I wrote
https://www.exploit-db.com/exploits/36562/

Thanks
Best Regards

On Wed, Dec 16, 2015 at 6:46 AM, Judy Nash 
wrote:

> Hi all,
>
>
>
> Does anyone know of any effort from the community on security testing
> spark clusters.
>
> I.e.
>
> Static source code analysis to find security flaws
>
> Penetration testing to identify ways to compromise spark cluster
>
> Fuzzing to crash spark
>
>
>
> Thanks,
>
> Judy
>
>
>


Re: Spark basicOperators

2015-12-18 Thread Akhil Das
You can pretty much measure it from the event timeline listed in the driver
UI; you can click on jobs/tasks and get the time that each of them took
from there.

Thanks
Best Regards

On Thu, Dec 17, 2015 at 7:27 AM, sara mustafa 
wrote:

> Hi,
>
> The class org.apache.spark.sql.execution.basicOperators.scala contains the
> implementation of multiple operators, how could I measure the execution
> time
> of any operator?
>
> thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-basicOperators-tp15672.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Multiplication on decimals in a dataframe query

2015-12-02 Thread Akhil Das
Not quite sure what's happening, but it's not an issue with multiplication, I
guess, as the following query worked for me:

trades.select(trades("price")*9.5).show
+-------------+
|(price * 9.5)|
+-------------+
|        199.5|
|        228.0|
|        190.0|
|        199.5|
|        190.0|
|        256.5|
|        218.5|
|        275.5|
|        218.5|
..
..


Could it be with the precision? CCing the dev list; maybe you can open up a
JIRA for this as it seems to be a bug.

Thanks
Best Regards
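
If it does turn out to be precision, one workaround to try (this is an assumption about the cause, not a confirmed fix) is giving the columns an explicit, narrower precision before multiplying:

import org.apache.spark.sql.types.DecimalType

val narrowed = trades
  .withColumn("price", trades("price").cast(DecimalType(10, 2)))
  .withColumn("quantity", trades("quantity").cast(DecimalType(10, 2)))

narrowed.select(narrowed("price") * narrowed("quantity")).show()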

On Mon, Nov 30, 2015 at 12:41 AM, Philip Dodds 
wrote:

> I hit a weird issue when I tried to multiply to decimals in a select
> (either in scala or as SQL), and Im assuming I must be missing the point.
>
> The issue is fairly easy to recreate with something like the following:
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
>
> case class Trade(quantity: Decimal,price: Decimal)
>
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>
>   Trade(quantity, price)
> }
>
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
>
> trades.select(trades("price")*trades("quantity")).show
>
> sqlContext.sql("select
> price/quantity,price*quantity,price+quantity,price-quantity from
> trades").show
>
> The odd part is if you run it you will see that the addition/division and
> subtraction works but the multiplication returns a null.
>
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
>
> ie.
>
> +------------------+
> |(price * quantity)|
> +------------------+
> |              null|
> |              null|
> |              null|
> |              null|
> |              null|
> +------------------+
>
>
> +--------------------+----+--------+--------+
> |                 _c0| _c1|     _c2|     _c3|
> +--------------------+----+--------+--------+
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|    8.00|
> |1.272727272727272727|null|50.00...|    6.00|
> |                0.83|null|44.00...|-4.00...|
> |                1.00|null|58.00...|   0E-18|
> +--------------------+----+--------+--------+
>
>
> Just keen to know what I did wrong?
>
>
> Cheers
>
> P
>
> --
> Philip Dodds
>
>
>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are
having a hard time managing  the memory.

Thanks
Best Regards

On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman  wrote:

> [adding dev list since it's probably a bug, but i'm not sure how to
> reproduce so I can open a bug about it]
>
> Hi,
>
> I have a standalone Spark 1.4.0 cluster with 100s of applications running
> every day.
>
> From time to time, the applications crash with the following error (see
> below)
> But at the same time (and also after that), other applications are
> running, so I can safely assume the master and workers are working.
>
> 1. why is there a NullPointerException? (i can't track the scala stack
> trace to the code, but anyway NPE is usually a obvious bug even if there's
> actually a network error...)
> 2. why can't it connect to the master? (if it's a network timeout, how to
> increase it? i see the values are hardcoded inside AppClient)
> 3. how to recover from this error?
>
>
>   ERROR 01-11 15:32:54,991SparkDeploySchedulerBackend - Application
> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>   ERROR 01-11 15:32:55,087  OneForOneStrategy - ERROR
> logs/error.log
>   java.lang.NullPointerException NullPointerException
>   at
> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>   at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>   at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at
> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>   ERROR 01-11 15:32:55,603   SparkContext - Error
> initializing SparkContext. ERROR
>   java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext
>   at org.apache.spark.SparkContext.org
> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>   at
> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>   at
> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>   at org.apache.spark.SparkContext.(SparkContext.scala:543)
>   at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
>
>
> Thanks!
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>


Re: Guidance to get started

2015-11-09 Thread Akhil Das
You can read the installation details from here
http://spark.apache.org/docs/latest/

You can read about contributing to spark from here
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Thanks
Best Regards

On Thu, Oct 29, 2015 at 3:53 PM, Aaska Shah  wrote:

> Hello,my name is Aaska Shah and I am a second year undergrad student at
> DAIICT,Gandhinagar,India.
>
> I have quite lately been interested in contributing towards the open
> source organization and I find your organization the most appropriate one.
>
> I request you to please guide me through how to install your codebase and
> how to get started to your organization.
>
> Thanking You,
> Aaska Shah
>


Re: sample or takeSample or ??

2015-11-09 Thread Akhil Das
You can't create a new RDD just by selecting a few elements. rdd.take(n),
takeSample etc. are actions and will trigger your entire pipeline to be
executed.
You can however do something like this, I guess:

val sample_data = rdd.take(10)

val sample_rdd = sc.parallelize(sample_data)
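
If an approximate size is acceptable, sample() also returns an RDD directly, but it takes a fraction rather than an exact count:

val approxSample = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)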



Thanks
Best Regards

On Thu, Oct 29, 2015 at 10:45 AM, 张志强(旺轩)  wrote:

> How do I get a NEW RDD that has the number of elements that I specified?
> sample()? It has no number parameter, and takeSample() returns a list?
>
>
>
> Help, please.
>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Did you find anything regarding the OOM in the executor logs?

Thanks
Best Regards

On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman <r...@totango.com> wrote:

> If they have a problem managing memory, wouldn't there should be a OOM?
> Why does AppClient throw a NPE?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Is that all you have in the executor logs? I suspect some of those jobs
>> are having a hard time managing  the memory.
>>
>> Thanks
>> Best Regards
>>
>> On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman <r...@totango.com> wrote:
>>
>>> [adding dev list since it's probably a bug, but i'm not sure how to
>>> reproduce so I can open a bug about it]
>>>
>>> Hi,
>>>
>>> I have a standalone Spark 1.4.0 cluster with 100s of applications
>>> running every day.
>>>
>>> From time to time, the applications crash with the following error (see
>>> below)
>>> But at the same time (and also after that), other applications are
>>> running, so I can safely assume the master and workers are working.
>>>
>>> 1. why is there a NullPointerException? (i can't track the scala stack
>>> trace to the code, but anyway NPE is usually a obvious bug even if there's
>>> actually a network error...)
>>> 2. why can't it connect to the master? (if it's a network timeout, how
>>> to increase it? i see the values are hardcoded inside AppClient)
>>> 3. how to recover from this error?
>>>
>>>
>>>   ERROR 01-11 15:32:54,991SparkDeploySchedulerBackend - Application
>>> has been killed. Reason: All masters are unresponsive! Giving up. ERROR
>>>   ERROR 01-11 15:32:55,087  OneForOneStrategy - ERROR
>>> logs/error.log
>>>   java.lang.NullPointerException NullPointerException
>>>   at
>>> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>>>   at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>   at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>   at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>   at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>   at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>   at
>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>   at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>   at
>>> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>   at
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>>   at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>   at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>   at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>   at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>   ERROR 01-11 15:32:55,603   SparkContext - Error
>>> initializing SparkContext. ERROR
>>>   java.lang.IllegalStateException: Cannot call methods on a stopped
>>> SparkContext
>>>   at org.apache.spark.SparkContext.org
>>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>   at
>>> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>>>   at
>>> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>>>   at org.apache.spark.SparkContext.(SparkContext.scala:543)
>>>   at
>>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
>>>
>>>
>>> Thanks!
>>>
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> http://www.totango.com
>>>
>>
>>
>


Re: Unable to run applications on spark in standalone cluster mode

2015-11-01 Thread Akhil Das
Can you paste the contents of your spark-env.sh file? It would also be good to
have a look at the /etc/hosts file. "Cannot bind to the given IP address" can
often be resolved by putting the hostname instead of the IP address. Also make
sure the configuration (conf directory) across your cluster has the same
contents.

Thanks
Best Regards

On Mon, Oct 26, 2015 at 10:48 AM, Rohith P 
wrote:

> No.. the ./sbin/start-master.sh --ip option did not work... It is still the
> same error
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Unable-to-run-applications-on-spark-in-standalone-cluster-mode-tp14683p14779.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Guaranteed processing orders of each batch in Spark Streaming

2015-10-22 Thread Akhil Das
I guess the order is guaranteed unless you set
the spark.streaming.concurrentJobs to a higher number than 1.

Thanks
Best Regards
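
For reference, a minimal sketch of pinning that setting explicitly (1 is the default and keeps batches strictly ordered):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ordered-streaming")
  .set("spark.streaming.concurrentJobs", "1")   // values > 1 let jobs from different batches overlap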

On Mon, Oct 19, 2015 at 12:28 PM, Renjie Liu 
wrote:

> Hi, all:
> I've read source code and it seems that there is no guarantee that the
> order of processing of each RDD is guaranteed since jobs are just submitted
> to a thread pool. I  believe that this is quite important in streaming
> since updates should be ordered.
>
>


Re: Too many executors are created

2015-10-11 Thread Akhil Das
For some reason the executors are getting killed,

15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
app-20150929120924-/24463 is now EXITED (Command exited with code 1)

Can you paste your spark-submit command? You can also look in the executor
logs and see what's going on.

Thanks
Best Regards

On Wed, Sep 30, 2015 at 12:53 AM, Ulanov, Alexander <
alexander.ula...@hpe.com> wrote:

> Dear Spark developers,
>
>
>
> I have created a simple Spark application for spark submit. It calls a
> machine learning library from Spark MLlib that is executed in a number of
> iterations that correspond to the same number of task in Spark. It seems
> that Spark creates an executor for each task and then removes it. The
> following messages indicate this in my log:
>
>
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24463 is now RUNNING
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24463 is now EXITED (Command exited with code 1)
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Executor
> app-20150929120924-/24463 removed: Command exited with code 1
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 24463
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor added:
> app-20150929120924-/24464 on worker-20150929120330-16.111.35.101-46374 (
> 16.111.35.101:46374) with 12 cores
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20150929120924-/24464 on hostPort 16.111.35.101:46374 with 12
> cores, 30.0 GB RAM
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24464 is now LOADING
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24464 is now RUNNING
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24464 is now EXITED (Command exited with code 1)
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Executor
> app-20150929120924-/24464 removed: Command exited with code 1
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 24464
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor added:
> app-20150929120924-/24465 on worker-20150929120330-16.111.35.101-46374 (
> 16.111.35.101:46374) with 12 cores
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20150929120924-/24465 on hostPort 16.111.35.101:46374 with 12
> cores, 30.0 GB RAM
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24465 is now LOADING
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24465 is now EXITED (Command exited with code 1)
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Executor
> app-20150929120924-/24465 removed: Command exited with code 1
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 24465
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor added:
> app-20150929120924-/24466 on worker-20150929120330-16.111.35.101-46374 (
> 16.111.35.101:46374) with 12 cores
>
> 15/09/29 12:21:02 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20150929120924-/24466 on hostPort 16.111.35.101:46374 with 12
> cores, 30.0 GB RAM
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24466 is now LOADING
>
> 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated:
> app-20150929120924-/24466 is now RUNNING
>
>
>
> It end up creating and removing thousands of executors. Is this a normal
> behavior?
>
>
>
> If I run the same code within spark-shell, this does not happen. Could you
> suggest what might be wrong in my setting?
>
>
>
> Best regards, Alexander
>


Re: using JavaRDD in spark-redis connector

2015-09-30 Thread Akhil Das
You can create a JavaRDD as normal and then call .rdd() on it to get the underlying RDD.

Thanks
Best Regards
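
For instance, a minimal sketch in Java (jsc here is a hypothetical existing JavaSparkContext):

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;

JavaRDD<Tuple2<String, String>> javaRdd =
    jsc.parallelize(Arrays.asList(new Tuple2<String, String>("key", "value")));
RDD<Tuple2<String, String>> scalaRdd = javaRdd.rdd();   // same data, viewed as a Scala RDD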

On Mon, Sep 28, 2015 at 9:01 PM, Rohith P 
wrote:

> Hi all,
>   I am trying to work with spark-redis connector (redislabs) which
> requires all transactions between redis and spark be in RDD's. The language
> I am using is Java but the connector does not accept JavaRDD's .So I tried
> using Spark context in my code instead of JavaSparkContext. But when I
> wanted to create a RDD using sc.parallelize , it asks for some scala
> related
> parameters as opposed to lists in java when I tries to have both
> javaSparkContext and sparkcontext(for connector) then Multiple contexts
> cannot be opened was the error
>  The code that I have been trying 
>
>
> // initialize spark context
> private static RedisContext config() {
> conf = new SparkConf().setAppName("redis-jedis");
> sc2=new SparkContext(conf);
> RedisContext rc=new RedisContext(sc2);
> return rc;
>
> }
> //write to redis which requires the data to be in RDD
> private static void WriteUserTacticData(RedisContext rc, String
> userid,
> String tacticsId, String value) {
> hostTup= calling(redisHost,redisPort);
> String key=userid+"-"+tacticsId;
> RDD<Tuple2<String, String>> newTup = createTuple(key, value);
> rc.toRedisKV(newTup, hostTup);
> }
>
> // the createTuple method where the RDD is to be created, which will be
> // inserted into redis
> private static RDD<Tuple2<String, String>> createTuple(String key,
> String value) {
> sc = new JavaSparkContext(conf);
> ArrayList<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
> Tuple2<String, String> e = new Tuple2<String, String>(key, value);
> list.add(e);
> JavaRDD<Tuple2<String, String>> javardd = sc.parallelize(list);
> RDD<Tuple2<String, String>> newTupRdd = JavaRDD.toRDD(javardd);
> sc.close();
> return newTupRdd;
> }
>
>
>
> How would I create an RDD (not a JavaRDD) in Java which will be accepted by
> the redis connector... Any kind of help related to the topic would be
> appreciated..
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/using-JavaRDD-in-spark-redis-connector-tp14391.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: unsubscribe

2015-09-25 Thread Akhil Das
Send an email to dev-unsubscr...@spark.apache.org instead of
dev@spark.apache.org

Thanks
Best Regards

On Fri, Sep 25, 2015 at 4:00 PM, Nirmal R Kumar 
wrote:

>


Re: Spark Streaming..Exception

2015-09-14 Thread Akhil Das
You should consider upgrading your spark from 1.3.0 to a higher version.

Thanks
Best Regards

On Mon, Sep 14, 2015 at 2:28 PM, Priya Ch 
wrote:

> Hi All,
>
>  I came across the related old conversation on the above issue (
> https://issues.apache.org/jira/browse/SPARK-5594. ) Is the issue fixed? I
> tried different values for spark.cleaner.ttl  -> 0sec, -1sec,
> 2000sec,..none of them worked. I also tried setting
> spark.streaming.unpersist -> true. What is the possible solution for this ?
> Is this a bug in Spark 1.3.0? Changing the scheduling mode to Stand-alone
> or Mesos mode would work fine ??
>
> Please someone share your views on this.
>
> On Sat, Sep 12, 2015 at 11:04 PM, Priya Ch 
> wrote:
>
>> Hello All,
>>
>>  When I push messages into kafka and read into streaming application, I
>> see the following exception-
>>  I am running the application on YARN and no where broadcasting the
>> message within the application. Just simply reading message, parsing it and
>> populating fields in a class and then printing the dstream (using
>> DStream.print).
>>
>>  Have no clue if this is cluster issue or spark version issue or node
>> issue. The strange part is, sometimes the message is processed but
>> sometimes I see the below exception -
>>
>> java.io.IOException: org.apache.spark.SparkException: Failed to get
>> broadcast_5_piece0 of broadcast_5
>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
>> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.spark.SparkException: Failed to get
>> broadcast_5_piece0 of broadcast_5
>> at
>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>> at scala.Option.getOrElse(Option.scala:120)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at org.apache.spark.broadcast.TorrentBroadcast.org
>> 
>> $apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1152)
>>
>>
>> I would be glad if someone can throw some light on this.
>>
>> Thanks,
>> Padma Ch
>>
>>
>


Re: Detecting configuration problems

2015-09-08 Thread Akhil Das
I found an old JIRA referring to the same issue.
https://issues.apache.org/jira/browse/SPARK-5421

Thanks
Best Regards

On Sun, Sep 6, 2015 at 8:53 PM, Madhu  wrote:

> I'm not sure if this has been discussed already, if so, please point me to
> the thread and/or related JIRA.
>
> I have been running with about 1TB volume on a 20 node D2 cluster (255
> GiB/node).
> I have uniformly distributed data, so skew is not a problem.
>
> I found that default settings (or wrong setting) for driver and executor
> memory caused out of memory exceptions during shuffle (subtractByKey to be
> exact). This was not easy to track down, for me at least.
>
> Once I bumped up driver to 12G and executor to 10G with 300 executors and
> 3000 partitions, shuffle worked quite well (12 mins for subtractByKey). I'm
> sure there are more improvement to made, but it's a lot better than heap
> space exceptions!
>
> From my reading, the shuffle OOM problem is in ExternalAppendOnlyMap or
> similar disk backed collection.
> I have some familiarity with that code based on previous work with external
> sorting.
>
> Is it possible to detect misconfiguration that leads to these OOMs and
> produce a more meaningful error messages? I think that would really help
> users who might not understand all the inner workings and configuration of
> Spark (myself included). As it is, heap space issues are a challenge and
> does not present Spark in a positive light.
>
> I can help with that effort if someone is willing to point me to the
> precise
> location of memory pressure during shuffle.
>
> Thanks!
>
>
>
> -
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Detecting-configuration-problems-tp13980.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: OOM in spark driver

2015-09-04 Thread Akhil Das
Or you can increase the driver heap space (export _JAVA_OPTIONS="-Xmx5g")

Thanks
Best Regards

On Wed, Sep 2, 2015 at 11:57 PM, Mike Hynes <91m...@gmail.com> wrote:

> Just a thought; this has worked for me before on standalone client
> with a similar OOM error in a driver thread. Try setting:
> export SPARK_DAEMON_MEMORY=4G #or whatever size you can afford on your
> machine
> in your environment/spark-env.sh before running spark-submit.
> Mike
>
> On 9/2/15, ankit tyagi  wrote:
> > Hi All,
> >
> > I am using spark-sql 1.3.1 with hadoop 2.4.0 version.  I am running sql
> > query against parquet files and wanted to save result on s3 but looks
> like
> > https://issues.apache.org/jira/browse/SPARK-2984 problem still coming
> while
> > saving data to s3.
> >
> > Hence Now i am saving result on hdfs and with the help
> > of JavaSparkListener, copying file from hdfs to s3 with hadoop fileUtil
> > in onApplicationEnd method. But  my job is getting failed with OOM in
> spark
> > driver.
> >
> > *5/09/02 04:17:57 INFO cluster.YarnClusterSchedulerBackend: Asking each
> > executor to shut down*
> > *15/09/02 04:17:59 INFO
> > scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor:
> > OutputCommitCoordinator stopped!*
> > *Exception in thread "Reporter" *
> > *Exception: java.lang.OutOfMemoryError thrown from the
> > UncaughtExceptionHandler in thread "Reporter"*
> > *Exception in thread "SparkListenerBus" *
> > *Exception: java.lang.OutOfMemoryError thrown from the
> > UncaughtExceptionHandler in thread "SparkListenerBus"*
> > *Exception in thread "Driver" *
> > *Exception: java.lang.OutOfMemoryError thrown from the
> > UncaughtExceptionHandler in thread "Driver"*
> >
> >
> > Strage part is, result is getting saved on HDFS but while copying file
> job
> > is getting failed. size of file is under 1MB.
> >
> > Any help or leads would be appreciated.
> >
>
>
> --
> Thanks,
> Mike
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: IOError on createDataFrame

2015-08-31 Thread Akhil Das
Why not attach a bigger hard disk to the machines and point your
SPARK_LOCAL_DIRS to it?

Thanks
Best Regards
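
For reference, the scratch location can also be set from the conf (SPARK_LOCAL_DIRS in the environment takes precedence over this property if both are set); the path below is made up:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.local.dir", "/mnt/bigdisk/spark-tmp")   // hypothetical larger volume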

On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti 
wrote:

> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas dataframe I encountered the error
>
> TypeError: cannot create an RDD from type: 
>
> However looking into the code shows this is raised from a generic "except
> Exception:" predicate (pyspark/sql/context.py:238 in spark-1.4.1). A
> debugging session reveals the true error is SPARK_LOCAL_DIRS ran out of
> space:
>
> -> rdd = self._sc.parallelize(data)
> (Pdb)
> *IOError: (28, 'No space left on device')*
>
> In this case, creating an RDD from a large matrix (~50mill rows) is
> required
> for us. I'm a bit concerned about spark's process here:
>
>a. turning the dataframe into records (data.to_records)
>b. writing it to tmp
>c. reading it back again in scala.
>
> Is there a better way? The intention would be to operate on slices of this
> large dataframe using numpy operations via spark's transformations and
> actions.
>
> Thanks,
> FDS
>
> 1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-25 Thread Akhil Das
You can add it to Spark Packages, I guess: http://spark-packages.org/

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai pishe...@gmail.com wrote:

 Sorry for previous line-breaking format, try to resend the mail again.

 I have written a sbt plugin called spark-deployer, which is able to deploy
 a standalone spark cluster on aws ec2 and submit jobs to it.
 https://github.com/pishen/spark-deployer

 Compared to current spark-ec2 script, this design may have several
 benefits (features):
 1. All the code are written in Scala.
 2. Just add one line in your project/plugins.sbt and you are ready to go.
 (You don't have to download the python code and store it at someplace.)
 3. The whole development flow (write code for spark job, compile the code,
 launch the cluster, assembly and submit the job to master, terminate the
 cluster when the job is finished) can be done in sbt.
 4. Support parallel deployment of the worker machines by Scala's Future.
 5. Allow dynamically add or remove worker machines to/from the current
 cluster.
 6. All the configurations are stored in a typesafe config file. You don't
 need to store it elsewhere and map the settings into spark-ec2's command
 line arguments.
 7. The core library is separated from sbt plugin, hence it's possible to
 execute the deployment from an environment without sbt (only JVM is
 required).
 8. Support adjustable ec2 root disk size, custom security groups, custom
 ami (can run on default Amazon ami), custom spark tarball, and VPC. (Well,
 most of these are also supported in spark-ec2 in slightly different form,
 just mention it anyway.)

 Since this project is still in its early stage, it lacks some features of
 spark-ec2 such as self-installed HDFS (we use s3 directly), stoppable
 cluster, ganglia, and the copy script.
 However, it's already usable for our company and we are trying to move our
 production spark projects from spark-ec2 to spark-deployer.

 Any suggestion, testing help, or pull request are highly appreciated.

 On top of that, I would like to contribute this project to Spark, maybe as
 another choice (suggestion link) alongside spark-ec2 on Spark's official
 documentation.
 Of course, before that, I have to make this project stable enough (strange
 errors just happen on aws api from time to time).
 I'm wondering if this kind of contribution is possible and is there any
 rule to follow or anyone to contact?
 (Maybe the source code will not be merged into spark's main repository,
 since I've noticed that spark-ec2 is also planning to move out.)

 Regards,
 Pishen Tsai




Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Yep, and it works fine for operations which do not involve any shuffle
(like foreach, count etc.), while those which involve shuffle operations end
up in an infinite loop. Spark should somehow indicate this instead of going
into an infinite loop.

Thanks
Best Regards

On Thu, Aug 13, 2015 at 11:37 PM, Imran Rashid iras...@cloudera.com wrote:

 oh I see, you are defining your own RDD & Partition types, and you had a
 bug where partition.index did not line up with the partitions slot in
 rdd.getPartitions.  Is that correct?

 On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 I figured that out, And these are my findings:

 - It just enters in an infinite loop when there's a duplicate partition
 id.

 - It enters in an infinite loop when the partition id starts from 1
 rather than 0


 Something like this piece of code can reproduce it: (in getPartitions())

 val total_partitions = 4
 val partitionsArray: Array[Partition] =
 Array.ofDim[Partition](total_partitions)

 var i = 0

 for(outer <- 0 to 1){
   for(partition <- 1 to total_partitions){
 partitionsArray(i) = new DeadLockPartitions(partition)
 i = i + 1
   }
 }

 partitionsArray
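
 For contrast, a sketch of a getPartitions() that respects the contract mentioned above (partition i sits at slot i, indices start at 0, no duplicates), placed inside a custom RDD subclass; MyPartition is a stand-in for your own partition class:

 import org.apache.spark.Partition

 case class MyPartition(index: Int) extends Partition

 override def getPartitions: Array[Partition] =
   (0 until 4).map(i => MyPartition(i): Partition).toArray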




 Thanks
 Best Regards

 On Wed, Aug 12, 2015 at 10:57 PM, Imran Rashid iras...@cloudera.com
 wrote:

 yikes.

 Was this a one-time thing?  Or does it happen consistently?  can you
 turn on debug logging for o.a.s.scheduler (dunno if it will help, but maybe
 ...)

 On Tue, Aug 11, 2015 at 8:59 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Hi

 My Spark job (running in local[*] with spark 1.4.1) reads data from a
 thrift server(Created an RDD, it will compute the partitions in
 getPartitions() call and in computes hasNext will return records from these
 partitions), count(), foreach() is working fine it returns the correct
 number of records. But whenever there is shuffleMap stage (like reduceByKey
 etc.) then all the tasks are executing properly but it enters in an
 infinite loop saying :


1. 15/08/11 13:05:54 INFO DAGScheduler: Resubmitting
ShuffleMapStage 1 (map at FilterMain.scala:59) because some of its
tasks had failed: 0, 3


 Here's the complete stack-trace http://pastebin.com/hyK7cG8S

 What could be the root cause of this problem? I looked up and bumped
 into this closed JIRA https://issues.apache.org/jira/browse/SPARK-583
 (which is very very old)




 Thanks
 Best Regards







Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Thanks for the clarifications, Mridul.

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 What I understood from Imran's mail (and what was referenced in his
 mail) the RDD mentioned seems to be violating some basic contracts on
 how partitions are used in spark [1].
 They cannot be arbitrarily numbered,have duplicates, etc.


 Extending RDD to add functionality is typically for niche cases; and
 requires subclasses to adhere to the explicit (and implicit)
 contracts/lifecycles for them.
 Using existing RDD's as template would be a good idea for
 customizations - one way to look at it is, using RDD is more in api
 space but extending them is more in spi space.

 Violations would actually not even be detectable by spark-core in general
 case.


 Regards,
 Mridul

 [1] Ignoring the array out of bounds, etc - I am assuming the intent
 is to show overlapping partitions, duplicates. index to partition
 mismatch - that sort of thing.


 On Thu, Aug 13, 2015 at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:
  Yep, and it works fine for operations which does not involve any shuffle
  (like foreach,, count etc) and those which involves shuffle operations
 ends
  up in an infinite loop. Spark should somehow indicate this instead of
 going
  in an infinite loop.
 
  Thanks
  Best Regards
 
  On Thu, Aug 13, 2015 at 11:37 PM, Imran Rashid iras...@cloudera.com
 wrote:
 
  oh I see, you are defining your own RDD  Partition types, and you had a
  bug where partition.index did not line up with the partitions slot in
  rdd.getPartitions.  Is that correct?
 
  On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:
 
  I figured that out, And these are my findings:
 
  - It just enters in an infinite loop when there's a duplicate
 partition
  id.
 
  - It enters in an infinite loop when the partition id starts from 1
  rather than 0
 
 
  Something like this piece of code can reproduce it: (in
 getPartitions())
 
  val total_partitions = 4
  val partitionsArray: Array[Partition] =
  Array.ofDim[Partition](total_partitions)
 
  var i = 0
 
  for(outer <- 0 to 1){
    for(partition <- 1 to total_partitions){
  partitionsArray(i) = new DeadLockPartitions(partition)
  i = i + 1
}
  }
 
  partitionsArray
 
 
 
 
  Thanks
  Best Regards
 
  On Wed, Aug 12, 2015 at 10:57 PM, Imran Rashid iras...@cloudera.com
  wrote:
 
  yikes.
 
  Was this a one-time thing?  Or does it happen consistently?  can you
  turn on debug logging for o.a.s.scheduler (dunno if it will help, but
 maybe
  ...)
 
  On Tue, Aug 11, 2015 at 8:59 AM, Akhil Das 
 ak...@sigmoidanalytics.com
  wrote:
 
  Hi
 
  My Spark job (running in local[*] with spark 1.4.1) reads data from a
  thrift server(Created an RDD, it will compute the partitions in
  getPartitions() call and in computes hasNext will return records
 from these
  partitions), count(), foreach() is working fine it returns the
 correct
  number of records. But whenever there is shuffleMap stage (like
 reduceByKey
  etc.) then all the tasks are executing properly but it enters in an
 infinite
  loop saying :
 
  15/08/11 13:05:54 INFO DAGScheduler: Resubmitting ShuffleMapStage 1
  (map at FilterMain.scala:59) because some of its tasks had failed:
 0, 3
 
 
  Here's the complete stack-trace http://pastebin.com/hyK7cG8S
 
  What could be the root cause of this problem? I looked up and bumped
  into this closed JIRA (which is very very old)
 
 
 
 
  Thanks
  Best Regards
 
 
 
 
 



Re: Switch from Sort based to Hash based shuffle

2015-08-13 Thread Akhil Das
Have a look at spark.shuffle.manager; you can switch between sort and hash
with this configuration.

From the configuration docs: spark.shuffle.manager (default: sort) - Implementation
to use for shuffling data. There are two implementations available: sort and hash.
Sort-based shuffle is more memory-efficient and is the default option starting in 1.2.

Thanks
Best Regards
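
A minimal sketch of switching it without touching the source:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.manager", "hash")   // or "sort", the default since 1.2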

On Thu, Aug 13, 2015 at 2:56 PM, cheez 11besemja...@seecs.edu.pk wrote:

 I understand that the current master branch of Spark uses Sort based
 shuffle.
 Is there a way to change that to Hash based shuffle, just for experimental
 purposes by modifying the source code ?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Switch-from-Sort-based-to-Hash-based-shuffle-tp13661.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Pushing Spark to 10Gb/s

2015-08-11 Thread Akhil Das
Hi Starch,

It also depends on the application's behavior; some might not be able to
properly utilize the network. If you are using, say, Kafka, then one thing
that you should keep in mind is the size of the individual messages and the
number of partitions that you are using. The higher the message size and the
higher the number of partitions (in Kafka), the better the network utilization.
With this combination, we have operated a few pipelines running at 10Gb/s (~
1GB/s).

Thanks
Best Regards

On Tue, Aug 11, 2015 at 12:24 AM, Starch, Michael D (398M) 
michael.d.sta...@jpl.nasa.gov wrote:

 All,

 I am trying to get data moving in and out of spark at 10Gb/s. I currently
 have a very powerful cluster to work on, offering 40Gb/s inifiniband links
 so I believe the network pipe should be fast enough.

 Has anyone gotten spark operating at high data rates before? Any advice
 would be appreciated.

 -Michael Starch
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Inquiry about contributing code

2015-08-11 Thread Akhil Das
You can create a new issue and send a pull request for the same, I think.

+ dev list

Thanks
Best Regards

On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:

 Dear Sir / Madam,

 I have a plan to contribute some codes about passing filters to a
 datasource as physical planning.

 In more detail, I understand when we want to build up filter operations
 from data like Parquet (when actually reading and filtering HDFS blocks at
 first not filtering in memory with Spark operations), we need to implement

 PrunedFilteredScan, PrunedScan or CatalystScan in package
 org.apache.spark.sql.sources.



 For PrunedFilteredScan and PrunedScan, it passes the filter objects in package
 org.apache.spark.sql.sources, which do not access the query parser directly
 but are objects built by selectFilters() in
 org.apache.spark.sql.sources.DataSourceStrategy.

 It looks like not all the filters (or rather, raw expressions) are passed to the
 function below in PrunedFilteredScan and PrunedScan.

 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
 RDD[Row]

 The passing filters in here are defined in package
 org.apache.spark.sql.sources.

 On the other hand, it does not pass EqualNullSafe filter in package
 org.apache.spark.sql.catalyst.expressions even though this looks possible
 to pass for other datasources such as Parquet and JSON.



 I understand that CatalystScan can take all the raw expressions coming from
 the query planner. However, it is experimental and also it needs
 different interfaces (as well as being unstable for reasons such as binary
 compatibility).

 As far as I know, Parquet also does not use this.



 In general, this can be an issue when a user sends a query to the data such as:

 1.

 SELECT *
 FROM table
 WHERE field = 1;


 2.

 SELECT *
 FROM table
 WHERE field <=> 1;


 The second query can be hugely slow because of the large network traffic from
 unfiltered data coming out of the source RDD.



 Also, I could not find a proper issue for this (except for
 https://issues.apache.org/jira/browse/SPARK-8747, which says it now supports
 binary compatibility).

 Accordingly, I want to add this issue and make a pull request with my
 codes.


 Could you please make any comments for this?

 Thanks.
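
 For reference, a rough sketch of a relation implementing this interface under the 1.x external data source API; the class and data below are made up, and only GreaterThan is handled, purely to show where the pushed-down sources.Filter objects arrive:

 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Row, SQLContext}
 import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
 import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

 // Toy relation over the integers 0 until n, exposing a single column "field".
 class RangeRelation(n: Int)(@transient val sqlContext: SQLContext)
   extends BaseRelation with PrunedFilteredScan {

   override def schema: StructType = StructType(StructField("field", IntegerType) :: Nil)

   override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
     // Only predicates that survive selectFilters() show up here (EqualTo, GreaterThan, ...);
     // Spark still re-applies the original predicates on top of whatever the scan returns.
     val lower = filters.collectFirst { case GreaterThan("field", v: Int) => v }.getOrElse(-1)
     sqlContext.sparkContext.parallelize((lower + 1).max(0) until n).map(Row(_))
   }
 }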




Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi

My Spark job (running in local[*] with spark 1.4.1) reads data from a
thrift server(Created an RDD, it will compute the partitions in
getPartitions() call and in computes hasNext will return records from these
partitions), count(), foreach() is working fine it returns the correct
number of records. But whenever there is shuffleMap stage (like reduceByKey
etc.) then all the tasks are executing properly but it enters in an
infinite loop saying :


   1. 15/08/11 13:05:54 INFO DAGScheduler: Resubmitting ShuffleMapStage 1 (map
   at FilterMain.scala:59) because some of its tasks had failed: 0, 3


Here's the complete stack-trace http://pastebin.com/hyK7cG8S

What could be the root cause of this problem? I looked up and bumped into
this closed JIRA https://issues.apache.org/jira/browse/SPARK-583 (which
is very very old)




Thanks
Best Regards


Re: How to help for 1.5 release?

2015-08-04 Thread Akhil Das
I think you can start from here
https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Thanks
Best Regards

On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:

 I think the team is preparing for the 1.5 release. Anything to help with
 the QA, testing etc?

 Thanks,

 MW




Re: ReceiverStream SPARK not able to cope up with 20,000 events /sec .

2015-07-28 Thread Akhil Das
You need to find the bottleneck here; it could be your network (if the data is
huge) or your producer code not pushing at 20k/s. If you are able to
produce at 20k/s, then make sure you are able to receive at that rate (try
it without Spark).

Thanks
Best Regards

On Sat, Jul 25, 2015 at 3:29 PM, anshu shukla anshushuk...@gmail.com
wrote:

 My eventGen is emitting 20,000 events/sec, and I am using store(s1) in the
 receive() method to push data to the receiverStream.

 But this logic is working fine only up to 4000 events/sec, and no batches are
 seen emitted for larger rates.

 *CODE:TOPOLOGY -*


 *JavaDStream<String> sourcestream = ssc.receiverStream(new
 TetcCustomEventReceiver(datafilename,spoutlog,argumentClass.getScalingFactor(),datasetType));*

 *CODE:TetcCustomEventReceiver -*

 public void receive(List<String> event) {
 StringBuffer tuple=new StringBuffer();
 msgId++;
 for(String s:event)
 {
 tuple.append(s).append(,);
 }
 String s1=MsgIdAddandRemove.addMessageId(tuple.toString(),msgId);
 store(s1);
 }




 --
 Thanks & Regards,
 Anshu Shukla



Re: RestSubmissionClient Basic Auth

2015-07-16 Thread Akhil Das
You can possibly raise a JIRA ticket for feature and start working on it,
once done you can send a pull request with the code changes.

Thanks
Best Regards

On Wed, Jul 15, 2015 at 7:30 PM, Joel Zambrano jo...@microsoft.com wrote:

  Thanks Akhil! For the one where I change the rest client, how likely
 would it be that a change like that goes thru? Would it be rejected as an
 uncommon scenario? I really don't want to have this as a separate form of
 the branch.

 Thanks,
 Joel
 --
 *From:* Akhil Das ak...@sigmoidanalytics.com
 *Sent:* Wednesday, July 15, 2015 2:07:08 AM
 *To:* Joel Zambrano
 *Cc:* dev@spark.apache.org
 *Subject:* Re: RestSubmissionClient Basic Auth

   Either way is fine. Relay proxy would be much easier, adding
 authentication to the REST client would require you to rebuild and test the
 piece of code that you wrote for authentication.

  Thanks
 Best Regards

 On Wed, Jul 15, 2015 at 4:51 AM, Joel Zambrano jo...@microsoft.com
 wrote:

  Hi! We have a gateway with basic auth that relays calls to the head
 node in our cluster. Is adding support for basic auth the wrong approach?
 Should we use a relay proxy? I’ve seen the code and it would probably
 require adding a few configs and appending the header on the get and post
 request of the REST submission client.



 A best strategy would be greatly appreciated.



 Thanks,

 Joel





Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
This will get you started
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Thanks
Best Regards

On Mon, Jul 13, 2015 at 5:29 PM, srinivasraghavansr71 
sreenivas.raghav...@gmail.com wrote:

 Hello everyone,
I am interested in contributing to Apache Spark. I am more
 inclined towards algorithms and computational methods for matrices, etc. I
 took one course on edX where Spark was taught through the Python interface. So
 my doubts are as follows:

 1. Where can I start working?
 2. Language for coding - is using Python okay, or is there any fixed
 language I should use?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributiona-nd-choice-of-langauge-tp13179.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
You can try to resolve some JIRA issues; to start with, try out some newbie
JIRAs.

Thanks
Best Regards

On Tue, Jul 14, 2015 at 4:10 PM, srinivasraghavansr71 
sreenivas.raghav...@gmail.com wrote:

 I saw the contribution sections. As a new contributor, should I try to build
 patches or can I add some new algorithm to MLlib? I am comfortable with
 Python and R. Are they enough to contribute to Spark?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributiona-nd-choice-of-langauge-tp13179p13209.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Spark job hangs when History server events are written to hdfs

2015-07-08 Thread Akhil Das
Can you look in the datanode logs and see what's going on? Most likely, you
are hitting the ulimit on open file handles.

Thanks
Best Regards

On Wed, Jul 8, 2015 at 10:55 AM, Pankaj Arora pankaj.ar...@guavus.com
wrote:

  Hi,

  I am running a long-running application over YARN using Spark and I am
 facing issues while using Spark's history server when the events are
 written to HDFS. It seems to work fine for some time, and in between I see the
 following exception.

  2015-06-01 00:00:03,247 [SparkListenerBus] ERROR
 org.apache.spark.scheduler.LiveListenerBus - Listener EventLoggingListener
 threw an exception

 java.lang.reflect.InvocationTargetException

 at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at
 org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)

 at
 org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)

 at scala.Option.foreach(Option.scala:236)

 at org.apache.spark.util.FileLogger.flush(FileLogger.scala:203)

 at
 org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:90)

 at
 org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:121)

 at
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)

 at
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)

 at
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)

 at
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)

 at
 org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:66)

 at
 org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)

 at
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)

 at
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)

 at
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)

 at
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)

 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1545)

 at
 org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)

 Caused by: java.io.IOException: All datanodes 192.168.162.54:50010 are
 bad. Aborting...

 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)

 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)

 at
 org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)



  And after that this error continues to come and Spark reaches an
 unstable state where no job is able to progress.

  FYI:
 HDFS was up and running before and after this error, and on restarting the
 application it runs fine for some hours and then the same error comes again.
 Enough disk space was available on each data node.

  Any suggestion or help would be appreciated.

  Regards
 Pankaj




Re: Data interaction between various RDDs in Spark Streaming

2015-07-07 Thread Akhil Das
updateStateByKey?
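
A minimal sketch of how that could look (the socket source and names are only
illustrative, not your actual pipeline), assuming the stream carries (id, record)
pairs and the goal is to keep accumulating every record seen for an id across batches:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GroupAcrossBatches {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("group-across-batches").setMaster("local[2]") // local testing only
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/checkpoints")              // required for stateful operations

        // hypothetical source; replace with your Kafka/Flume/... input
        val events = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(id, rest) = line.split(",", 2)      // assumes "id,record" lines
          (id, rest)
        }

        // keep every record seen so far for each id, regardless of which batch it arrived in
        val byId = events.updateStateByKey[List[String]] {
          (newValues: Seq[String], state: Option[List[String]]) =>
            Some(state.getOrElse(Nil) ++ newValues)
        }

        byId.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }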

Thanks
Best Regards

On Wed, Jul 8, 2015 at 1:05 AM, swetha swethakasire...@gmail.com wrote:

 Hi,

 Suppose I want the data to be grouped by an id, say 12345, and I have a
 certain amount of data coming out of one batch for 12345 and more data
 related to 12345 coming 5 hours later; how do I group by 12345 and end up with
 a single RDD of the list?

 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Data-interaction-between-various-RDDs-in-Spark-Streaming-tp13058.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring to here? Spark can connect to
almost all of the databases out there (you just need to pass the
Input/Output Format classes, or there are also a bunch of connectors
available).
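
For example, a minimal sketch (connection string, table and bounds are made up
here) of reading a partitioned table straight out of a JDBC database with the
built-in JdbcRDD, so the data stays in the databases and never touches HDFS:

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    val sc = new SparkContext(new SparkConf().setAppName("jdbc-read"))

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "pass"),
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",  // the two ? are the partition bounds
      1L, 1000000L, 10,                                            // lower bound, upper bound, number of partitions
      rs => (rs.getLong("id"), rs.getString("payload")))

    println(rows.count())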

Thanks
Best Regards

On Fri, Jun 26, 2015 at 12:07 PM, louis.hust louis.h...@gmail.com wrote:

 Hi, all

 For now, Spark is based on Hadoop. I want to use a database cluster instead
 of Hadoop,
 so the data is distributed across the databases in the cluster.

 I want to know if Spark is suitable for this situation.

 Any idea will be appreciated!


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: About HostName display in SparkUI

2015-06-15 Thread Akhil Das
In the conf/slaves file, do you have the IP addresses or the hostnames?

Thanks
Best Regards

On Sat, Jun 13, 2015 at 9:51 PM, Sea 261810...@qq.com wrote:

 In Spark 1.4.0, I find that the Address is the IP (it was the hostname in v1.3.0).
 Why? Who changed it?




Re: Contribution

2015-06-13 Thread Akhil Das
This is a good start, if you haven't seen this already
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Thanks
Best Regards

On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 
sreenivas.raghav...@gmail.com wrote:

 Hi everyone,
  I am interested in contributing new algorithms and optimizing
 existing algorithms in the area of graph algorithms and machine learning.
 Please give me some ideas on where to start. Is it possible for me to
 introduce
 the notion of neural networks into Apache Spark?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contribution-tp12739.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: About akka used in spark

2015-06-10 Thread Akhil Das
If you look at the Maven repo, you can see it's from Typesafe only:
http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark

For sbt, you can download the sources by adding withSources() like:

libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark" withSources() withJavadoc()



Thanks
Best Regards

On Wed, Jun 10, 2015 at 11:25 AM, wangtao (A) wangtao...@huawei.com wrote:

  Hi guys,



  I see the group id of the Akka used in Spark is “org.spark-project.akka”. How does
 it differ from the Typesafe one? What is its version? And where can we
 get the source code?



 Regards.



Re: Scheduler question: stages with non-arithmetic numbering

2015-06-07 Thread Akhil Das
Are you seeing the same behavior on the driver UI (the one running on port
4040)? If you click on the stage ID header you can sort the stages based on
their IDs.

Thanks
Best Regards

On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote:

 Hi folks,

 When I look at the output logs for an iterative Spark program, I see
 that the stage IDs are not arithmetically numbered---that is, there
 are gaps between stages and I might find log information about Stage
 0, 1,2, 5, but not 3 or 4.

 As an example, the output from the Spark logs below shows what I mean:

 # grep -rE "Stage [[:digit:]]+" spark_stderr | grep finished
 12048:INFO:DAGScheduler:Stage 0 (mapPartitions at blockMap.scala:1444)
 finished in 7.820 s:
 15994:INFO:DAGScheduler:Stage 1 (map at blockMap.scala:1810) finished
 in 3.874 s:
 18291:INFO:DAGScheduler:Stage 2 (count at blockMap.scala:1179)
 finished in 2.237 s:
 20121:INFO:DAGScheduler:Stage 4 (map at blockMap.scala:1817) finished
 in 1.749 s:
 21254:INFO:DAGScheduler:Stage 5 (count at blockMap.scala:1180)
 finished in 1.082 s:
 23422:INFO:DAGScheduler:Stage 7 (map at blockMap.scala:1810) finished
 in 2.078 s:
 24773:INFO:DAGScheduler:Stage 8 (count at blockMap.scala:1188)
 finished in 1.317 s:
 26455:INFO:DAGScheduler:Stage 10 (map at blockMap.scala:1817) finished
 in 1.638 s:
 27228:INFO:DAGScheduler:Stage 11 (count at blockMap.scala:1189)
 finished in 0.732 s:
 27494:INFO:DAGScheduler:Stage 14 (foreach at blockMap.scala:1302)
 finished in 0.192 s:
 27709:INFO:DAGScheduler:Stage 17 (foreach at blockMap.scala:1302)
 finished in 0.170 s:
 28018:INFO:DAGScheduler:Stage 20 (count at blockMap.scala:1201)
 finished in 0.270 s:
 28611:INFO:DAGScheduler:Stage 23 (map at blockMap.scala:1355) finished
 in 0.455 s:
 29598:INFO:DAGScheduler:Stage 24 (count at blockMap.scala:274)
 finished in 0.928 s:
 29954:INFO:DAGScheduler:Stage 27 (map at blockMap.scala:1355) finished
 in 0.305 s:
 30390:INFO:DAGScheduler:Stage 28 (count at blockMap.scala:275)
 finished in 0.391 s:
 30452:INFO:DAGScheduler:Stage 32 (first at
 MatrixFactorizationModel.scala:60) finished in 0.028 s:
 30506:INFO:DAGScheduler:Stage 36 (first at
 MatrixFactorizationModel.scala:60) finished in 0.023 s:

 Can anyone comment on this being normal behavior? Is it indicative of
 faults causing stages to be resubmitted? I also cannot find the
 missing stages in any stage's parent List(Stage x, Stage y, ...)

 Thanks,
 Mike


 On 6/1/15, Reynold Xin r...@databricks.com wrote:
  Thanks, René. I actually added a warning to the new JDBC reader/writer
  interface for 1.4.0.
 
  Even with that, I think we should support throttling JDBC; otherwise it's
  too convenient for our users to DOS their production database servers!
 
 
/**
 * Construct a [[DataFrame]] representing the database table accessible
  via JDBC URL
 * url named table. Partitions of the table will be retrieved in
 parallel
  based on the parameters
 * passed to this function.
 *
  *   * Don't create too many partitions in parallel on a large cluster;
  otherwise Spark might crash*
  *   * your external database systems.*
 *
 * @param url JDBC database url of the form `jdbc:subprotocol:subname`
 * @param table Name of the table in the external database.
 * @param columnName the name of a column of integral type that will be
  used for partitioning.
 * @param lowerBound the minimum value of `columnName` used to decide
  partition stride
 * @param upperBound the maximum value of `columnName` used to decide
  partition stride
 * @param numPartitions the number of partitions.  the range
  `minValue`-`maxValue` will be split
 *  evenly into this many partitions
 * @param connectionProperties JDBC database connection arguments, a
 list
  of arbitrary string
 * tag/value. Normally at least a user
 and
  password property
 * should be included.
 *
 * @since 1.4.0
 */
 
 
  On Mon, Jun 1, 2015 at 1:54 AM, René Treffer rtref...@gmail.com wrote:
 
  Hi,
 
  I'm using sqlContext.jdbc(uri, table, where).map(_ =>
  1).aggregate(0)(_+_,_+_) in an interactive shell (where where is an
  Array[String] of 32 to 48 elements).  (The code is tailored to your db,
  specifically through the where conditions, I'd have otherwise posted it.)
  That should be the DataFrame API, but I'm just trying to load everything
  and discard it as soon as possible :-)
 
  (1) Never do a silent drop of the values by default: it kills
 confidence.
  An option sounds reasonable.  Some sort of insight / log would be great.
  (How many columns of what type were truncated? why?)
  Note that I could declare the field as string via JdbcDialects (thank
 you
  guys for merging that :-) ).
  I have quite bad experiences with silent drops / truncates of columns
 and
  thus _like_ the strict way of spark. It causes trouble but noticing
 later
  that your data was corrupted during 

Re: Resource usage of a spark application

2015-05-21 Thread Akhil Das
Yes Peter, that's correct; you need to identify the processes, and with that
you can pull the actual usage metrics.

Thanks
Best Regards

On Thu, May 21, 2015 at 2:52 PM, Peter Prettenhofer 
peter.prettenho...@gmail.com wrote:

 Thanks Akhil, Ryan!

  @Akhil: YARN can only tell me how many vcores my app has been granted, but
 not the actual CPU usage, right? Pulling mem/cpu usage from the OS means I need
 to map the JVM executor processes to the context they belong to, right?

 @Ryan: what a great blog post -- this is super relevant for me to analyze
 the state of the cluster as a whole. However, it seems to me that those
 metrics are mostly reported globally and not per spark application.

 2015-05-19 21:43 GMT+02:00 Ryan Williams ryan.blake.willi...@gmail.com:

 Hi Peter, a few months ago I was using MetricsSystem to export to
 Graphite and then view in Grafana; relevant scripts and some
 instructions are here
 https://github.com/hammerlab/grafana-spark-dashboards/ if you want to
 take a look.


 On Sun, May 17, 2015 at 8:48 AM Peter Prettenhofer 
 peter.prettenho...@gmail.com wrote:

 Hi all,

  I'm looking for a way to measure the current memory / CPU usage of a
 Spark application to give users feedback on how many resources are actually
 being used.
 It seems that the metrics system provides this information to some
 extent. It logs metrics at the application level (nr of cores granted) and at
 the JVM level (memory usage).
 Is this the recommended way to gather this kind of information? If so,
 how do i best map a spark application to the corresponding JVM processes?

 If not, should i rather request this information from the resource
 manager (e.g. Mesos/YARN)?

 thanks,
  Peter

 --
 Peter Prettenhofer




 --
 Peter Prettenhofer



Re: Resource usage of a spark application

2015-05-17 Thread Akhil Das
You can either pull the high-level information from your resource manager,
or, if you want more control/specific information, you can write a script and
pull the resource usage information from the OS. Something like this
http://www.itsprite.com/linux3-shell-scripts-to-monitor-the-process-resource-in-linux/
will help.

Thanks
Best Regards

On Sun, May 17, 2015 at 6:18 PM, Peter Prettenhofer 
peter.prettenho...@gmail.com wrote:

 Hi all,

 I'm looking for a way to measure the current memory / CPU usage of a Spark
 application to give users feedback on how many resources are actually being
 used.
 It seems that the metrics system provides this information to some extent.
 It logs metrics at the application level (nr of cores granted) and at the JVM
 level (memory usage).
 Is this the recommended way to gather this kind of information? If so, how
 do i best map a spark application to the corresponding JVM processes?

 If not, should i rather request this information from the resource manager
 (e.g. Mesos/YARN)?

 thanks,
  Peter

 --
 Peter Prettenhofer



Re: s3 vfs on Mesos Slaves

2015-05-13 Thread Akhil Das
Did you happen to have a look at this: https://github.com/abashev/vfs-s3

Thanks
Best Regards

On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com
wrote:

 We have a small mesos cluster and these slaves need to have a vfs setup on
 them so that the slaves can pull down the data they need from S3 when spark
 runs.

 There doesn’t seem to be any obvious way online on how to do this or how to
 easily accomplish this. Does anyone have some best practices or some ideas
 about how to accomplish this?

 An example stack trace when a job is ran on the mesos cluster…

 Any idea how to get this going? Like somehow bootstrapping spark on run or
 something?

 Thanks,
 Steve


 java.io.IOException: Unsupported scheme s3n for URI s3n://removed
 at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
 at
 com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
 at
 com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
 at
 com.coldlight.neuron.data.ClquetPartitionedData$Iter.init(ClquetPartitionedData.java:330)
 at
 com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID
 1)
 java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n
 for URI s3n://removed
 at
 com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: Unsupported scheme s3n for URI
 s3n://removed
 at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
 at
 com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
 at
 com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
 at
 com.coldlight.neuron.data.ClquetPartitionedData$Iter.init(ClquetPartitionedData.java:330)
 at
 com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
 ... 8 more




Re: Getting Access is denied error while cloning Spark source using Eclipse

2015-05-12 Thread Akhil Das
Maybe you should check where exactly it's throwing the permission denied error
(possibly while trying to write to some directory). Also, you can try manually
cloning the git repo to a directory and then opening that in Eclipse.

Thanks
Best Regards

On Tue, May 12, 2015 at 3:46 PM, Chandrashekhar Kotekar 
shekhar.kote...@gmail.com wrote:

 Hi,

 I am trying to clone the Spark source using Eclipse. After providing the Spark
 source URL, Eclipse downloads some code which I can see in the download
 location, but as soon as downloading reaches 99% Eclipse throws a "Git
 repository clone failed. Access is denied" error.

 Has anyone encountered such a problem? I want to contribute to the Apache Spark
 source code and I am a newbie, first time trying to contribute to an open source
 project. Can anyone please help me in solving this error?

 Regards,
 Chandrash3khar Kotekar
 Mobile - +91 8600011455



Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Akhil Das
Looks like the jar you provided has some missing classes. Try this:

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0",
  "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
  "log4j" % "log4j" % "1.2.15" excludeAll(
    ExclusionRule(organization = "com.sun.jdmk"),
    ExclusionRule(organization = "com.sun.jmx"),
    ExclusionRule(organization = "javax.jms")
  )
)


Thanks
Best Regards

On Thu, May 7, 2015 at 11:28 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 Hi all – I’m attempting to build a project with SBT and run it on Spark
 1.3 (this previously worked before we upgraded to CDH 5.4 with Spark 1.3).

 I have the following in my build.sbt:


 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
   "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
   "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
   "log4j" % "log4j" % "1.2.15" excludeAll(
     ExclusionRule(organization = "com.sun.jdmk"),
     ExclusionRule(organization = "com.sun.jmx"),
     ExclusionRule(organization = "javax.jms")
   )
 )

 When I attempt to run this program with sbt run, however, I get the
 following error:
 java.lang.NoClassDefFoundError: org.apache.spark.Partitioner

 I don’t explicitly use the Partitioner class anywhere, and this seems to
 indicate some missing Spark libraries on the install. Do I need to confirm
 anything other than the presence of the Spark assembly? I’m on CDH 5.4 and
 I’m able to run the spark-shell without any trouble.

 Any help would be much appreciated.

 Thank you,
 Ilya Ganelin


 --




Re: Back-pressure for Spark Streaming

2015-05-08 Thread Akhil Das
We had a similar issue while working on one of our use cases where we were
processing at a moderate throughput (around 500 MB/s). When the processing
time exceeded the batch duration, it started to throw block-not-found
exceptions; I made a workaround for that issue, which is explained over here:
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkStreaming-Workaround-for-BlockNotFound-Exceptions-td12096.html

Basically, instead of generating blocks blindly, I made the receiver sleep
if there's an increase in the scheduling delay (if the scheduling delay exceeds
3 times the batch duration). This prototype is working nicely and the speed
is encouraging, as it's processing at 500 MB/s without having any failures so
far.


Thanks
Best Regards

On Fri, May 8, 2015 at 8:11 PM, François Garillot 
francois.garil...@typesafe.com wrote:

 Hi guys,

 We[1] are doing a bit of work on Spark Streaming, to help it face
 situations where the throughput of data on an InputStream is (momentarily)
 liable to overwhelm the Receiver(s)' memory.

 The JIRA / design doc is here:
 https://issues.apache.org/jira/browse/SPARK-7398

 We'd sure appreciate your comments !

 --
 François Garillot
 [1]: Typesafe  some helpful collaborators on benchmarking 'at scale'



SparkStreaming Workaround for BlockNotFound Exceptions

2015-05-07 Thread Akhil Das
Hi

With Spark Streaming (all versions), when my processing delay (around 2-4
seconds) exceeds the batch duration (being 1 second), and at a decent
scale/throughput (consuming around 100 MB/s on a 1+2 node standalone cluster, 15 GB and 4
cores each), the job will start to throw block-not-found exceptions when the
storage is set to MEMORY_ONLY (ensureFreeSpace
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L444
drops
blocks blindly). When I use MEMORY_AND_DISK* as the StorageLevel, the
performance goes down drastically and the receivers end up doing a lot of
disk IO.

So, sticking with MEMORY_ONLY as the StorageLevel, the workaround to get rid of
the block-not-found exceptions was to tell the receiver not to generate
more blocks while there are blocks which are yet to be computed.

To achieve this, I used Spark 1.3.1 with the low-level Kafka consumer
https://github.com/dibbhatt/kafka-spark-consumer, and inside my job's
onBatchCompleted I pushed the scheduling delay to ZooKeeper, like:

[image: Inline image 1]
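
Roughly, that snippet amounts to something like the sketch below (the znode path
and the zkWrite helper here are placeholders, not Spark or ZooKeeper APIs):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // publishes each completed batch's scheduling delay so the receiver can read it back
    class DelayPublisher(zkWrite: (String, String) => Unit) extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val delayMs = batchCompleted.batchInfo.schedulingDelay.getOrElse(0L)
        zkWrite("/consumers/my-app/scheduling-delay", delayMs.toString)  // hypothetical znode path
      }
    }

    // registered on the driver: ssc.addStreamingListener(new DelayPublisher(writeToZk))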


And on the receiver end, if there's a scheduling delay, then it will simply
sleep for that much time without sending any blocks to the streaming
receiver, like:
[image: Inline image 2]
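
And the receiver side is essentially the following sketch (again, zkRead is a
placeholder for however the published delay is read back, not a real API):

    // called before the receiver generates/stores the next block
    def maybeThrottle(zkRead: String => Option[String], batchDurationMs: Long): Unit = {
      val delayMs = zkRead("/consumers/my-app/scheduling-delay").map(_.toLong).getOrElse(0L)
      if (delayMs > 0) Thread.sleep(delayMs)   // or only when delayMs > 3 * batchDurationMs
    }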


I could also add a condition there to stop generating blocks only when the scheduling
delay exceeds 2-3 times the batch duration, instead of making it
sleep for whatever the scheduling delay happens to be.

With this, the only problem I'm having is that some batches have empty data, as
the receiver went to sleep for those batches. Everything else works nicely
at scale and the block-not-found exceptions are totally gone.

Please let me know your thoughts on this: can we generalize this for Kafka
receivers with Spark Streaming? Is it possible to apply this (stopping the
receiver from generating blocks) to all sorts of receivers?


Thanks
Best Regards


Re: java.lang.StackOverflowError when recovery from checkpoint in Streaming

2015-04-28 Thread Akhil Das
There's a similar issue reported over here
https://issues.apache.org/jira/browse/SPARK-6847

Thanks
Best Regards

On Tue, Apr 28, 2015 at 7:35 AM, wyphao.2007 wyphao.2...@163.com wrote:

  Hi everyone, I am using

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder](ssc, kafkaParams, topicsSet)

 to read data from Kafka (1k/second) and store the data in windows; the code snippet is as
 follows:

    val windowedStreamChannel =
      streamChannel.combineByKey[TreeSet[Obj]](TreeSet[Obj](_), _ += _, _ ++= _,
          new HashPartitioner(numPartition))
        .reduceByKeyAndWindow((x: TreeSet[Obj], y: TreeSet[Obj]) => x ++= y,
          (x: TreeSet[Obj], y: TreeSet[Obj]) => x --= y, Minutes(60),
          Seconds(2), numPartition,
          (item: (String, TreeSet[Obj])) => item._2.size != 0)

 After the application ran for an hour, I killed the application and restarted it from the
 checkpoint directory, but I encountered an exception:

 2015-04-27 17:52:40,955 INFO  [Driver] - Slicing from 1430126222000 ms to
 1430126222000 ms (aligned to 1430126222000 ms and 1430126222000 ms)
 2015-04-27 17:52:40,958 ERROR [Driver] - User class threw exception: null
 java.lang.StackOverflowError
 at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
 at
 java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
 at java.io.File.exists(File.java:813)
 at
 sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080)
 at sun.misc.URLClassPath.getResource(URLClassPath.java:199)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:358)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:190)
 at
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
 at org.apache.spark.rdd.RDD.filter(RDD.scala:303)
 at
 org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
 at scala.Option.map(Option.scala:145)
 at
 org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 

Re: Contributing Documentation Changes

2015-04-25 Thread Akhil Das
I also want to add mine :/
Everyone wants to add it seems.

Thanks
Best Regards

On Fri, Apr 24, 2015 at 8:58 PM, madhu phatak phatak@gmail.com wrote:

 Hi,
 I understand that. The following page

 http://spark.apache.org/documentation.html has an "external tutorials, blogs"
 section which points to other blog pages. I wanted to add mine there.




 Regards,
 Madhukara Phatak
 http://datamantra.io/

 On Fri, Apr 24, 2015 at 5:17 PM, Sean Owen so...@cloudera.com wrote:

  I think that your own tutorials and such should live on your blog. The
  goal isn't to pull in a bunch of external docs to the site.
 
  On Fri, Apr 24, 2015 at 12:57 AM, madhu phatak phatak@gmail.com
  wrote:
   Hi,
As I was reading contributing to Spark wiki, it was mentioned that we
  can
   contribute external links to spark tutorials. I have written many
   http://blog.madhukaraphatak.com/categories/spark/ of them in my
 blog.
  It
   will be great if someone can add it to the spark website.
  
  
  
   Regards,
   Madhukara Phatak
   http://datamantra.io/
 



Re: Graphical display of metrics on application UI page

2015-04-22 Thread Akhil Das
There were some PRs about graphical representation with D3.js; you can
possibly see them on GitHub. Here are a few of them:
https://github.com/apache/spark/pulls?utf8=%E2%9C%93q=d3

Thanks
Best Regards

On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com
wrote:

 Dear Spark devs,

 Would people find it useful to have a graphical display of metrics (such as
 duration, GC time, etc) on the application UI page? Has anybody worked on
 this before?

 Punya



Re: How to use Spark Streaming .jar file that I've built using a different branch than master?

2015-04-20 Thread Akhil Das
I think you can override the SPARK_CLASSPATH with your newly built jar.

Thanks
Best Regards

On Mon, Apr 20, 2015 at 2:28 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello,

 I'm building a different version of Spark Streaming (based on a different
 branch than master) in my application for testing purposes, but it seems
 like spark-submit is ignoring my newly built Spark Streaming .jar, and
 using an older version.

 Here's some context:

 I'm on a different branch:

 $ git branch
 * SPARK-3276
   master

 Then I build the Spark Streaming that I've changed:

 ✔ ~/code/spark [SPARK-3276 L|✚ 1]
 $ mvn --projects streaming/ -DskipTests install

 it builds without problems, and then when I check my local Maven
 repository, I see that I have newly generated Spark Streaming jars:

 $ ls -lh
 ~/.m2/repository/org/apache/spark/spark-streaming_2.10/1.4.0-SNAPSHOT/
 total 3.3M
 -rw-rw-r-- 1 emre emre 1.6K Apr 20 10:43 maven-metadata-local.xml
 -rw-rw-r-- 1 emre emre  421 Apr 20 10:43 _remote.repositories
 -rw-rw-r-- 1 emre emre 1.3M Apr 20 10:42
 spark-streaming_2.10-1.4.0-SNAPSHOT.jar
 -rw-rw-r-- 1 emre emre 622K Apr 20 10:43
 spark-streaming_2.10-1.4.0-SNAPSHOT-javadoc.jar
 -rw-rw-r-- 1 emre emre 6.7K Apr 20 10:42
 spark-streaming_2.10-1.4.0-SNAPSHOT.pom
 -rw-rw-r-- 1 emre emre 181K Apr 20 10:42
 spark-streaming_2.10-1.4.0-SNAPSHOT-sources.jar
 -rw-rw-r-- 1 emre emre 1.2M Apr 20 10:42
 spark-streaming_2.10-1.4.0-SNAPSHOT-tests.jar
 -rw-rw-r-- 1 emre emre  82K Apr 20 10:42
 spark-streaming_2.10-1.4.0-SNAPSHOT-test-sources.jar

 Then I build and run an application (in Java) that uses Spark Streaming. In
 that test project's pom.xml I have

 ...
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.4.0</hadoop.version>
    <spark.version>1.4.0-SNAPSHOT</spark.version>
  </properties>
 ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>


 And then I use

   ~/code/spark/bin/spark-submit

 to submit my application. It starts fine, and continues to run on my local
 filesystem but when I check the log messages on the console, I don't see
 the changes I have made, and I *did* make changes, e.g. changed some
 logging messages. It is like when I submit my application, it is not using
 the Spark Streaming from *branch SPARK-3276* but from the master branch.

 Any ideas what might be causing this? Is there some form of caching? Or is
 spark-submit using a different .jar for streaming? (Where?)

 How can I see the effects of my changes that I did to Spark Streaming in my
 SPARK-3276 branch?

 --
 Emre Sevinç



Re: Using Spark with a SOCKS proxy

2015-03-18 Thread Akhil Das
Did you try ssh tunneling instead of SOCKS?

Thanks
Best Regards

On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com
wrote:

  I'm trying to figure out how I might be able to use Spark with a SOCKS
 proxy.  That is, my dream is to be able to write code in my IDE then run it
 without much trouble on a remote cluster, accessible only via a SOCKS proxy
 between the local development machine and the master node of the
 cluster (ignoring, for now, any dependencies that would need to be
 transferred--assume it's a very simple app with no dependencies that aren't
 part of the Spark classpath on the cluster).  This is possible with Hadoop
 by setting hadoop.rpc.socket.factory.class.default to
 org.apache.hadoop.net.SocksSocketFactory and hadoop.socks.server to
 localhost:port on which a SOCKS proxy has been opened via ssh -D to the
 master node.  However, I can't seem to find anything like this for Spark,
 and I only see very few mentions of it on the user list and on
 stackoverflow, with no real answers.  (See links below.)

  I thought I might be able to use the JVM's -DsocksProxyHost and
 -DsocksProxyPort system properties, but it still does not seem to work.
 That is, if I start a SOCKS proxy to my master node using something like
 ssh -D 2600 master node public name then run a simple Spark app that
 calls SparkConf.setMaster(spark://master node private IP:7077), passing
 in JVM args of -DsocksProxyHost=locahost -DsocksProxyPort=2600, the
 driver hangs for a while before finally giving up (Application has been
 killed. Reason: All masters are unresponsive! Giving up.).  It seems like
 it is not even attempting to use the SOCKS proxy.  Do
 -DsocksProxyHost/-DsocksProxyPort not even work for Spark?


 http://stackoverflow.com/questions/28047000/connect-to-spark-through-a-socks-proxy
  (unanswered
 similar question from somebody else about a month ago)
 https://issues.apache.org/jira/browse/SPARK-5004 (unresolved, somewhat
 related JIRA from a few months ago)

  Thanks,
  Jonathan



Re: Loading previously serialized object to Spark

2015-03-08 Thread Akhil Das
Can you paste the complete code?

Thanks
Best Regards

On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi,

 I've implemented a class MyClass in MLlib that does some operation on
 LabeledPoint. MyClass extends Serializable, so I can map this operation on
 data of RDD[LabeledPoint], such as data.map(lp => MyClass.operate(lp)). I
 write this class to a file with ObjectOutputStream.writeObject. Then I stop
 and restart Spark. I load this class from the file with
 ObjectInputStream.readObject.asInstanceOf[MyClass]. When I try to map the
 same operation of this class to an RDD, Spark throws a not-serializable
 exception:
 org.apache.spark.SparkException: Task not serializable
 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
 at org.apache.spark.rdd.RDD.map(RDD.scala:273)

 Could you suggest why it throws this exception while MyClass is
 serializable by definition?

 Best regards, Alexander



Re: Pull Requests on github

2015-02-08 Thread Akhil Das
You can open a JIRA issue pointing to this PR to get it processed faster. :)

Thanks
Best Regards

On Sat, Feb 7, 2015 at 7:07 AM, fommil sam.halli...@gmail.com wrote:

 Hi all,

 I'm the author of netlib-java and I noticed that the documentation in MLlib
 was out of date and misleading, so I submitted a pull request on github
 which will hopefully make things easier for everybody to understand the
 benefits of system optimised natives and how to use them :-)

   https://github.com/apache/spark/pull/4448

 However, it looks like there are a *lot* of outstanding PRs and that this
 is
 just a mirror repository.

 Will somebody please look at my PR and merge into the canonical source (and
 let me know)?

 Best regards,
 Sam



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Pull-Requests-on-github-tp10502.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Broken record a bit here: building spark on intellij with sbt

2015-02-04 Thread Akhil Das
Here's the sbt version
https://docs.sigmoidanalytics.com/index.php/Step_by_Step_instructions_on_how_to_build_Spark_App_with_IntelliJ_IDEA


Thanks
Best Regards

On Thu, Feb 5, 2015 at 8:55 AM, Stephen Boesch java...@gmail.com wrote:

 For building in intellij with sbt my mileage has varied widely: it had
 built as late as Monday (after the 1.3.0 release)  - and with zero
 'special' steps: just import as sbt project.

  However I can not presently repeat the process.  The wiki page has the
 latest instructions on how to build with maven - but not with sbt.  Is
 there a resource for that?


 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IntelliJ

 The error I see is same as from a post in July


 http://apache-spark-user-list.1001560.n3.nabble.com/Build-spark-with-Intellij-IDEA-13-td9904.html

 Here is an excerpt:

 uncaught exception during compilation: java.lang.AssertionError
 Error:scalac: Error: assertion failed:
 com.google.protobuf.InvalidProtocolBufferException
 java.lang.AssertionError: assertion failed:
 com.google.protobuf.InvalidProtocolBufferException
 at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1212)

 The answer in the mailing list to that thread was about using maven .. so
 that is not useful here.



Re: Memory config issues

2015-01-18 Thread Akhil Das
It's the executor memory (spark.executor.memory), which you can set while
creating the Spark context. By default it uses 60% of the executor memory
(spark.storage.memoryFraction = 0.6) for storage. Now, to show some memory
usage, you need to cache (persist) the RDD. Regarding the OOM exception, you
can increase the level of parallelism (also you can increase the number of
partitions depending on your data size) and it should be fine.
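
A sketch of those knobs together (the values are only examples, not
recommendations for your cluster):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("groupby-job")
      .set("spark.executor.memory", "16g")          // heap per executor
      .set("spark.storage.memoryFraction", "0.6")   // share of that heap reserved for cached RDDs (default 0.6)
      .set("spark.default.parallelism", "200")      // more tasks => smaller per-task footprint

    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs:///data/input", 200)  // explicit partition count
    data.persist(StorageLevel.MEMORY_ONLY)             // only persisted RDDs show up as memory used in the UI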

Thanks
Best Regards

On Mon, Jan 19, 2015 at 11:36 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

  All,

 I'm getting out of memory exceptions in SparkSQL GROUP BY queries. I have
 plenty of RAM, so I should be able to brute-force my way through, but I
 can't quite figure out what memory option affects what process.

 My current memory configuration is the following:
 export SPARK_WORKER_MEMORY=83971m
 export SPARK_DAEMON_MEMORY=15744m

 What does each of these config options do exactly?

 Also, how come the executors page of the web UI shows no memory usage:

 0.0 B / 42.4 GB

 And where does 42.4 GB come from?

 Alex



Re: Bouncing Mails

2015-01-17 Thread Akhil Das
Yep. They have sorted it out it seems.
On 18 Jan 2015 03:58, Patrick Wendell pwend...@gmail.com wrote:

 Akhil,

 Those are handled by ASF infrastructure, not anyone in the Spark
 project. So this list is not the appropriate place to ask for help.

 - Patrick

 On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:
  My mails to the mailing list are getting rejected, have opened a Jira
 issue,
  can someone take a look at it?
 
  https://issues.apache.org/jira/browse/INFRA-9032
 
 
 
 
 
 
  Thanks
  Best Regards



Bouncing Mails

2015-01-17 Thread Akhil Das
My mails to the mailing list are getting rejected, have opened a Jira
issue, can someone take a look at it?

https://issues.apache.org/jira/browse/INFRA-9032






Thanks
Best Regards


Re: Apache Spark client high availability

2015-01-12 Thread Akhil Das
We usually run Spark in HA with the following stack:

- Apache Mesos
- Marathon - init/control system for starting, stopping, and maintaining
always-on applications (mainly Spark Streaming)
- Chronos - general-purpose scheduler for Mesos, supports job dependency
graphs
- Spark Job Server - primarily for its ability to reuse shared contexts
with multiple jobs

This thread has a better discussion:
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-run-your-spark-app-td7935.html


Thanks
Best Regards

On Mon, Jan 12, 2015 at 10:08 PM, preeze etan...@gmail.com wrote:

 Dear community,

 I've been searching the internet for quite a while to find out what is the
 best architecture to support HA for a spark client.

 We run an application that connects to a standalone Spark cluster and
 caches
 a big chunk of data for subsequent intensive calculations. To achieve HA
 we'll need to run several instances of the application on different hosts.

 Initially I explored the option to reuse (i.e. share) the same executors
 set
 between SparkContext instances of all running applications. Found it
 impossible.

 So, every application, which creates an instance of SparkContext, has to
 spawn its own executors. Externalizing and sharing executors' memory cache
 with Tachyon is a semi-solution since each application's executors will
 keep
 using their own set of CPU cores.

 Spark-jobserver is another possibility. It manages SparkContext itself and
 accepts job requests from multiple clients for the same context which is
 brilliant. However, this becomes a new single point of failure.

 Now I am exploring if it's possible to run the Spark cluster in YARN
 cluster
 mode and connect to the driver from multiple clients.

 Is there anything I am missing guys?
 Any suggestion is highly appreciated!



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-client-high-availability-tp10088.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Reading Data Using TextFileStream

2015-01-06 Thread Akhil Das
I think you need to start your streaming job first, then put the files there to
get them read; textFileStream doesn't read pre-existing files, I believe.

Also, are you sure the path is not missing a / at the beginning, i.e. the following?

JavaDStream<String> textStream = ssc.textFileStream("/user/huser/user/huser/flume");
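
For reference, the same pattern as a minimal Scala sketch (start the context
first and then drop new files into the directory; also note that Duration
takes milliseconds, so Seconds(2) rather than Duration(2) gives a 2-second batch):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("file-stream").setMaster("local[2]")  // local testing only
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.textFileStream("/user/huser/user/huser/flume")  // only files created after start() are read
    lines.print()
    ssc.start()
    ssc.awaitTermination()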


Thanks
Best Regards

On Wed, Jan 7, 2015 at 9:16 AM, Jeniba Johnson 
jeniba.john...@lntinfotech.com wrote:


 Hi Hari,

 I am trying to read data from a file which is stored in HDFS. Using Flume,
 the data is tailed and stored in HDFS.
 Now I want to read this data using textFileStream. Using the below-mentioned
 code I am not able to fetch the
 data from a file which is stored in HDFS. Can anyone help me with this
 issue?

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.FlatMapFunction;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.streaming.Duration;
 import org.apache.spark.streaming.api.java.JavaDStream;
 import org.apache.spark.streaming.api.java.JavaStreamingContext;

 import com.google.common.collect.Lists;

 import java.util.Arrays;
 import java.util.List;
 import java.util.regex.Pattern;

 public final class Test1 {
   public static void main(String[] args) throws Exception {

     SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
     JavaStreamingContext ssc =
         new JavaStreamingContext("local[4]", "JavaWordCount", new Duration(2));

     JavaDStream<String> textStream =
         ssc.textFileStream("user/huser/user/huser/flume"); // Data directory path in HDFS

     JavaDStream<String> suspectedStream = textStream.flatMap(
         new FlatMapFunction<String, String>() {
           public Iterable<String> call(String line) throws Exception {
             // return Arrays.asList(line.toString().toString());
             return Lists.newArrayList(line.toString().toString());
           }
         });

     suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
       public Void call(JavaRDD<String> rdd) throws Exception {
         List<String> output = rdd.collect();
         System.out.println("Sentences Collected from Flume " + output);
         return null;
       }
     });

     suspectedStream.print();

     System.out.println("Welcome TO Flume Streaming");
     ssc.start();
     ssc.awaitTermination();
   }
 }

 The command I use is:
 ./bin/spark-submit --verbose --jars
 lib/spark-examples-1.1.0-hadoop1.0.4.jar,lib/mysql.jar --master local[*]
 --deploy-mode client --class xyz.Test1 bin/filestream3.jar





 Regards,
 Jeniba Johnson


 



Re: 1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-14 Thread Akhil Das
It shows a NullPointerException; could your data be corrupted? Try putting a
try/catch inside the operation that you are doing. Are you running the
worker process on the master node also? If not, then only 1 node will be
doing the processing. If yes, then try setting the level of parallelism and
the number of partitions while creating/transforming the RDD.
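
A sketch of that last point (the numbers are only illustrative; a common rule
of thumb is 2-4 partitions per core across the cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("one-gb-job")
      .set("spark.default.parallelism", "32")              // e.g. 2 nodes x 8 cores x 2

    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs:///data/one-gb-file", 32) // minPartitions spreads the file across both nodes
    // or, for an RDD that already exists with too few partitions:
    val reparted = lines.repartition(32)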

Thanks
Best Regards

On Fri, Nov 14, 2014 at 5:17 PM, Priya Ch learnings.chitt...@gmail.com
wrote:

 Hi All,

   We have set up a 2-node cluster (NODE-DSRV05 and NODE-DSRV02), each
 having 32 GB RAM, 1 TB hard disk capacity and 8 CPU cores. We have set
 up HDFS, which has 2 TB capacity, and the block size is 256 MB. When we try
 to process a 1 GB file on Spark, we see the following exception:

 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 0.0 in
 stage 0.0 (TID 0, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 1.0 in
 stage 0.0 (TID 1, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 2.0 in
 stage 0.0 (TID 2, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:43 INFO cluster.SparkDeploySchedulerBackend: Registered
 executor: 
 Actor[akka.tcp://sparkExecutor@IMPETUS-DSRV02:41124/user/Executor#539551156]
 with ID 0
 14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
 manager NODE-DSRV05.impetus.co.in:60432 with 2.1 GB RAM
 14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
 manager NODE-DSRV02:47844 with 2.1 GB RAM
 14/11/14 17:01:43 INFO network.ConnectionManager: Accepted connection from
 [NODE-DSRV05.impetus.co.in/192.168.145.195:51447]
 14/11/14 17:01:43 INFO network.SendingConnection: Initiating connection to
 [NODE-DSRV05.impetus.co.in/192.168.145.195:60432]
 14/11/14 17:01:43 INFO network.SendingConnection: Connected to [
 NODE-DSRV05.impetus.co.in/192.168.145.195:60432], 1 messages pending
 14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_1_piece0
 in memory on NODE-DSRV05.impetus.co.in:60432 (size: 17.1 KB, free: 2.1 GB)
 14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
 in memory on NODE-DSRV05.impetus.co.in:60432 (size: 14.1 KB, free: 2.1 GB)
 14/11/14 17:01:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
 0.0 (TID 0, NODE-DSRV05.impetus.co.in): java.lang.NullPointerException:
 org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
 org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)

 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 java.lang.Thread.run(Thread.java:722)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.1 in
 stage 0.0 (TID 3, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 1.0 in stage
 0.0 (TID 1) on executor NODE-DSRV05.impetus.co.in:
 java.lang.NullPointerException (null) [duplicate 1]
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.0 in stage
 0.0 (TID 2) on executor NODE-DSRV05.impetus.co.in:
 java.lang.NullPointerException (null) [duplicate 2]
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.1 in
 stage 0.0 (TID 4, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 1.1 in
 stage 0.0 (TID 5, NODE-DSRV02, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 0.1 in stage
 0.0 (TID 3) on executor NODE-DSRV05.impetus.co.in:
 java.lang.NullPointerException (null) [duplicate 3]
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.2 in
 stage 0.0 (TID 6, NODE-DSRV02, NODE_LOCAL, 1667 bytes)
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.1 in stage
 0.0 (TID 4) on executor NODE-DSRV05.impetus.co.in:
 java.lang.NullPointerException (null) [duplicate 4]
 14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.2 in
 stage 0.0 (TID 7, NODE-DSRV02, NODE_LOCAL, 1667 bytes)


 What I see is, it couldn't launch tasks on NODE-DSRV05 and is processing on a
 single node, i.e. NODE-DSRV02. When we tried with 360 MB of data, I don't see
 any exception but the entire processing is done by only one node. I 

Re: java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-23 Thread Akhil Das
Hi Prabeesh,

Do an export _JAVA_OPTIONS=-Xmx10g before starting Shark. Also, you can
do a ps aux | grep shark and see how much memory it has been allocated;
most likely it will be 512 MB, in which case increase the limit.

Thanks
Best Regards


On Fri, May 23, 2014 at 10:22 AM, prabeesh k prabsma...@gmail.com wrote:


 Hi,

 I am trying to apply an inner join in Shark using 64 MB and 27 MB files. I am
 able to run the following queries on Mesos:


- SELECT * FROM geoLocation1 



-  SELECT * FROM geoLocation1  WHERE  country =  'US' 


 But while trying an inner join such as

  SELECT * FROM geoLocation1 g1 INNER JOIN geoBlocks1 g2 ON (g1.locId =
 g2.locId)



 I am getting the following error:


 Exception in thread main org.apache.spark.SparkException: Job aborted:
 Task 1.0:7 failed 4 times (most recent failure: Exception failure:
 java.lang.OutOfMemoryError: Java heap space)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
  at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
  at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Please help me to resolve this.

 Thanks in adv

 regards,
 prabeesh