Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Kartik Mathur
Yes, you are right: I initially started the workers from the master node. What I am
interested in knowing is what happened suddenly after 2 days that made the workers
die. Is it possible that the workers got disconnected because of some network
issue and then tried restarting themselves but kept failing?

On Sun, Feb 14, 2016 at 11:21 PM, Prabhu Joseph 
wrote:

> Kartik,
>
>  Spark Workers won't start if SPARK_MASTER_IP is wrong, maybe you
> would have used start_slaves.sh from Master node to start all worker nodes,
> where Workers would have got correct SPARK_MASTER_IP initially. Later any
> restart from slave nodes would have failed because of wrong SPARK_MASTER_IP
> at worker nodes.
>
>Check the logs of other workers running to see what SPARK_MASTER_IP it
> has connected, I don't think it is using a wrong Master IP.
>
>
> Thanks,
> Prabhu Joseph
>
> On Mon, Feb 15, 2016 at 12:34 PM, Kartik Mathur 
> wrote:
>
>> Thanks Prabhu ,
>>
>> I had wrongly configured spark_master_ip in worker nodes to `hostname -f`
>> which is the worker and not master ,
>>
>> but now the question is *why the cluster was up initially for 2 days*
>> and workers realized of this invalid configuration after 2 days ? And why
>> other workers are still up even through they have the same setting ?
>>
>> Really appreciate your help
>>
>> Thanks,
>> Kartik
>>
>> On Sun, Feb 14, 2016 at 10:53 PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Kartik,
>>>
>>>The exception stack trace
>>> *java.util.concurrent.RejectedExecutionException* will happen if
>>> SPARK_MASTER_IP in worker nodes are configured wrongly like if
>>> SPARK_MASTER_IP is a hostname of Master Node and workers trying to connect
>>> to IP of master node. Check whether SPARK_MASTER_IP in Worker nodes are
>>> exactly the same as what Spark Master GUI shows.
>>>
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>> On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur 
>>> wrote:
>>>
 on spark 1.5.2
 I have a spark standalone cluster with 6 workers , I left the cluster
 idle for 3 days and after 3 days I saw only 4 workers on the spark master
 UI , 2 workers died with the same exception -

 Strange part is cluster was running stable for 2 days but on third day
 2 workers abruptly died . I am see this error in one of the affected worker
 . No job ran for 2 days.



 2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed!
 Waiting for master to reconnect...2016-02-14 01:12:59 ERROR Worker:75 -
 Connection to master failed! Waiting for master to reconnect...2016-02-14
 01:13:10 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in
 thread
 Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]java.util.concurrent.RejectedExecutionException:
 Task java.util.concurrent.FutureTask@514b13ad rejected from
 java.util.concurrent.ThreadPoolExecutor@17f8ec8d[Running, pool size =
 1, active threads = 1, queued tasks = 0, completed tasks = 3]at
 java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
at
 java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at
 java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
at
 org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
  at 
 org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
at
 org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
at 
 org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at 
 org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
Kartik,

 Spark workers won't start if SPARK_MASTER_IP is wrong. You probably used
start-slaves.sh from the master node to start all the worker nodes, so the workers
picked up the correct SPARK_MASTER_IP initially. Any later restart from the slave
nodes themselves would then have failed because of the wrong SPARK_MASTER_IP
configured on the workers.

   Check the logs of the other workers that are still running to see which
SPARK_MASTER_IP they connected to; I don't think they are using a wrong master IP.


Thanks,
Prabhu Joseph

On Mon, Feb 15, 2016 at 12:34 PM, Kartik Mathur  wrote:

> Thanks Prabhu ,
>
> I had wrongly configured spark_master_ip in worker nodes to `hostname -f`
> which is the worker and not master ,
>
> but now the question is *why the cluster was up initially for 2 days* and
> workers realized of this invalid configuration after 2 days ? And why other
> workers are still up even through they have the same setting ?
>
> Really appreciate your help
>
> Thanks,
> Kartik
>
> On Sun, Feb 14, 2016 at 10:53 PM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> Kartik,
>>
>>The exception stack trace
>> *java.util.concurrent.RejectedExecutionException* will happen if
>> SPARK_MASTER_IP in worker nodes are configured wrongly like if
>> SPARK_MASTER_IP is a hostname of Master Node and workers trying to connect
>> to IP of master node. Check whether SPARK_MASTER_IP in Worker nodes are
>> exactly the same as what Spark Master GUI shows.
>>
>>
>> Thanks,
>> Prabhu Joseph
>>
>> On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur 
>> wrote:
>>
>>> on spark 1.5.2
>>> I have a spark standalone cluster with 6 workers , I left the cluster
>>> idle for 3 days and after 3 days I saw only 4 workers on the spark master
>>> UI , 2 workers died with the same exception -
>>>
>>> Strange part is cluster was running stable for 2 days but on third day 2
>>> workers abruptly died . I am see this error in one of the affected worker .
>>> No job ran for 2 days.
>>>
>>>
>>>
>>> 2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed!
>>> Waiting for master to reconnect...2016-02-14 01:12:59 ERROR Worker:75 -
>>> Connection to master failed! Waiting for master to reconnect...2016-02-14
>>> 01:13:10 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in
>>> thread
>>> Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]java.util.concurrent.RejectedExecutionException:
>>> Task java.util.concurrent.FutureTask@514b13ad rejected from
>>> java.util.concurrent.ThreadPoolExecutor@17f8ec8d[Running, pool size =
>>> 1, active threads = 1, queued tasks = 0, completed tasks = 3]at
>>> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>>>at
>>> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>>>at
>>> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>>>at
>>> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>>>at
>>> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)
>>>at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
>>>  at 
>>> org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
>>>at
>>> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
>>>at 
>>> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
>>>at
>>> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
>>>at 
>>> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
>>>at
>>> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
>>>at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>>at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>>at
>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>  at
>>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>>at akka.actor.Actor$class.aroundReceive(Actor.scala:467)at
>>> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
>>>at 

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Kartik Mathur
Thanks Prabhu,

I had wrongly configured spark_master_ip on the worker nodes to `hostname -f`,
which resolves to the worker and not the master.

But now the question is *why the cluster was up initially for 2 days* and the
workers only noticed this invalid configuration after 2 days? And why are the
other workers still up even though they have the same setting?

Really appreciate your help

Thanks,
Kartik

On Sun, Feb 14, 2016 at 10:53 PM, Prabhu Joseph 
wrote:

> Kartik,
>
>The exception stack trace
> *java.util.concurrent.RejectedExecutionException* will happen if
> SPARK_MASTER_IP in worker nodes are configured wrongly like if
> SPARK_MASTER_IP is a hostname of Master Node and workers trying to connect
> to IP of master node. Check whether SPARK_MASTER_IP in Worker nodes are
> exactly the same as what Spark Master GUI shows.
>
>
> Thanks,
> Prabhu Joseph
>
> On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur 
> wrote:
>
>> on spark 1.5.2
>> I have a spark standalone cluster with 6 workers , I left the cluster
>> idle for 3 days and after 3 days I saw only 4 workers on the spark master
>> UI , 2 workers died with the same exception -
>>
>> Strange part is cluster was running stable for 2 days but on third day 2
>> workers abruptly died . I am see this error in one of the affected worker .
>> No job ran for 2 days.
>>
>>
>>
>> 2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed!
>> Waiting for master to reconnect...2016-02-14 01:12:59 ERROR Worker:75 -
>> Connection to master failed! Waiting for master to reconnect...2016-02-14
>> 01:13:10 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in
>> thread
>> Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]java.util.concurrent.RejectedExecutionException:
>> Task java.util.concurrent.FutureTask@514b13ad rejected from
>> java.util.concurrent.ThreadPoolExecutor@17f8ec8d[Running, pool size = 1,
>> active threads = 1, queued tasks = 0, completed tasks = 3]at
>> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>>at
>> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>>at
>> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>>at
>> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>>at
>> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)
>>at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
>>  at 
>> org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
>>at
>> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
>>at 
>> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
>>at
>> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
>>at 
>> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
>>at
>> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
>>at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>>at
>> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>at
>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>  at
>> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>at akka.actor.Actor$class.aroundReceive(Actor.scala:467)at
>> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
>>at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>  at akka.actor.ActorCell.invoke(ActorCell.scala:487)at
>> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)at
>> akka.dispatch.Mailbox.run(Mailbox.scala:220)at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>  at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>  at
>> 

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
Kartik,

   The *java.util.concurrent.RejectedExecutionException* stack trace can appear if
SPARK_MASTER_IP is configured wrongly on the worker nodes, for example if
SPARK_MASTER_IP is set to the master node's hostname while the workers try to
connect to the master node's IP. Check whether SPARK_MASTER_IP on the worker nodes
is exactly the same as what the Spark master web UI shows.


Thanks,
Prabhu Joseph

On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur  wrote:

> on spark 1.5.2
> I have a spark standalone cluster with 6 workers , I left the cluster idle
> for 3 days and after 3 days I saw only 4 workers on the spark master UI , 2
> workers died with the same exception -
>
> Strange part is cluster was running stable for 2 days but on third day 2
> workers abruptly died . I am see this error in one of the affected worker .
> No job ran for 2 days.
>
>
>
> 2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting
> for master to reconnect...2016-02-14 01:12:59 ERROR Worker:75 - Connection
> to master failed! Waiting for master to reconnect...2016-02-14 01:13:10
> ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread
> Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]java.util.concurrent.RejectedExecutionException:
> Task java.util.concurrent.FutureTask@514b13ad rejected from
> java.util.concurrent.ThreadPoolExecutor@17f8ec8d[Running, pool size = 1,
> active threads = 1, queued tasks = 0, completed tasks = 3]at
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>at
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>at
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)
>at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
>  at 
> org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
>at
> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
>at 
> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
>at
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
>at 
> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
>at
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
>at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>  at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>at akka.actor.Actor$class.aroundReceive(Actor.scala:467)at
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
>at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:487)at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)at
> akka.dispatch.Mailbox.run(Mailbox.scala:220)at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>
>
>


Spark worker abruptly dying after 2 days

2016-02-14 Thread Kartik Mathur
On Spark 1.5.2:
I have a Spark standalone cluster with 6 workers. I left the cluster idle for 3
days, and after 3 days I saw only 4 workers on the Spark master UI; 2 workers had
died with the same exception -

The strange part is that the cluster was running stable for 2 days, but on the
third day 2 workers abruptly died. I see this error in one of the affected workers.
No job ran for those 2 days.



2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting for master to reconnect...
2016-02-14 01:12:59 ERROR Worker:75 - Connection to master failed! Waiting for master to reconnect...
2016-02-14 01:13:10 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@514b13ad rejected from java.util.concurrent.ThreadPoolExecutor@17f8ec8d[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 3]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
    at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
    at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
    at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
    at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





How to join an RDD with a hive table?

2016-02-14 Thread SRK
Hi,

How can I join an RDD with a Hive table and retrieve only the records that I am
interested in? Suppose I have an RDD with 1,000 records and a Hive table with
100,000 records. I should be able to join the RDD with the Hive table by an id
and load only those 1,000 matching records from the Hive table, so that there are
no memory issues. Also, I am planning on storing the data in Hive in the form of
Parquet files. Any help on this is greatly appreciated.
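
A minimal sketch of one way to do this, assuming a Spark 1.x HiveContext, a hypothetical Hive table named "users" stored as Parquet, and an assumed "id" join column (all names here are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.broadcast

object RddHiveJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-hive-join"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // The small RDD (~1000 records) converted to a DataFrame with an "id" column.
    val smallDF = sc.parallelize(Seq((1L, "a"), (2L, "b"))).toDF("id", "payload")

    // table() is lazy: nothing is read until an action runs. The broadcast() hint
    // asks Spark to ship the small side to the executors, so the join streams
    // through the large Hive table once instead of pulling it into memory.
    val usersDF = hiveContext.table("users")
    val joined  = usersDF.join(broadcast(smallDF), "id")

    joined.show()
    sc.stop()
  }
}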

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-an-RDD-with-a-hive-table-tp26225.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to query a hive table from inside a map in Spark

2016-02-14 Thread Alex Kozlov
While this is possible via JDBC calls, it is not the best practice: you should
probably use broadcast variables instead.
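
A minimal sketch of the broadcast-variable approach in spark-shell, assuming a hypothetical "users" Hive table small enough to collect to the driver, with assumed "id"/"name" columns, and an illustrative sessions RDD keyed by userId:

import org.apache.spark.sql.hive.HiveContext

case class Session(userId: Long, event: String)

val hiveContext = new HiveContext(sc)

// Collect the user table once on the driver and broadcast it to the executors.
val userById = hiveContext.table("users")
  .select("id", "name")
  .collect()
  .map(r => r.getLong(0) -> r.getString(1))
  .toMap
val usersBc = sc.broadcast(userById)

// Hypothetical sessions RDD; in practice this comes from your batch or streaming source.
val sessionsRDD = sc.parallelize(Seq(Session(1L, "click"), Session(42L, "view")))

// Inside map, executors read the broadcast map instead of issuing per-record queries.
val enriched = sessionsRDD.map { s =>
  (s, usersBc.value.get(s.userId))   // None if the user is unknown
}
enriched.collect().foreach(println)

This works when the lookup table comfortably fits in driver and executor memory; for very large tables a join (or the JDBC route) is the fallback.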

On Sun, Feb 14, 2016 at 8:40 PM, SRK  wrote:

> Hi,
>
> Is it possible to query a hive table which has data stored in the form of a
> parquet file from inside map/partitions in Spark? My requirement is that I
> have a User table in Hive/hdfs and for each record inside a sessions RDD, I
> should be able to query the User table and if the User table already has a
> record for that userId, query the record and do further processing.
>
>
> Thanks!
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-query-a-hive-table-from-inside-a-map-in-Spark-tp26224.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Alex Kozlov
Praveen, the mode in which you run Spark (standalone, YARN, Mesos) is determined
when you create the SparkContext. You are right that spark-submit and spark-shell
create different SparkContexts.

In general, resource sharing is the task of the cluster scheduler, and there are
only three choices (standalone, YARN, Mesos) by default right now.
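
For illustration, a minimal sketch of how the master is fixed at SparkContext creation time; the master URL and settings below are assumptions, and the YARN modes additionally need HADOOP_CONF_DIR/YARN_CONF_DIR visible to the driver JVM:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("webservice-driver")
  .setMaster("yarn-client")            // or "spark://host:7077", "mesos://host:5050", "local[*]"
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)        // the deploy mode is locked in here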

On Sun, Feb 14, 2016 at 9:13 PM, praveen S  wrote:

> Even i was trying to launch spark jobs from webservice :
>
> But I thought you could run spark jobs in yarn mode only through
> spark-submit. Is my understanding not correct?
>
> Regards,
> Praveen
> On 15 Feb 2016 08:29, "Sabarish Sasidharan" <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Yes you can look at using the capacity scheduler or the fair scheduler
>> with YARN. Both allow using full cluster when idle. And both allow
>> considering cpu plus memory when allocating resources which is sort of
>> necessary with Spark.
>>
>> Regards
>> Sab
>> On 13-Feb-2016 10:11 pm, "Eugene Morozov" 
>> wrote:
>>
>>> Hi,
>>>
>>> I have several instances of the same web-service that is running some ML
>>> algos on Spark (both training and prediction) and do some Spark unrelated
>>> job. Each web-service instance creates their on JavaSparkContext, thus
>>> they're seen as separate applications by Spark, thus they're configured
>>> with separate limits of resources such as cores (I'm not concerned about
>>> the memory as much as about cores).
>>>
>>> With this set up, say 3 web service instances, each of them has just 1/3
>>> of cores. But it might happen, than only one instance is going to use
>>> Spark, while others are busy with Spark unrelated. I'd like in this case
>>> all Spark cores be available for the one that's in need.
>>>
>>> Ideally I'd like Spark cores just be available in total and the first
>>> app who needs it, takes as much as required from the available at the
>>> moment. Is it possible? I believe Mesos is able to set resources free if
>>> they're not in use. Is it possible with YARN?
>>>
>>> I'd appreciate if you could share your thoughts or experience on the
>>> subject.
>>>
>>> Thanks.
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>

--
Alex Kozlov
ale...@gmail.com


Re: Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Alexander Pivovarov
Consider Spark Streaming for real-time cases:
http://zdatainc.com/2014/08/real-time-streaming-apache-spark-streaming/
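
For example, a minimal Spark Streaming sketch; the socket source, host/port, and 5-second batch interval are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// Count words arriving on an assumed socket source in 5-second micro-batches.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()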

On Sun, Feb 14, 2016 at 7:28 PM, Divya Gehlot 
wrote:

> Hi,
> I would like to know difference between spark-shell and spark-submit in
> terms of real time scenarios.
>
> I am using Hadoop cluster with Spark on EC2.
>
>
> Thanks,
> Divya
>


Unable to insert overwrite table with Spark 1.5.2

2016-02-14 Thread Ramanathan R
Hi All,

Spark 1.5.2 does not seem to be backward compatible with functionality that
was available in earlier versions, at least in 1.3.1 and 1.4.1. It is not
possible to insert overwrite into an existing table that was read as a
DataFrame initially.

Our existing code base has few internal Hive tables being overwritten after
some join operations.

For e.g.
val PRODUCT_TABLE = "product_dim"
val productDimDF = hiveContext.table(PRODUCT_TABLE)
// Joins, filters ...
productDimDF.write.mode(SaveMode.Overwrite).insertInto(PRODUCT_TABLE)

This results in the exception -
org.apache.spark.sql.AnalysisException: Cannot overwrite table
`product_dim` that is also being read from.;
at
org.apache.spark.sql.execution.datasources.PreWriteCheck.failAnalysis(rules.scala:82)
at
org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$2.apply(rules.scala:155)
at
org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$2.apply(rules.scala:85)

Is there any configuration to disable this particular rule? Any pointers to
solve this would be very helpful.
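
One workaround sketch, not an official fix and with a made-up staging table name: materialize the result somewhere else first, so the final overwrite no longer reads from the table it writes to.

import org.apache.spark.sql.SaveMode

val PRODUCT_TABLE = "product_dim"
val STAGING_TABLE = "product_dim_staging"   // hypothetical intermediate table

val productDimDF = hiveContext.table(PRODUCT_TABLE)
// ... joins, filters ...

// Write the transformed data to a staging table first.
productDimDF.write.mode(SaveMode.Overwrite).saveAsTable(STAGING_TABLE)

// Then overwrite the original table from the staging copy, breaking the
// read-from/write-to cycle that triggers the PreWriteCheck rule.
hiveContext.table(STAGING_TABLE)
  .write.mode(SaveMode.Overwrite)
  .insertInto(PRODUCT_TABLE)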

Thanks,
Ram


Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread praveen S
Even i was trying to launch spark jobs from webservice :

But I thought you could run spark jobs in yarn mode only through
spark-submit. Is my understanding not correct?

Regards,
Praveen
On 15 Feb 2016 08:29, "Sabarish Sasidharan" 
wrote:

> Yes you can look at using the capacity scheduler or the fair scheduler
> with YARN. Both allow using full cluster when idle. And both allow
> considering cpu plus memory when allocating resources which is sort of
> necessary with Spark.
>
> Regards
> Sab
> On 13-Feb-2016 10:11 pm, "Eugene Morozov" 
> wrote:
>
>> Hi,
>>
>> I have several instances of the same web-service that is running some ML
>> algos on Spark (both training and prediction) and do some Spark unrelated
>> job. Each web-service instance creates their on JavaSparkContext, thus
>> they're seen as separate applications by Spark, thus they're configured
>> with separate limits of resources such as cores (I'm not concerned about
>> the memory as much as about cores).
>>
>> With this set up, say 3 web service instances, each of them has just 1/3
>> of cores. But it might happen, than only one instance is going to use
>> Spark, while others are busy with Spark unrelated. I'd like in this case
>> all Spark cores be available for the one that's in need.
>>
>> Ideally I'd like Spark cores just be available in total and the first app
>> who needs it, takes as much as required from the available at the moment.
>> Is it possible? I believe Mesos is able to set resources free if they're
>> not in use. Is it possible with YARN?
>>
>> I'd appreciate if you could share your thoughts or experience on the
>> subject.
>>
>> Thanks.
>> --
>> Be well!
>> Jean Morozov
>>
>


Re: which master option to view current running job in Spark UI

2016-02-14 Thread Sabarish Sasidharan
When running in YARN, you can use the YARN Resource Manager UI to get to
the ApplicationMaster url, irrespective of client or cluster mode.

Regards
Sab
On 15-Feb-2016 10:10 am, "Divya Gehlot"  wrote:

> Hi,
> I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as scala files
> .
> I am bit confused between using *master  *options
> I want to execute this spark job in YARN
>
> Curently running as
> spark-shell --properties-file  /TestDivya/Spark/Oracle.properties --jars
> /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path
> /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages
> com.databricks:spark-csv_2.10:1.1.0  *--master yarn-client *  -i
> /TestDivya/Spark/Test.scala
>
> with this option I cant see the currently running jobs in Spark WEB UI
> though it later appear in spark history server.
>
> My question with which --master option should I run my spark jobs so that
> I can view the currently running jobs in spark web UI .
>
> Thanks,
> Divya
>


How to query a hive table from inside a map in Spark

2016-02-14 Thread SRK
Hi,

Is it possible to query a Hive table whose data is stored in the form of Parquet
files from inside map/mapPartitions in Spark? My requirement is that I have a User
table in Hive/HDFS, and for each record inside a sessions RDD I should be able to
query the User table; if the User table already has a record for that userId, I
need to fetch that record and do further processing.


Thanks!







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-query-a-hive-table-from-inside-a-map-in-Spark-tp26224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



which master option to view current running job in Spark UI

2016-02-14 Thread Divya Gehlot
Hi,
I have a Hortonworks 2.3.4 cluster on EC2 and have Spark jobs as Scala files.
I am a bit confused about the *--master* options.
I want to execute these Spark jobs on YARN.

Currently running as:
spark-shell --properties-file  /TestDivya/Spark/Oracle.properties --jars
/usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path
/usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages
com.databricks:spark-csv_2.10:1.1.0  *--master yarn-client *  -i
/TestDivya/Spark/Test.scala

With this option I can't see the currently running jobs in the Spark web UI,
though they later appear in the Spark history server.

My question is: with which --master option should I run my Spark jobs so that I
can view the currently running jobs in the Spark web UI?

Thanks,
Divya


Re: support vector machine does not classify properly?

2016-02-14 Thread prem09




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/support-vector-machine-does-not-classify-properly-tp26216p26223.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
For YARN mode, you can set --executor-cores 1

-Original Message-
From: Sun, Rui [mailto:rui@intel.com] 
Sent: Monday, February 15, 2016 11:35 AM
To: Simon Hafner ; user 
Subject: RE: Running synchronized JRI code

Yes, JRI loads an R dynamic library into the executor JVM, which faces 
thread-safe issue when there are multiple task threads within the executor.

If you are running Spark on Standalone mode, it is possible to run multiple 
workers per node, and at the same time, limit the cores per worker to be 1. 

You could use RDD.pipe(), but you may need handle binary-text conversion as the 
input/output to/from the R process is string-based.

I am thinking if the demand like yours (calling R code in RDD transformations) 
is much desired, we may consider refactoring RRDD for this purpose, although it 
is currently intended for internal use by SparkR and not a public API. 

-Original Message-
From: Simon Hafner [mailto:reactorm...@gmail.com] 
Sent: Monday, February 15, 2016 5:09 AM
To: user 
Subject: Running synchronized JRI code

Hello

I'm currently running R code in an executor via JRI. Because R is 
single-threaded, any call to R needs to be wrapped in a `synchronized`. Now I 
can use a bit more than one core per executor, which is undesirable. Is there a 
way to tell spark that this specific application (or even specific UDF) needs 
multiple JVMs? Or should I switch from JRI to a pipe-based (slower) setup?

Cheers,
Simon

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Best way to bring up Spark with Cassandra (and Elasticsearch) in production.

2016-02-14 Thread Kevin Burton
Afternoon.

About 6 months ago I tried (and failed) to get Spark and Cassandra working
together in production due to dependency hell.

I'm going to give it another try!

Here's my general strategy.

I'm going to create a maven module for my code... with spark dependencies.

Then I'm going to get that to run and have unit tests for reading from
files and writing the data back out the way I want via spark jobs.

Then I'm going to setup cassandra unit to embed cassandra in my project.
Then I'm going to point Spark to Cassandra and have the same above code
work with Cassandra but instead of reading from a file it reads/writes to
C*.

Then once testing is working I'm going to setup spark in cluster mode with
the same dependencies.

Does this sound like a reasonable strategy?

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
Yes, JRI loads an R dynamic library into the executor JVM, which runs into 
thread-safety issues when there are multiple task threads within the executor.

If you are running Spark in standalone mode, it is possible to run multiple 
workers per node and, at the same time, limit the cores per worker to 1.

You could use RDD.pipe(), but you may need to handle binary-to-text conversion, 
as the input/output to/from the R process is string-based.
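
A minimal sketch of the RDD.pipe() route; the Rscript path and the line-oriented record format are assumptions:

// Each partition's records are written line by line to the external process's
// stdin; the lines it prints to stdout come back as a new RDD[String].
val input  = sc.parallelize(Seq("1,2.0", "2,3.5"))            // string-encoded records
val scored = input.pipe("/usr/bin/Rscript /path/to/score.R")  // assumed script path
scored.collect().foreach(println)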

I am thinking that if demand like yours (calling R code in RDD transformations) 
is common enough, we may consider refactoring RRDD for this purpose, although it 
is currently intended for internal use by SparkR and is not a public API.

-Original Message-
From: Simon Hafner [mailto:reactorm...@gmail.com] 
Sent: Monday, February 15, 2016 5:09 AM
To: user 
Subject: Running synchronized JRI code

Hello

I'm currently running R code in an executor via JRI. Because R is 
single-threaded, any call to R needs to be wrapped in a `synchronized`. Now I 
can use a bit more than one core per executor, which is undesirable. Is there a 
way to tell spark that this specific application (or even specific UDF) needs 
multiple JVMs? Or should I switch from JRI to a pipe-based (slower) setup?

Cheers,
Simon

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Divya Gehlot
Hi,
I would like to know the difference between spark-shell and spark-submit in
terms of real-time scenarios.

I am using a Hadoop cluster with Spark on EC2.


Thanks,
Divya


Re: Passing multiple jar files to spark-shell

2016-02-14 Thread Deng Ching-Mallete
Hi Mich,

For the --jars parameter, just pass in the jars as comma-delimited. As for
the --driver-class-path, make it colon-delimited -- similar to how you set
multiple paths for an environment variable (e.g. --driver-class-path
/home/hduser/jars/jconn4.jar:/home/hduse/jars/ojdbc6.jar). But if you're
already passing the jar files via the --jars param, you shouldn't need to
set them in the driver-class-path though, since they should already be
automatically added to the classpath.

HTH,
Deng

On Mon, Feb 15, 2016 at 7:35 AM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> Is there anyway one can pass multiple --driver-class-path and multiple
> –jars to spark shell.
>
>
>
> For example something as below with two jar files entries for Oracle
> (ojdbc6.jar) and Sybase IQ (jcoon4,jar)
>
>
>
> spark-shell --master spark://50.140.197.217:7077 --driver-class-path
> /home/hduser/jars/jconn4.jar  --driver-class-path
> /home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/jconn4.jar --jars
> /home/hduser/jars/ojdbc6.jar
>
>
>
>
>
> This works for one jar file only and you need to add the jar file to both 
> --driver-class-path
> and –jars. I have not managed to work for more than one type of JAR file
>
>
>
> Thanks,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>


Re: IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Saisai Shao
Hi Divya,

Would you please provide the full stack trace of the exception? From my
understanding --executor-cores should work; we could diagnose the failure better
with the full stack trace.

Performance depends on many different aspects; I'd recommend checking the Spark
web UI to understand the application's runtime behavior better.

spark-shell is a command-line Spark application for interactively executing
Spark jobs, whereas spark-submit is used to submit your own Spark applications
(http://spark.apache.org/docs/latest/submitting-applications.html).

Thanks
Saisai

On Mon, Feb 15, 2016 at 10:36 AM, Divya Gehlot 
wrote:

> Hi,
>
> I am starting spark-shell with following options :
> spark-shell --properties-file  /TestDivya/Spark/Oracle.properties --jars
> /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path
> /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages
> com.databricks:spark-csv_2.10:1.1.0  --master yarn-client --num-executors
> 10 --executor-cores 4 -i /TestDivya/Spark/Test.scala
>
> Got few queries :
> 1.Error :
> java.lang.IllegalStateException: SparkContext has been shutdown
>
> If I remove --executor-cores 4 .. It runs smoothly
>
> 2. with --num-executors 10 my spark job takes more time .
>  May I know why ?
>
> 3. Whats the difference between spark-shell and spark-submit
>
> I am new bee to Spark ..Apologies for such naive questions.
> Just  trying to figure out how to tune spark jobs to increase performance
> on Hadoop cluster on EC2.
> If anybody has real time experience ,please help me.
>
>
> Thanks,
> Divya
>


Re: coalesce and executor memory

2016-02-14 Thread Sabarish Sasidharan
I believe you will gain more understanding if you look at or use
mapPartitions()
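
For illustration, a minimal sketch of map vs. mapPartitions, assuming a simple record-reformatting function like the one described in this thread; the partition count and output path are placeholders:

val records = sc.parallelize(1 to 1000000, 8)

// map: the function is applied record by record.
val perRecord = records.map(r => s"id=$r")

// mapPartitions: the function receives one partition's iterator at a time.
// As long as the iterator is not materialized (no toList/toArray), records
// stream through, so memory use stays close to one record at a time plus any
// per-partition state you allocate yourself.
val perPartition = records.mapPartitions { iter =>
  iter.map(r => s"id=$r")
}
perPartition.saveAsTextFile("/tmp/mappartitions-demo")   // assumed output path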

Regards
Sab
On 15-Feb-2016 8:38 am, "Christopher Brady" 
wrote:

> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be performed on this RDD, even
> though it never gets that far.
>
> I can make it work by just increasing the number of partitions, but I was
> hoping to get a better understanding of how Spark works rather that just
> use trial and error every time I hit this issue.
>
>
> - Original Message -
> From: silvio.fior...@granturing.com
> To: christopher.br...@oracle.com, ko...@tresata.com
> Cc: user@spark.apache.org
> Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern
> Subject: RE: coalesce and executor memory
>
> Actually, rereading your email I see you're caching. But ‘cache’ uses
> MEMORY_ONLY. Do you see errors about losing partitions as your job is
> running?
>
>
>
> Are you sure you need to cache if you're just saving to disk? Can you try
> the coalesce without cache?
>
>
>
>
>
> *From: *Christopher Brady 
> *Sent: *Friday, February 12, 2016 8:34 PM
> *To: *Koert Kuipers ; Silvio Fiorito
> 
> *Cc: *user 
> *Subject: *Re: coalesce and executor memory
>
>
> Thank you for the responses. The map function just changes the format of
> the record slightly, so I don't think that would be the cause of the memory
> problem.
>
> So if I have 3 cores per executor, I need to be able to fit 3 partitions
> per executor within whatever I specify for the executor memory? Is there a
> way I can programmatically find a number of partitions I can coalesce down
> to without running out of memory? Is there some documentation where this is
> explained?
>
>
> On 02/12/2016 05:10 PM, Koert Kuipers wrote:
>
> in spark, every partition needs to fit in the memory available to the core
> processing it.
>
> as you coalesce you reduce number of partitions, increasing partition
> size. at some point the partition no longer fits in memory.
>
> On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Coalesce essentially reduces parallelism, so fewer cores are getting more
>> records. Be aware that it could also lead to loss of data locality,
>> depending on how far you reduce. Depending on what you’re doing in the map
>> operation, it could lead to OOM errors. Can you give more details as to
>> what the code for the map looks like?
>>
>>
>>
>>
>> On 2/12/16, 1:13 PM, "Christopher Brady" < 
>> christopher.br...@oracle.com> wrote:
>>
>> >Can anyone help me understand why using coalesce causes my executors to
>> >crash with out of memory? What happens during coalesce that increases
>> >memory usage so much?
>> >
>> >If I do:
>> >hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>> >
>> >everything works fine, but if I do:
>> >hadoopFile -> sample -> coalesce -> cache -> map ->
>> saveAsNewAPIHadoopFile
>> >
>> >my executors crash with out of memory exceptions.
>> >
>> >Is there any documentation that explains what causes the increased
>> >memory requirements with coalesce? It seems to be less of a problem if I
>> >coalesce into a larger number of partitions, but I'm not sure why this
>> >is. How would I estimate how much additional memory the coalesce
>> requires?
>> >
>> >Thanks.
>> >
>> >-
>> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >For additional commands, e-mail: 
>> user-h...@spark.apache.org
>> >
>>
>
>
>


Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Sabarish Sasidharan
It looks like your executors are running out of memory; YARN is not kicking
them out. Just increase the executor memory. Also consider increasing the
parallelism, i.e. the number of partitions.
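
A minimal sketch of those two knobs; the values and paths are placeholders to adjust for your data:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")
  .set("spark.executor.memory", "6g")   // raise the executor heap (placeholder value)

val sc = new SparkContext(conf)

// More partitions means smaller partitions, so skewed or large groups are less
// likely to blow up a single task's memory.
val data          = sc.textFile("/data/input")   // assumed input path
val repartitioned = data.repartition(400)        // placeholder partition count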

Regards
Sab
On 11-Feb-2016 5:46 am, "Nirav Patel"  wrote:

> In Yarn we have following settings enabled so that job can use virtual
> memory to have a capacity beyond physical memory off course.
>
> <property>
>   <name>yarn.nodemanager.vmem-check-enabled</name>
>   <value>false</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.pmem-check-enabled</name>
>   <value>false</value>
> </property>
>
> vmem to pmem ration is 2:1. However spark doesn't seem to be able to
> utilize this vmem limits
> we are getting following heap space error which seemed to be contained
> within spark executor.
>
> 16/02/09 23:08:06 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL 15: SIGTERM
> 16/02/09 23:08:06 ERROR executor.Executor: Exception in task 4.0 in stage
> 7.6 (TID 22363)
> java.lang.OutOfMemoryError: Java heap space
> at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
> at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
> at
> org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
> at
> org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:203)
> at
> org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:202)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:202)
> at
> org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
> at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
> at
> org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
> at
> org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
> at
> org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
>
>
> Yarn resource manager doesn't give any indication that whether container
> ran out of phycial or virtual memory limits.
>
> Also how to profile this container memory usage? We know our data is
> skewed so some of the executor will have large data (~2M RDD objects) to
> process. I used following as executorJavaOpts but it doesn't seem to work.
> -XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p'
> -XX:HeapDumpPath=/opt/cores/spark
>
>
>
>
>
>


Re: coalesce and executor memory

2016-02-14 Thread Christopher Brady


I tried it without the cache, but it didn't change anything. The reason for the 
cache is that other actions will be performed on this RDD, even though it never 
gets that far. 

I can make it work by just increasing the number of partitions, but I was 
hoping to get a better understanding of how Spark works rather that just use 
trial and error every time I hit this issue. 


- Original Message - 
From: silvio.fior...@granturing.com 
To: christopher.br...@oracle.com, ko...@tresata.com 
Cc: user@spark.apache.org 
Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern 
Subject: RE: coalesce and executor memory 





Actually, rereading your email I see you're caching. But ‘cache’ uses 
MEMORY_ONLY. Do you see errors about losing partitions as your job is running? 



Are you sure you need to cache if you're just saving to disk? Can you try the 
coalesce without cache? 






From: Christopher Brady 
Sent: Friday, February 12, 2016 8:34 PM 
To: Koert Kuipers ; Silvio Fiorito 
Cc: user 
Subject: Re: coalesce and executor memory 


Thank you for the responses. The map function just changes the format of the 
record slightly, so I don't think that would be the cause of the memory 
problem. 

So if I have 3 cores per executor, I need to be able to fit 3 partitions per 
executor within whatever I specify for the executor memory? Is there a way I 
can programmatically find a number of partitions I can coalesce down to without 
running out of memory? Is there some documentation where this is explained? 



On 02/12/2016 05:10 PM, Koert Kuipers wrote: 




in spark, every partition needs to fit in the memory available to the core 
processing it. 

as you coalesce you reduce number of partitions, increasing partition size. at 
some point the partition no longer fits in memory. 



On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito < silvio.fior...@granturing.com 
> wrote: 


Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like? 






On 2/12/16, 1:13 PM, "Christopher Brady" < christopher.br...@oracle.com > 
wrote: 

>Can anyone help me understand why using coalesce causes my executors to 
>crash with out of memory? What happens during coalesce that increases 
>memory usage so much? 
> 
>If I do: 
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile 
> 
>everything works fine, but if I do: 
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile 
> 
>my executors crash with out of memory exceptions. 
> 
>Is there any documentation that explains what causes the increased 
>memory requirements with coalesce? It seems to be less of a problem if I 
>coalesce into a larger number of partitions, but I'm not sure why this 
>is. How would I estimate how much additional memory the coalesce requires? 
> 
>Thanks. 
> 
>- 
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>For additional commands, e-mail: user-h...@spark.apache.org 
> 




Re: newbie unable to write to S3 403 forbidden error

2016-02-14 Thread Sabarish Sasidharan
Make sure you are using an S3 bucket in the same region. Also, I would access
the bucket as s3n://bucketname/foldername.

You can test privileges using the S3 command-line client.

Also, if you are using instance profiles you don't need to specify the access
and secret keys. No harm in specifying them though.
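
A minimal sketch of that URI form from the driver side; the bucket/folder names and the use of environment variables for the keys are assumptions:

// Plain s3n://<bucket>/<folder> path, without the regional endpoint host.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val rdd     = sc.parallelize(Seq("""{"tweet":"hello"}"""))
val dirPath = s"s3n://com.pws.twitter/json-${System.currentTimeMillis()}"
rdd.saveAsTextFile(dirPath)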

Regards
Sab
On 12-Feb-2016 2:46 am, "Andy Davidson" 
wrote:

> I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I
> am using the standalone cluster manager
>
> My java streaming app is not able to write to s3. It appears to be some
> for of permission problem.
>
> Any idea what the problem might be?
>
> I tried use the IAM simulator to test the policy. Everything seems okay.
> Any idea how I can debug this problem?
>
> Thanks in advance
>
> Andy
>
> JavaSparkContext jsc = new JavaSparkContext(conf);
>
> // I did not include the full key in my email
>// the keys do not contain ‘\’
>// these are the keys used to create the cluster. They belong to
> the IAM user andy
>
> jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAJREX"
> );
>
> jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",
> "uBh9v1hdUctI23uvq9qR");
>
>
>
>   private static void saveTweets(JavaDStream jsonTweets, String
> outputURI) {
>
> jsonTweets.foreachRDD(new VoidFunction2() {
>
> private static final long serialVersionUID = 1L;
>
>
> @Override
>
> public void call(JavaRDD rdd, Time time) throws
> Exception {
>
> if(!rdd.isEmpty()) {
>
> // bucket name is ‘com.pws.twitter’ it has a folder ‘json'
>
> String dirPath = "s3n://
> s3-us-west-1.amazonaws.com/com.pws.twitter/*json” *+ "-" + time
> .milliseconds();
>
> rdd.saveAsTextFile(dirPath);
>
> }
>
> }
>
> });
>
>
>
>
> Bucket name : com.pws.titter
> Bucket policy (I replaced the account id)
>
> {
> "Version": "2012-10-17",
> "Id": "Policy1455148808376",
> "Statement": [
> {
> "Sid": "Stmt1455148797805",
> "Effect": "Allow",
> "Principal": {
> "AWS": "arn:aws:iam::123456789012:user/andy"
> },
> "Action": "s3:*",
> "Resource": "arn:aws:s3:::com.pws.twitter/*"
> }
> ]
> }
>
>
>


Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Sabarish Sasidharan
Yes you can look at using the capacity scheduler or the fair scheduler with
YARN. Both allow using full cluster when idle. And both allow considering
cpu plus memory when allocating resources which is sort of necessary with
Spark.

Regards
Sab
On 13-Feb-2016 10:11 pm, "Eugene Morozov" 
wrote:

> Hi,
>
> I have several instances of the same web-service that is running some ML
> algos on Spark (both training and prediction) and do some Spark unrelated
> job. Each web-service instance creates their on JavaSparkContext, thus
> they're seen as separate applications by Spark, thus they're configured
> with separate limits of resources such as cores (I'm not concerned about
> the memory as much as about cores).
>
> With this set up, say 3 web service instances, each of them has just 1/3
> of cores. But it might happen, than only one instance is going to use
> Spark, while others are busy with Spark unrelated. I'd like in this case
> all Spark cores be available for the one that's in need.
>
> Ideally I'd like Spark cores just be available in total and the first app
> who needs it, takes as much as required from the available at the moment.
> Is it possible? I believe Mesos is able to set resources free if they're
> not in use. Is it possible with YARN?
>
> I'd appreciate if you could share your thoughts or experience on the
> subject.
>
> Thanks.
> --
> Be well!
> Jean Morozov
>


IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Divya Gehlot
Hi,

I am starting spark-shell with following options :
spark-shell --properties-file  /TestDivya/Spark/Oracle.properties --jars
/usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path
/usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages
com.databricks:spark-csv_2.10:1.1.0  --master yarn-client --num-executors
10 --executor-cores 4 -i /TestDivya/Spark/Test.scala

I have a few queries:
1. Error:
java.lang.IllegalStateException: SparkContext has been shutdown

If I remove --executor-cores 4, it runs smoothly.

2. With --num-executors 10 my Spark job takes more time.
   May I know why?

3. What is the difference between spark-shell and spark-submit?

I am a newbie to Spark; apologies for such naive questions.
I am just trying to figure out how to tune Spark jobs to increase performance
on a Hadoop cluster on EC2.
If anybody has real-world experience, please help me.


Thanks,
Divya


Re: Passing multiple jar files to spark-shell

2016-02-14 Thread Sathish Kumaran Vairavelu
--jars takes comma separated values.

On Sun, Feb 14, 2016 at 5:35 PM Mich Talebzadeh  wrote:

> Hi,
>
>
>
> Is there any way one can pass multiple --driver-class-path and multiple
> --jars entries to spark-shell?
>
>
>
> For example something like the command below, with two jar file entries for Oracle
> (ojdbc6.jar) and Sybase IQ (jconn4.jar):
>
>
>
> spark-shell --master spark://50.140.197.217:7077 --driver-class-path
> /home/hduser/jars/jconn4.jar  --driver-class-path
> /home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/jconn4.jar --jars
> /home/hduser/jars/ojdbc6.jar
>
>
>
>
>
> This works for one jar file only, and you need to add the jar file to both
> --driver-class-path
> and --jars. I have not managed to make it work for more than one type of JAR file.
>
>
>
> Thanks,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>


RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Mich Talebzadeh
Thanks very much Sab that did the trick.

 

I can join a FACT table from Hive (ORC partitioned + bucketed) with dimension 
tables from Oracle 

 

It sounds like HiveContext is a superset of SQLContext.
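
For the archive, the shape of the fix is to register every temporary table through
the HiveContext, including the JDBC-backed ones; a condensed sketch reusing the
definitions from this thread (credentials elided as before):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// FACT table from Hive
hiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM oraclehadoop.sales")
  .registerTempTable("t_s")

// Dimension tables from Oracle over JDBC, loaded through the same context
hiveContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
      "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)",
      "user" -> "sh", "password" -> "xxx")).registerTempTable("t_c")

hiveContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
      "dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM sh.times)",
      "user" -> "sh", "password" -> "xxx")).registerTempTable("t_t")

// All three names now resolve inside a single context
hiveContext.sql("SELECT count(1) FROM t_s").collect.foreach(println)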

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: Sabarish Sasidharan [mailto:sabarish.sasidha...@manthan.com] 
Sent: 14 February 2016 23:37
To: Mich Talebzadeh 
Cc: user 
Subject: RE: Trying to join a registered Hive table as temporary with two 
Oracle tables registered as temporary in Spark

 

The Hive context can be used instead of the SQL context even when you are accessing
data from non-Hive sources such as MySQL or Postgres. It has better SQL
support than the SQLContext because it uses the HiveQL parser.

Regards
Sab

On 15-Feb-2016 3:07 am, "Mich Talebzadeh"  wrote:

Thanks.

 

I tried to access Hive table via JDBC (it works) through sqlContext

 

 

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext = 
org.apache.spark.sql.SQLContext@4f60415b 
 

 

scala> val s = sqlContext.load("jdbc",

 | Map("url" -> "jdbc:hive2://rhes564:10010/oraclehadoop",

 | "dbtable" -> "SALES",

 | "user" -> "hduser",

 | "password" -> "xxx"))

warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details

java.sql.SQLException: Method not supported

 

In general one should expect this to work

 

The attraction of Spark is to cache these tables in memory via registering them 
as temporary tables and do the queries there.

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 14 February 2016 21:07
To: Mich Talebzadeh
Cc: user
Subject: Re: Trying to join a registered Hive table as temporary with two 
Oracle tables registered as temporary in Spark

 

Why can't you use the jdbc in hive context? I don't think sharing data across 
contexts are allowed. 

On 15 Feb 2016 07:22, "Mich Talebzadeh"  wrote:

I am intending to get a table from Hive and register it as temporary table in 
Spark.

 

I have created contexts for both Hive and Spark as below

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

//

 

I get the Hive table as below using HiveContext

 

//Get the FACT table from Hive

//

var s = hiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
oraclehadoop.sales")

 

s.registerTempTable("t_s")

 

This works fine using HiveContext

 

scala> hiveContext.sql("select count(1) from t_s").collect.foreach(println)

[4991761]

 

Now I use JDBC to get data from two Oracle tables and registar them as 
temporary tables using sqlContext

 

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM 
sh.channels)",

"user" -> "sh",

"password" -> "xxx"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM 
sh.times)",

"user" -> "sh",

"password" -> "sxxx"))

 

And register them as temporary tables

 

c.registerTempTable("t_c")


RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Sabarish Sasidharan
The Hive context can be used instead of the SQL context even when you are
accessing data from non-Hive sources such as MySQL or Postgres. It has
better SQL support than the SQLContext because it uses the HiveQL parser.

Regards
Sab
On 15-Feb-2016 3:07 am, "Mich Talebzadeh"  wrote:

> Thanks.
>
>
>
> I tried to access Hive table via JDBC (it works) through sqlContext
>
>
>
>
>
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> sqlContext: org.apache.spark.sql.SQLContext =
> org.apache.spark.sql.SQLContext@4f60415b
>
>
>
> scala> val s = sqlContext.load("jdbc",
>
>  | Map("url" -> "jdbc:hive2://rhes564:10010/oraclehadoop",
>
>  | "dbtable" -> "SALES",
>
>  | "user" -> "hduser",
>
>  | "password" -> "xxx"))
>
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
>
> *java.sql.SQLException: Method not supported*
>
>
>
> In general one should expect this to work
>
>
>
> The attraction of Spark is to cache these tables in memory via registering
> them as temporary tables and do the queries there.
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* 14 February 2016 21:07
> *To:* Mich Talebzadeh 
> *Cc:* user 
> *Subject:* Re: Trying to join a registered Hive table as temporary with
> two Oracle tables registered as temporary in Spark
>
>
>
> Why can't you use the jdbc in hive context? I don't think sharing data
> across contexts are allowed.
>
> On 15 Feb 2016 07:22, "Mich Talebzadeh"  wrote:
>
> I am intending to get a table from Hive and register it as temporary table
> in Spark.
>
>
>
> I have created contexts for both Hive and Spark as below
>
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> //
>
>
>
> I get the Hive table as below using HiveContext
>
>
>
> //Get the FACT table from Hive
>
> //
>
> var s = hiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
> oraclehadoop.sales")
>
>
>
> s.registerTempTable("t_s")
>
>
>
> This works fine using HiveContext
>
>
>
> scala> hiveContext.sql("select count(1) from
> t_s").collect.foreach(println)
>
> [4991761]
>
>
>
> Now I use JDBC to get data from two Oracle tables and registar them as
> temporary tables using sqlContext
>
>
>
> val c = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
> sh.channels)",
>
> "user" -> "sh",
>
> "password" -> "xxx"))
>
>
>
> val t = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC
> FROM sh.times)",
>
> "user" -> "sh",
>
> "password" -> "sxxx"))
>
>
>
> And register them as temporary tables
>
>
>
> c.registerTempTable("t_c")
>
> t.registerTempTable("t_t")
>
> //
>
>
>
> Now trying to do SQL on three tables using sqlContext. However it cannot
> see the hive table
>
>
>
> var sqltext : String = ""
>
> sqltext = """
>
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
>
> FROM
>
> (
>
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
> SUM(t_s.AMOUNT_SOLD) AS TotalSales
>
> FROM t_s, t_t, t_c
>
> WHERE t_s.TIME_ID = t_t.TIME_ID
>
> AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
>
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 10
>
>
>
>
>
> sqlContext.sql(sqltext).collect.foreach(println)
>
>
>
> *org.apache.spark.sql.AnalysisException: no such table t_s; line 5 pos 10*
>
>
>
> I guess this is due to two  different Data Frame used. Is there any
> solution? For example can I transorm from HiveContext to sqlContext?
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary 

Passing multiple jar files to spark-shell

2016-02-14 Thread Mich Talebzadeh
Hi,

 

Is there any way one can pass multiple --driver-class-path and multiple --jars
entries to spark-shell?

 

For example something like the command below, with two jar file entries for Oracle
(ojdbc6.jar) and Sybase IQ (jconn4.jar):

 

spark-shell --master spark://50.140.197.217:7077 --driver-class-path
/home/hduser/jars/jconn4.jar  --driver-class-path
/home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/jconn4.jar --jars
/home/hduser/jars/ojdbc6.jar

 

 

This works for one jar file only, and you need to add the jar file to both
--driver-class-path and --jars. I have not managed to make it work for more than
one type of JAR file.

 

Thanks,

 

 

Dr Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

 



Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Mich Talebzadeh
I am intending to get a table from Hive and register it as temporary table
in Spark.

 

I have created contexts for both Hive and Spark as below

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

//

 

I get the Hive table as below using HiveContext

 

//Get the FACT table from Hive

//

var s = hiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
oraclehadoop.sales")

 

s.registerTempTable("t_s")

 

This works fine using HiveContext

 

scala> hiveContext.sql("select count(1) from t_s").collect.foreach(println)

[4991761]

 

Now I use JDBC to get data from two Oracle tables and register them as
temporary tables using sqlContext

 

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
sh.channels)",

"user" -> "sh",

"password" -> "xxx"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM
sh.times)",

"user" -> "sh",

"password" -> "sxxx"))

 

And register them as temporary tables

 

c.registerTempTable("t_c")

t.registerTempTable("t_t")

//

 

Now I am trying to run SQL across the three tables using sqlContext. However, it
cannot see the Hive table

 

var sqltext : String = ""

sqltext = """

SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)

FROM

(

SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
SUM(t_s.AMOUNT_SOLD) AS TotalSales

FROM t_s, t_t, t_c

WHERE t_s.TIME_ID = t_t.TIME_ID

AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID

GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

) rs

LIMIT 10

 

 

sqlContext.sql(sqltext).collect.foreach(println)

 

org.apache.spark.sql.AnalysisException: no such table t_s; line 5 pos 10

 

I guess this is because the data frames were registered through two different
contexts. Is there any solution? For example, can I transform from HiveContext to sqlContext?

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

 



Re: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread ayan guha
Why can't you use the JDBC load in the Hive context? I don't think sharing data
across contexts is allowed.
On 15 Feb 2016 07:22, "Mich Talebzadeh"  wrote:

> I am intending to get a table from Hive and register it as temporary table
> in Spark.
>
>
>
> I have created contexts for both Hive and Spark as below
>
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> //
>
>
>
> I get the Hive table as below using HiveContext
>
>
>
> //Get the FACT table from Hive
>
> //
>
> var s = hiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
> oraclehadoop.sales")
>
>
>
> s.registerTempTable("t_s")
>
>
>
> This works fine using HiveContext
>
>
>
> scala> hiveContext.sql("select count(1) from
> t_s").collect.foreach(println)
>
> [4991761]
>
>
>
> Now I use JDBC to get data from two Oracle tables and registar them as
> temporary tables using sqlContext
>
>
>
> val c = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
> sh.channels)",
>
> "user" -> "sh",
>
> "password" -> "xxx"))
>
>
>
> val t = sqlContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC
> FROM sh.times)",
>
> "user" -> "sh",
>
> "password" -> "sxxx"))
>
>
>
> And register them as temporary tables
>
>
>
> c.registerTempTable("t_c")
>
> t.registerTempTable("t_t")
>
> //
>
>
>
> Now trying to do SQL on three tables using sqlContext. However it cannot
> see the hive table
>
>
>
> var sqltext : String = ""
>
> sqltext = """
>
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
>
> FROM
>
> (
>
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
> SUM(t_s.AMOUNT_SOLD) AS TotalSales
>
> FROM t_s, t_t, t_c
>
> WHERE t_s.TIME_ID = t_t.TIME_ID
>
> AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
>
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 10
>
>
>
>
>
> sqlContext.sql(sqltext).collect.foreach(println)
>
>
>
> *org.apache.spark.sql.AnalysisException: no such table t_s; line 5 pos 10*
>
>
>
> I guess this is due to two  different Data Frame used. Is there any
> solution? For example can I transorm from HiveContext to sqlContext?
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>


Running synchronized JRI code

2016-02-14 Thread Simon Hafner
Hello

I'm currently running R code in an executor via JRI. Because R is
single-threaded, any call to R needs to be wrapped in a
`synchronized`. Now I can use a bit more than one core per executor,
which is undesirable. Is there a way to tell spark that this specific
application (or even specific UDF) needs multiple JVMs? Or should I
switch from JRI to a pipe-based (slower) setup?

Cheers,
Simon

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Error: Not enough space to cache partition rdd

2016-02-14 Thread ayan guha
Have you tried repartitioning to a larger number of partitions? Also, I would
suggest increasing the number of executors and giving each of them a smaller
amount of memory.
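A sketch of what that looks like in code, together with a spillable storage level
(input path, partition count, k and iterations are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val points = sc.textFile("hdfs:///tmp/kmeans-input")            // illustrative path
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .repartition(400)                                             // more, smaller partitions
  .persist(StorageLevel.MEMORY_AND_DISK)                        // spill rather than fail to cache

val model = KMeans.train(points, 10, 20)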
On 15 Feb 2016 06:49, "gustavolacerdas"  wrote:

> I have a machine with 96GB and 24 cores. I'm trying to run a k-means
> algorithm with 30GB of input data. My spark-defaults.conf file are
> configured like that: spark.driver.memory 80g spark.executor.memory 70g
> spark.network.timeout 1200s spark.rdd.compress true
> spark.broadcast.compress true But i always get this error Spark Error: Not
> enough space to cache partition rdd I changed the k-means code to persist
> the rdd.cache(MEMORY_AND_DISK) but didn't work.
> --
> View this message in context: Spark Error: Not enough space to cache
> partition rdd
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Spark Certification

2016-02-14 Thread ayan guha
Thanks. Do we have any forum or study group for certification aspirants? I
would like to join.
On 15 Feb 2016 05:53, "Olivier Girardot" 
wrote:

> It does not contain (as of yet) anything > 1.3 (for example in depth
> knowledge of the Dataframe API)
> but you need to know about all the modules (Core, Streaming, SQL, MLLib,
> GraphX)
>
> Regards,
>
> Olivier.
>
> 2016-02-11 19:31 GMT+01:00 Prem Sure :
>
>> I did recently. it includes MLib & Graphx too and I felt like exam
>> content covered all topics till 1.3 and not the > 1.3 versions of spark.
>>
>>
>> On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri 
>> wrote:
>>
>>> I am planning to do that with databricks
>>> http://go.databricks.com/spark-certified-developer
>>>
>>> Regards,
>>> Janardhan
>>>
>>> On Thu, Feb 11, 2016 at 2:00 PM, Timothy Spann 
>>> wrote:
>>>
 I was wondering that as well.

 Also is it fully updated for 1.6?

 Tim
 http://airisdata.com/
 http://sparkdeveloper.com/


 From: naga sharathrayapati 
 Date: Wednesday, February 10, 2016 at 11:36 PM
 To: "user@spark.apache.org" 
 Subject: Spark Certification

 Hello All,

 I am planning on taking Spark Certification and I was wondering If one
 has to be well equipped with  MLib & GraphX as well or not ?

 Please advise

 Thanks

>>>
>>>
>>
>
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Mich Talebzadeh
Thanks.

 

I tried to access Hive table via JDBC (it works) through sqlContext

 

 

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext = 
org.apache.spark.sql.SQLContext@4f60415b

 

scala> val s = sqlContext.load("jdbc",

 | Map("url" -> "jdbc:hive2://rhes564:10010/oraclehadoop",

 | "dbtable" -> "SALES",

 | "user" -> "hduser",

 | "password" -> "xxx"))

warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details

java.sql.SQLException: Method not supported

 

In general one should expect this to work

 

The attraction of Spark is to cache these tables in memory via registering them 
as temporary tables and do the queries there.

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: ayan guha [mailto:guha.a...@gmail.com] 
Sent: 14 February 2016 21:07
To: Mich Talebzadeh 
Cc: user 
Subject: Re: Trying to join a registered Hive table as temporary with two 
Oracle tables registered as temporary in Spark

 

Why can't you use the jdbc in hive context? I don't think sharing data across 
contexts are allowed. 

On 15 Feb 2016 07:22, "Mich Talebzadeh"  wrote:

I am intending to get a table from Hive and register it as temporary table in 
Spark.

 

I have created contexts for both Hive and Spark as below

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

//

 

I get the Hive table as below using HiveContext

 

//Get the FACT table from Hive

//

var s = hiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
oraclehadoop.sales")

 

s.registerTempTable("t_s")

 

This works fine using HiveContext

 

scala> hiveContext.sql("select count(1) from t_s").collect.foreach(println)

[4991761]

 

Now I use JDBC to get data from two Oracle tables and registar them as 
temporary tables using sqlContext

 

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM 
sh.channels)",

"user" -> "sh",

"password" -> "xxx"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM 
sh.times)",

"user" -> "sh",

"password" -> "sxxx"))

 

And register them as temporary tables

 

c.registerTempTable("t_c")

t.registerTempTable("t_t")

//

 

Now trying to do SQL on three tables using sqlContext. However it cannot see 
the hive table 

 

var sqltext : String = ""

sqltext = """

SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)

FROM

(

SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, 
SUM(t_s.AMOUNT_SOLD) AS TotalSales

FROM t_s, t_t, t_c

WHERE t_s.TIME_ID = t_t.TIME_ID

AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID

GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

) rs

LIMIT 10

 

 

sqlContext.sql(sqltext).collect.foreach(println)

 

org.apache.spark.sql.AnalysisException: no such table t_s; line 5 pos 10

 

I guess this is due to two  different Data Frame used. Is there any solution? 
For example can I transorm from HiveContext to sqlContext?

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 



Re: Spark Certification

2016-02-14 Thread Olivier Girardot
It does not contain (as of yet) anything > 1.3 (for example in depth
knowledge of the Dataframe API)
but you need to know about all the modules (Core, Streaming, SQL, MLLib,
GraphX)

Regards,

Olivier.

2016-02-11 19:31 GMT+01:00 Prem Sure :

> I did it recently. It includes MLlib & GraphX too, and I felt like the exam content
> covered all topics up to 1.3 and not the post-1.3 versions of Spark.
>
>
> On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri 
> wrote:
>
>> I am planning to do that with databricks
>> http://go.databricks.com/spark-certified-developer
>>
>> Regards,
>> Janardhan
>>
>> On Thu, Feb 11, 2016 at 2:00 PM, Timothy Spann 
>> wrote:
>>
>>> I was wondering that as well.
>>>
>>> Also is it fully updated for 1.6?
>>>
>>> Tim
>>> http://airisdata.com/
>>> http://sparkdeveloper.com/
>>>
>>>
>>> From: naga sharathrayapati 
>>> Date: Wednesday, February 10, 2016 at 11:36 PM
>>> To: "user@spark.apache.org" 
>>> Subject: Spark Certification
>>>
>>> Hello All,
>>>
>>> I am planning on taking Spark Certification and I was wondering If one
>>> has to be well equipped with  MLib & GraphX as well or not ?
>>>
>>> Please advise
>>>
>>> Thanks
>>>
>>
>>
>


-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94


Spark Error: Not enough space to cache partition rdd

2016-02-14 Thread gustavolacerdas
I have a machine with 96GB and 24 cores. I'm trying to run a k-means algorithm
with 30GB of input data. My spark-defaults.conf file is configured like this:

spark.driver.memory       80g
spark.executor.memory     70g
spark.network.timeout     1200s
spark.rdd.compress        true
spark.broadcast.compress  true

But I always get the error "Spark Error: Not enough space to cache partition rdd".
I changed the k-means code to persist the RDD with MEMORY_AND_DISK, but that didn't work.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Error-Not-enough-space-to-cache-partition-rdd-tp26222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: GroupedDataset needs a mapValues

2016-02-14 Thread Koert Kuipers
great, by adding a little implicit wrapper i can use algebird's
MonoidAggregator, which gives me the equivalent of GroupedDataset.mapValues
(by using Aggregator.composePrepare)

i am a little surprised you require a monoid and not just a semigroup. but
probably the right choice given possibly empty datasets.

i do seem to be running into SPARK-12696 for some aggregators, so i
will wait for spark 1.6.1

also i am having no luck using the aggregators with DataFrame instead of
Dataset. for example:

lazy val ds1 = sc.makeRDD(List(("a", 1), ("a", 2), ("a", 3))).toDS
ds1.toDF.groupBy($"_1").agg(Aggregator.toList[(String, Int)]).show

gives me:
[info]   org.apache.spark.sql.AnalysisException: unresolved operator
'Aggregate [_1#50],
[_1#50,(AggregatorAdapter(),mode=Complete,isDistinct=false) AS
AggregatorAdapter()#61];
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
[info]   at
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:203)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
[info]   at
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
[info]   at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
[info]   at
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
[info]   at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
[info]   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
[info]   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)

my motivation for trying the DataFrame version is that it takes in an
unlimited number of aggregators via:
GroupedData.agg(expr: Column, exprs: Column*): DataFrame
there is no equivalent in GroupedDataset.
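
as a stopgap for the original question, the 1.6 API can at least express
group-then-reduce-values via mapGroups; a minimal sketch (it buffers each group's
iterator, so it is a workaround rather than a true partially aggregating reduce):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val ds = sc.makeRDD(List(("a", 1), ("a", 2), ("b", 3))).toDS()
val reduced = ds.groupBy(_._1).mapGroups { (k, values) =>
  (k, values.map(_._2).reduce(_ + _))   // reduce the values only, carrying the key once
}
reduced.show()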


On Sun, Feb 14, 2016 at 12:31 AM, Michael Armbrust 
wrote:

> Instead of grouping with a lambda function, you can do it with a column
> expression to avoid materializing an unnecessary tuple:
>
> df.groupBy($"_1")
>
> Regarding the mapValues, you can do something similar using an Aggregator
> ,
> but I agree that we should consider something lighter weight like the
> mapValues you propose.
>
> On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers  wrote:
>
>> i have a Dataset[(K, V)]
>> i would like to group by k and then reduce V using a function (V, V) => V
>> how do i do this?
>>
>> i would expect something like:
>> val ds = Dataset[(K, V)] ds.groupBy(_._1).mapValues(_._2).reduce(f)
>> or better:
>> ds.grouped.reduce(f)  # grouped only works on Dataset[(_, _)] and i dont
>> care about java api
>>
>> but there is no mapValues or grouped. ds.groupBy(_._1) gives me a
>> GroupedDataset[(K, (K, V))] which is inconvenient. i could carry the key
>> through the reduce operation but that seems ugly and inefficient.
>>
>> any thoughts?
>>
>>
>>
>


Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Olivier Girardot
You can also activate detailed GC prints to get more info.
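
Concretely, that can go through the executor JVM options; a minimal sketch
(standard HotSpot flags, log path illustrative):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/tmp/executor-gc.log")

The same value can also be set in spark-defaults.conf or passed via --conf on
spark-submit; the GC lines then show up in each executor's stderr or the given log file.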

2016-02-11 7:43 GMT+01:00 Shiva Ramagopal :

> How are you submitting/running the job - via spark-submit or as a plain
> old Java program?
>
> If you are using spark-submit, you can control the memory setting via the
> configuration parameter spark.executor.memory in spark-defaults.conf.
>
> If you are running it as a Java program, use -Xmx to set the maximum heap
> size.
>
> On Thu, Feb 11, 2016 at 5:46 AM, Nirav Patel 
> wrote:
>
>> In Yarn we have following settings enabled so that job can use virtual
>> memory to have a capacity beyond physical memory off course.
>>
>> 
>> yarn.nodemanager.vmem-check-enabled
>> false
>> 
>>
>> 
>> yarn.nodemanager.pmem-check-enabled
>> false
>> 
>>
>> The vmem to pmem ratio is 2:1. However, Spark doesn't seem to be able to
>> utilize these vmem limits;
>> we are getting the following heap space error, which seems to be contained
>> within the Spark executor.
>>
>> 16/02/09 23:08:06 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
>> SIGNAL 15: SIGTERM
>> 16/02/09 23:08:06 ERROR executor.Executor: Exception in task 4.0 in stage
>> 7.6 (TID 22363)
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
>> at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
>> at
>> org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
>> at
>> org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:203)
>> at
>> org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:202)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:202)
>> at
>> org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
>> at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
>> at
>> org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
>> at
>> org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
>> at
>> org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
>> at
>> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>>
>>
>>
>> Yarn resource manager doesn't give any indication that whether container
>> ran out of phycial or virtual memory limits.
>>
>> Also how to profile this container memory usage? We know our data is
>> skewed so some of the executor will have large data (~2M RDD objects) to
>> process. I used following as executorJavaOpts but it doesn't seem to work.
>> -XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p'
>> -XX:HeapDumpPath=/opt/cores/spark
>>
>>
>>
>>
>>
>>
>> [image: What's New with Xactly] 
>>
>>   [image: LinkedIn]
>>   [image: Twitter]
>>   [image: Facebook]
>>   [image: YouTube]
>> 
>
>
>


-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94


RE: using udf to convert Oracle number column in Data Frame

2016-02-14 Thread Mich Talebzadeh
Hi Ted,

Thanks for this. If generic functions exist then they are always faster and
more efficient than UDFs, in my experience. For example, writing a UDF to do
standard deviation in Oracle (needed this one for Oracle TimesTen IMDB) turned
out not to be any quicker than Oracle's own function STDDEV().

 

Unfortunately all columns defined as NUMBER, NUMBER(10,2) etc. cause an overflow in
Spark. However, they map fine in Hive using BigInt or NUMERIC(10,2).

 

So basically in the JDBC query I used the Oracle TO_CHAR() function to convert these
into strings, and that seems to be OK as TO_CHAR() is a built-in Oracle function
and not a UDF.
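
For completeness, the toDouble UDF from earlier in this thread can also be applied
directly, without a temporary table; a sketch against the DataFrame s loaded in the
quoted message below (column name as in the original post):

import org.apache.spark.sql.functions.{col, sum, udf}

val toDouble = udf((d: java.math.BigDecimal) => d.toString.toDouble)

// Add a Double-typed copy of the NUMBER-backed column, then aggregate as usual
val converted = s.withColumn("PROD_ID_D", toDouble(col("PROD_ID")))
converted.agg(sum(col("PROD_ID_D"))).show()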

 

 

Thanks again

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: 13 February 2016 18:36
To: Mich Talebzadeh 
Cc: user 
Subject: Re: using udf to convert Oracle number column in Data Frame

 

Please take a look at 
sql/core/src/main/scala/org/apache/spark/sql/functions.scala :

 

  def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {

UserDefinedFunction(f, dataType, None)

 

And sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala :

 

  test("udf") {

val foo = udf((a: Int, b: String) => a.toString + b)

 

checkAnswer(

  // SELECT *, foo(key, value) FROM testData

  testData.select($"*", foo('key, 'value)).limit(3),

 

Cheers

 

On Sat, Feb 13, 2016 at 9:55 AM, Mich Talebzadeh  wrote:

Hi,

 

 

Unfortunately Oracle table columns defined as NUMBER result in overflow.

 

An alternative seems to be to create a UDF to map that column to Double

 

val toDouble = udf((d: java.math.BigDecimal) => d.toString.toDouble)

 

 

This is the DF I have defined to fetch one column as below from the Oracle table

 

  val s = sqlContext.load("jdbc",

 Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

  "dbtable" -> "(select PROD_ID from sh.sales)",

  "user" -> "sh",

"password" -> "x"))

 

This obviously works

 

scala> s.count

res13: Long = 918843

 

Now the question is how to use that UDF toDouble to read column PROD_ID? Do I 
need to create a temporary table? 

 

 

Thanks

 

Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

 



RE: Joining three tables with data frames

2016-02-14 Thread Mich Talebzadeh
Thanks Jeff,

 

I registered the three data frames as temporary tables and performed the SQL 
query directly on them. I had to convert the oracle NUMBER and NUMBER(n,m) 
columns to TO_CHAR() at the query level to avoid the overflows.

 

I think the fact that we can read data from JDBC databases and use Spark 
in-memory capabilities to do heterogeneous queries involving different tables 
on different databases is potentially very useful 

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 

//

val s = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(AMOUNT_SOLD) AS AMOUNT_SOLD, to_char(TIME_ID) AS 
TIME_ID, to_char(CHANNEL_ID) AS CHANNEL_ID FROM sh.sales)",

"user" -> "sh",

"password" -> "xxx"))

//

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM 
sh.channels)",

"user" -> "sh",

"password" -> "xxx"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM 
sh.times)",

"user" -> "sh",

"password" -> "xxx"))

//

// Register the three data frames as temporary tables

//

s.registerTempTable("t_s")

c.registerTempTable("t_c")

t.registerTempTable("t_t")

//

var sqltext : String = ""

sqltext = """

SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)

FROM

(

SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, 
SUM(t_s.AMOUNT_SOLD) AS TotalSales

FROM t_s, t_t, t_c

WHERE t_s.TIME_ID = t_t.TIME_ID

AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID

GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

) rs

LIMIT 10

"""

sqlContext.sql(sqltext).collect.foreach(println)
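
For comparison, the same query can be expressed with the DataFrame API against the
three frames above (a sketch; the cast is needed because the columns were fetched as
strings via TO_CHAR):

import org.apache.spark.sql.functions.{round, sum}

val rs = s.join(t, s("TIME_ID") === t("TIME_ID"))
          .join(c, s("CHANNEL_ID") === c("CHANNEL_ID"))
          .groupBy(t("CALENDAR_MONTH_DESC"), c("CHANNEL_DESC"))
          .agg(round(sum(s("AMOUNT_SOLD").cast("double")), 2).as("TotalSales"))
          .orderBy("CALENDAR_MONTH_DESC", "CHANNEL_DESC")
rs.show(10)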

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: Jeff Zhang [mailto:zjf...@gmail.com] 
Sent: 14 February 2016 01:39
To: Mich Talebzadeh 
Cc: user@spark.apache.org
Subject: Re: Joining three tables with data frames

 

What do you mean "does not work" ? What's the error message ? BTW would it be 
simpler that register the 3 data frames as temporary table and then use the sql 
query you used before in hive and oracle ?

 

On Sun, Feb 14, 2016 at 9:28 AM, Mich Talebzadeh  wrote:

Hi,

 

I have created DFs on three Oracle tables.

 

The join in Hive and Oracle are pretty simple

 

SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales

FROM sales s, times t, channels c

WHERE s.time_id = t.time_id

AND   s.channel_id = c.channel_id

GROUP BY t.calendar_month_desc, c.channel_desc

;

 

I tried to do this using DataFrames:

 

 

import org.apache.spark.sql.functions._

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

//

val s = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(sh.sales)",

"user" -> "sh",

"password" -> "x"))

//

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(sh.channels)",

"user" -> "sh",

"password" -> "x"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(sh.times)",

"user" -> "sh",

"password" -> "x"))

//

val sc = s.join(c, s.col("CHANNEL_ID") === c.col("CHANNEL_ID"))

val st = s.join(t, s.col("TIME_ID") === t.col("TIME_ID"))

 

val rs = sc.join(st)

 

rs.groupBy($"calendar_month_desc",$"channel_desc").agg(sum($"amount_sold"))

 

The last result set (rs) does not work.

 

Since the data is imported, I assume that the join columns need to be selected in the
data frame for each table rather than importing all the columns.

 

Thanks,

 

 

Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be 

RE: coalesce and executor memory

2016-02-14 Thread Silvio Fiorito
Actually, rereading your email I see you're caching. But cache() uses
MEMORY_ONLY. Do you see errors about losing partitions as your job is running?

Are you sure you need to cache if you're just saving to disk? Can you try the 
coalesce without cache?
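
A sketch of that variant, dropping the cache step entirely (paths, sample fraction and
partition count are illustrative):

val raw = sc.textFile("hdfs:///tmp/input")      // illustrative input
raw.sample(withReplacement = false, 0.1)
   .coalesce(200)                               // each partition must still fit in a core's share of executor memory
   .map(_.toUpperCase)                          // stand-in for the real map
   .saveAsTextFile("hdfs:///tmp/output")

If caching really is needed, persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
avoids the MEMORY_ONLY failure mode by spilling to disk.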


From: Christopher Brady
Sent: Friday, February 12, 2016 8:34 PM
To: Koert Kuipers; Silvio 
Fiorito
Cc: user
Subject: Re: coalesce and executor memory

Thank you for the responses. The map function just changes the format of the 
record slightly, so I don't think that would be the cause of the memory problem.

So if I have 3 cores per executor, I need to be able to fit 3 partitions per 
executor within whatever I specify for the executor memory? Is there a way I 
can programmatically find a number of partitions I can coalesce down to without 
running out of memory? Is there some documentation where this is explained?


On 02/12/2016 05:10 PM, Koert Kuipers wrote:
in spark, every partition needs to fit in the memory available to the core 
processing it.

as you coalesce you reduce number of partitions, increasing partition size. at 
some point the partition no longer fits in memory.

On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito 
> wrote:
Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like?




On 2/12/16, 1:13 PM, "Christopher Brady" 
<christopher.br...@oracle.com>
 wrote:

>Can anyone help me understand why using coalesce causes my executors to
>crash with out of memory? What happens during coalesce that increases
>memory usage so much?
>
>If I do:
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>
>everything works fine, but if I do:
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>
>my executors crash with out of memory exceptions.
>
>Is there any documentation that explains what causes the increased
>memory requirements with coalesce? It seems to be less of a problem if I
>coalesce into a larger number of partitions, but I'm not sure why this
>is. How would I estimate how much additional memory the coalesce requires?
>
>Thanks.
>
>-
>To unsubscribe, e-mail: 
>user-unsubscr...@spark.apache.org
>For additional commands, e-mail:  
>user-h...@spark.apache.org
>




Re: GroupedDataset needs a mapValues

2016-02-14 Thread Andy Davidson
Hi Michael

From:  Michael Armbrust 
Date:  Saturday, February 13, 2016 at 9:31 PM
To:  Koert Kuipers 
Cc:  "user @spark" 
Subject:  Re: GroupedDataset needs a mapValues

> Instead of grouping with a lambda function, you can do it with a column
> expression to avoid materializing an unnecessary tuple:
> 
> df.groupBy($"_1")


I am unfamiliar with this notation. Is there something similar for Java and
python?

Kind regards

Andy


> 
> Regarding the mapValues, you can do something similar using an Aggregator
>  20Aggregator.html> , but I agree that we should consider something lighter
> weight like the mapValues you propose.
> 
> On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers  wrote:
>> i have a Dataset[(K, V)]
>> i would like to group by k and then reduce V using a function (V, V) => V
>> how do i do this?
>> 
>> i would expect something like:
>> val ds = Dataset[(K, V)] ds.groupBy(_._1).mapValues(_._2).reduce(f)
>> or better:
>> ds.grouped.reduce(f)  # grouped only works on Dataset[(_, _)] and i dont care
>> about java api
>> 
>> but there is no mapValues or grouped. ds.groupBy(_._1) gives me a
>> GroupedDataset[(K, (K, V))] which is inconvenient. i could carry the key
>> through the reduce operation but that seems ugly and inefficient.
>> 
>> any thoughts?
>> 
>> 
> 




Re: Imported CSV file content isn't identical to the original file

2016-02-14 Thread SLiZn Liu
This error message no longer appears now that I have upgraded to 1.6.0.

--
Cheers,
Todd Leo

On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu  wrote:

> At least works for me though, temporarily disabled Kyro serilizer until
> upgrade to 1.6.0. Appreciate for your update. :)
> Luciano Resende 于2016年2月9日 周二02:37写道:
>
>> Sorry, same expected results with trunk and Kryo serializer
>>
>> On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu  wrote:
>>
>>> I’ve found the trigger of my issue: if I start my spark-shell or submit
>>> by spark-submit with --conf
>>> spark.serializer=org.apache.spark.serializer.KryoSerializer, the
>>> DataFrame content goes wrong, as I described earlier.
>>> ​
>>>
>>> On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu  wrote:
>>>
 Thanks Luciano, now it looks like I'm the only one who has this issue.
 My options are narrowed down to upgrading my Spark to 1.6.0, to see if this
 issue is gone.

 —
 Cheers,
 Todd Leo


 ​
 On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende 
 wrote:

> I tried in both 1.5.0, 1.6.0 and 2.0.0 trunk and
> com.databricks:spark-csv_2.10:1.3.0 with expected results, where the
> columns seem to be read properly.
>
>  +--+--+
> |C0|C1|
> +--+--+
>
> |1446566430 | 2015-11-0400:00:30|
> |1446566430 | 2015-11-0400:00:30|
> |1446566430 | 2015-11-0400:00:30|
> |1446566430 | 2015-11-0400:00:30|
> |1446566430 | 2015-11-0400:00:30|
> |1446566431 | 2015-11-0400:00:31|
> |1446566431 | 2015-11-0400:00:31|
> |1446566431 | 2015-11-0400:00:31|
> |1446566431 | 2015-11-0400:00:31|
> |1446566431 | 2015-11-0400:00:31|
> +--+--+
>
>
>
>
> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu 
> wrote:
>
>> Hi Spark Users Group,
>>
>> I have a csv file to analysis with Spark, but I’m troubling with
>> importing as DataFrame.
>>
>> Here’s the minimal reproducible example. Suppose I’m having a
>> *10(rows)x2(cols)* *space-delimited csv* file, shown as below:
>>
>> 1446566430 2015-11-0400:00:30
>> 1446566430 2015-11-0400:00:30
>> 1446566430 2015-11-0400:00:30
>> 1446566430 2015-11-0400:00:30
>> 1446566430 2015-11-0400:00:30
>> 1446566431 2015-11-0400:00:31
>> 1446566431 2015-11-0400:00:31
>> 1446566431 2015-11-0400:00:31
>> 1446566431 2015-11-0400:00:31
>> 1446566431 2015-11-0400:00:31
>>
>> the  in column 2 represents sub-delimiter within that column,
>> and this file is stored on HDFS, let’s say the path is
>> hdfs:///tmp/1.csv
>>
>> I’m using *spark-csv* to import this file as Spark *DataFrame*:
>>
>> sqlContext.read.format("com.databricks.spark.csv")
>> .option("header", "false") // Use first line of all files as 
>> header
>> .option("inferSchema", "false") // Automatically infer data types
>> .option("delimiter", " ")
>> .load("hdfs:///tmp/1.csv")
>> .show
>>
>> Oddly, the output shows only a part of each column:
>>
>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>
>> and even the boundary of the table wasn’t shown correctly. I also
>> used the other way to read csv file, by sc.textFile(...).map(_.split("
>> ")) and sqlContext.createDataFrame, and the result is the same. Can
>> someone point me out where I did it wrong?
>>
>> —
>> BR,
>> Todd Leo
>> ​
>>
>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>

>>
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>


Re: Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-14 Thread Yuval.Itzchakov
Your question lacks sufficient information for us to actually provide help.
Have you looked at the Spark UI to see which part of the graph is taking the
longest? Have you tried logging your methods?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jobs-run-extremely-slow-on-yarn-cluster-compared-to-standalone-spark-tp26215p26221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using explain plan to optimize sql query

2016-02-14 Thread Mr rty ff
Hi,
I have some queries that take a long time to execute, so I used df.explain(true)
to print the physical and logical plans and see where the bottlenecks are. As the
query is very complicated, I got a very unreadable result. How can I turn it into
something more readable and analyze it? And another question: when analyzing an
execution plan, what should I pay attention to?
Thanks
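
One practical way to keep the output readable is to explain the query piece by piece,
materializing intermediate DataFrames so that each printed plan stays short; a toy
sketch (all names illustrative):

import org.apache.spark.sql.functions.sum
import sqlContext.implicits._

val orders    = sc.parallelize(Seq((1, 10.0), (2, 5.0))).toDF("customer_id", "amount")
val customers = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("customer_id", "name")

val joined = orders.join(customers, "customer_id")
joined.explain(true)                                   // plan of the join alone

val aggregated = joined.groupBy("name").agg(sum("amount"))
aggregated.explain(true)                               // plan including the aggregation

As for what to pay attention to: generally the scan nodes in the physical plan (are
filters pushed down, or are whole tables being read?) and the shuffle boundaries
(Exchange operators) are the first places to check.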