How to kill the spark job using Java API.

2015-11-20 Thread Hokam Singh Chauhan
Hi,

I have been running a Spark job on a standalone Spark cluster. I want to
kill the Spark job using a Java API. I have the Spark job name and Spark
job id.

The REST POST call for killing the job is not working.

If anyone has explored this, please help me out.
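
For reference, a minimal sketch of the standalone master's REST kill call,
assuming the job was submitted in cluster mode through the REST gateway
(default port 6066); the host and submission id below are placeholders:

# Kill a driver submitted through the standalone REST gateway (cluster mode only).
curl -X POST http://master-host:6066/v1/submissions/kill/driver-20151120123456-0000

# spark-submit wraps the same call:
spark-submit --master spark://master-host:6066 --kill driver-20151120123456-0000

From Java, the same POST can be issued with any HTTP client. Note that the
gateway only knows about drivers submitted in cluster mode; applications
launched in client mode are not visible to it, which may be why the POST
appears to do nothing.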

-- 
Thanks and Regards,
Hokam Singh Chauhan
Mobile : 09407125190


Error in Saving the MLlib models

2015-11-20 Thread hokam chauhan
Hi,

I am exploring MLlib. I took the MLlib examples and tried to train an SVM
model. I get an exception when saving the trained model. When I run the code
in local mode it works fine, but when I run the MLlib example in standalone
cluster mode it fails to save the model.

The line of code below throws the exception when saving the SVM model in
standalone cluster mode:
model.save(sc, "/data/mymodel");

If anyone has tried this scenario, please help me out.
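
A minimal sketch of the save step, assuming the usual cause of this failure:
in standalone cluster mode the model is written out by the executors in
parallel, so the target path must be on storage every node can reach (HDFS,
S3, NFS), not a node-local directory. The HDFS URI and trainingData below
are placeholders:

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}

// trainingData is an RDD[LabeledPoint], as in the MLlib example.
val model = SVMWithSGD.train(trainingData, 100)   // 100 iterations

// Save to a path that all executors can write to (placeholder URI).
model.save(sc, "hdfs://namenode:9000/data/mymodel")

// Reload later with the companion object.
val sameModel = SVMModel.load(sc, "hdfs://namenode:9000/data/mymodel")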

Thanks,
Hokam 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-Saving-the-MLlib-models-tp25443.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread varun sharma
I do this in my stop script to kill the application: kill -s SIGTERM `pgrep
-f StreamingApp`
To stop it forcefully: pkill -9 -f "StreamingApp"
StreamingApp is the name of the class I submitted.

I also have a shutdown hook thread to stop it gracefully.

sys.ShutdownHookThread {
  logInfo("Gracefully stopping StreamingApp")
  // stop the underlying SparkContext too (first arg) and let in-flight
  // batches finish before shutting down (second arg)
  ssc.stop(true, true)
  logInfo("StreamingApp stopped")
}

I am also not able to kill the application from the Spark UI.


On Sat, Nov 21, 2015 at 11:32 AM, Vikram Kone  wrote:

> I tried adding shutdown hook to my code but it didn't help. Still same
> issue
>
>
> On Fri, Nov 20, 2015 at 7:08 PM, Ted Yu  wrote:
>
>> Which Spark release are you using ?
>>
>> Can you pastebin the stack trace of the process running on your machine ?
>>
>> Thanks
>>
>> On Nov 20, 2015, at 6:46 PM, Vikram Kone  wrote:
>>
>> Hi,
>> I'm seeing a strange problem. I have a spark cluster in standalone mode.
>> I submit spark jobs from a remote node as follows from the terminal
>>
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>>
>> when the app is running , when I press ctrl-C on the console terminal,
>> then the process is killed and so is the app in the spark master UI. When I
>> go to spark master ui, i see that this app is in state Killed under
>> Completed applications, which is what I expected to see.
>>
>> Now, I created a shell script as follows to do the same
>>
>> #!/bin/bash
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>> echo $! > my.pid
>>
>> When I execute the shell script from terminal, as follows
>>
>> $> bash myscript.sh
>>
>> The application is submitted correctly to spark master and I can see it
>> as one of the running apps in teh spark master ui. But when I kill the
>> process in my terminal as follows
>>
>> $> ps kill $(cat my.pid)
>>
>> I see that the process is killed on my machine but the spark appliation
>> is still running in spark master! It doesn't get killed.
>>
>> I noticed one more thing that, when I launch the spark job via shell
>> script and kill the application from spark master UI by clicking on "kill"
>> next to the running application, it gets killed in spark ui but I still see
>> the process running in my machine.
>>
>> In both cases, I would expect the remote spark app to be killed and my
>> local process to be killed.
>>
>> Why is this happening? and how can I kill a spark app from the terminal
>> launced via shell script w.o going to the spark master UI?
>>
>> I want to launch the spark app via script and log the pid so i can
>> monitor it remotely
>>
>> thanks for the help
>>
>>
>


-- 
*VARUN SHARMA*
*Flipkart*
*Bangalore*


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
I tried adding a shutdown hook to my code but it didn't help. Still the same issue.


On Fri, Nov 20, 2015 at 7:08 PM, Ted Yu  wrote:

> Which Spark release are you using ?
>
> Can you pastebin the stack trace of the process running on your machine ?
>
> Thanks
>
> On Nov 20, 2015, at 6:46 PM, Vikram Kone  wrote:
>
> Hi,
> I'm seeing a strange problem. I have a spark cluster in standalone mode. I
> submit spark jobs from a remote node as follows from the terminal
>
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
>
> when the app is running , when I press ctrl-C on the console terminal,
> then the process is killed and so is the app in the spark master UI. When I
> go to spark master ui, i see that this app is in state Killed under
> Completed applications, which is what I expected to see.
>
> Now, I created a shell script as follows to do the same
>
> #!/bin/bash
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
> echo $! > my.pid
>
> When I execute the shell script from terminal, as follows
>
> $> bash myscript.sh
>
> The application is submitted correctly to spark master and I can see it as
> one of the running apps in teh spark master ui. But when I kill the process
> in my terminal as follows
>
> $> ps kill $(cat my.pid)
>
> I see that the process is killed on my machine but the spark appliation is
> still running in spark master! It doesn't get killed.
>
> I noticed one more thing that, when I launch the spark job via shell
> script and kill the application from spark master UI by clicking on "kill"
> next to the running application, it gets killed in spark ui but I still see
> the process running in my machine.
>
> In both cases, I would expect the remote spark app to be killed and my
> local process to be killed.
>
> Why is this happening? and how can I kill a spark app from the terminal
> launced via shell script w.o going to the spark master UI?
>
> I want to launch the spark app via script and log the pid so i can monitor
> it remotely
>
> thanks for the help
>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Ted Yu
Interesting. SPARK-3090 installs a shutdown hook for stopping the SparkContext.

FYI

On Fri, Nov 20, 2015 at 7:12 PM, Stéphane Verlet 
wrote:

> I solved the first issue by adding a shutdown hook in my code. The
> shutdown hook get call when you exit your script (ctrl-C , kill … but nor
> kill -9)
>
> val shutdownHook = scala.sys.addShutdownHook {
> try {
>
> sparkContext.stop()
> //Make sure to kill any other threads or thread pool you may be running
>   }
>   catch {
> case e: Exception =>
>   {
> ...
>
>   }
>   }
>
> }
>
> For the other issue , kill from the UI. I also had the issue. This was
> caused by a thread pool that I use.
>
> So I surrounded my code with try/finally block to guarantee that the
> thread pool was shutdown when spark stopped
>
> I hopes this help
>
> Stephane
> ​
>
> On Fri, Nov 20, 2015 at 7:46 PM, Vikram Kone  wrote:
>
>> Hi,
>> I'm seeing a strange problem. I have a spark cluster in standalone mode.
>> I submit spark jobs from a remote node as follows from the terminal
>>
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>>
>> when the app is running , when I press ctrl-C on the console terminal,
>> then the process is killed and so is the app in the spark master UI. When I
>> go to spark master ui, i see that this app is in state Killed under
>> Completed applications, which is what I expected to see.
>>
>> Now, I created a shell script as follows to do the same
>>
>> #!/bin/bash
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>> echo $! > my.pid
>>
>> When I execute the shell script from terminal, as follows
>>
>> $> bash myscript.sh
>>
>> The application is submitted correctly to spark master and I can see it
>> as one of the running apps in teh spark master ui. But when I kill the
>> process in my terminal as follows
>>
>> $> ps kill $(cat my.pid)
>>
>> I see that the process is killed on my machine but the spark appliation
>> is still running in spark master! It doesn't get killed.
>>
>> I noticed one more thing that, when I launch the spark job via shell
>> script and kill the application from spark master UI by clicking on "kill"
>> next to the running application, it gets killed in spark ui but I still see
>> the process running in my machine.
>>
>> In both cases, I would expect the remote spark app to be killed and my
>> local process to be killed.
>>
>> Why is this happening? and how can I kill a spark app from the terminal
>> launced via shell script w.o going to the spark master UI?
>>
>> I want to launch the spark app via script and log the pid so i can
>> monitor it remotely
>>
>> thanks for the help
>>
>>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I am not sure; I think it has to do with the signal sent to the process
and how the JVM handles it.

Ctrl-C sends a SIGINT, whereas the kill command sends a SIGTERM by default.
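
To make that concrete, a quick sketch from the shell, assuming my.pid holds
the PID of the spark-submit process:

kill -INT  $(cat my.pid)   # SIGINT, the same signal Ctrl-C delivers
kill -TERM $(cat my.pid)   # SIGTERM, what a plain `kill` sends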



On Fri, Nov 20, 2015 at 8:21 PM, Vikram Kone  wrote:

> Thanks for the info Stephane.
> Why does CTRL-C in the terminal running spark-submit kills the app in
> spark master correctly w/o any explicit shutdown hooks in the code? Can you
> explain why we need to add the shutdown hook to kill it when launched via a
> shell script ?
> For the second issue, I'm not using any thread pool. So not sure why
> killing the app in spark UI doesn't kill the process launched via script
>
>
> On Friday, November 20, 2015, Stéphane Verlet 
> wrote:
>
>> I solved the first issue by adding a shutdown hook in my code. The
>> shutdown hook get call when you exit your script (ctrl-C , kill … but nor
>> kill -9)
>>
>> val shutdownHook = scala.sys.addShutdownHook {
>> try {
>>
>> sparkContext.stop()
>> //Make sure to kill any other threads or thread pool you may be running
>>   }
>>   catch {
>> case e: Exception =>
>>   {
>> ...
>>
>>   }
>>   }
>>
>> }
>>
>> For the other issue , kill from the UI. I also had the issue. This was
>> caused by a thread pool that I use.
>>
>> So I surrounded my code with try/finally block to guarantee that the
>> thread pool was shutdown when spark stopped
>>
>> I hopes this help
>>
>> Stephane
>> ​
>>
>> On Fri, Nov 20, 2015 at 7:46 PM, Vikram Kone 
>> wrote:
>>
>>> Hi,
>>> I'm seeing a strange problem. I have a spark cluster in standalone mode.
>>> I submit spark jobs from a remote node as follows from the terminal
>>>
>>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>>> spark-jobs.jar
>>>
>>> when the app is running , when I press ctrl-C on the console terminal,
>>> then the process is killed and so is the app in the spark master UI. When I
>>> go to spark master ui, i see that this app is in state Killed under
>>> Completed applications, which is what I expected to see.
>>>
>>> Now, I created a shell script as follows to do the same
>>>
>>> #!/bin/bash
>>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>>> spark-jobs.jar
>>> echo $! > my.pid
>>>
>>> When I execute the shell script from terminal, as follows
>>>
>>> $> bash myscript.sh
>>>
>>> The application is submitted correctly to spark master and I can see it
>>> as one of the running apps in teh spark master ui. But when I kill the
>>> process in my terminal as follows
>>>
>>> $> ps kill $(cat my.pid)
>>>
>>> I see that the process is killed on my machine but the spark appliation
>>> is still running in spark master! It doesn't get killed.
>>>
>>> I noticed one more thing that, when I launch the spark job via shell
>>> script and kill the application from spark master UI by clicking on "kill"
>>> next to the running application, it gets killed in spark ui but I still see
>>> the process running in my machine.
>>>
>>> In both cases, I would expect the remote spark app to be killed and my
>>> local process to be killed.
>>>
>>> Why is this happening? and how can I kill a spark app from the terminal
>>> launced via shell script w.o going to the spark master UI?
>>>
>>> I want to launch the spark app via script and log the pid so i can
>>> monitor it remotely
>>>
>>> thanks for the help
>>>
>>>
>>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Thanks for the info, Stephane.
Why does Ctrl-C in the terminal running spark-submit kill the app in the
Spark master correctly without any explicit shutdown hooks in the code? Can
you explain why we need to add the shutdown hook to kill it when launched
via a shell script?
For the second issue, I'm not using any thread pool, so I'm not sure why
killing the app in the Spark UI doesn't kill the process launched via the script.

On Friday, November 20, 2015, Stéphane Verlet 
wrote:

> I solved the first issue by adding a shutdown hook in my code. The
> shutdown hook get call when you exit your script (ctrl-C , kill … but nor
> kill -9)
>
> val shutdownHook = scala.sys.addShutdownHook {
> try {
>
> sparkContext.stop()
> //Make sure to kill any other threads or thread pool you may be running
>   }
>   catch {
> case e: Exception =>
>   {
> ...
>
>   }
>   }
>
> }
>
> For the other issue , kill from the UI. I also had the issue. This was
> caused by a thread pool that I use.
>
> So I surrounded my code with try/finally block to guarantee that the
> thread pool was shutdown when spark stopped
>
> I hopes this help
>
> Stephane
> ​
>
> On Fri, Nov 20, 2015 at 7:46 PM, Vikram Kone  > wrote:
>
>> Hi,
>> I'm seeing a strange problem. I have a spark cluster in standalone mode.
>> I submit spark jobs from a remote node as follows from the terminal
>>
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>>
>> when the app is running , when I press ctrl-C on the console terminal,
>> then the process is killed and so is the app in the spark master UI. When I
>> go to spark master ui, i see that this app is in state Killed under
>> Completed applications, which is what I expected to see.
>>
>> Now, I created a shell script as follows to do the same
>>
>> #!/bin/bash
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>> echo $! > my.pid
>>
>> When I execute the shell script from terminal, as follows
>>
>> $> bash myscript.sh
>>
>> The application is submitted correctly to spark master and I can see it
>> as one of the running apps in teh spark master ui. But when I kill the
>> process in my terminal as follows
>>
>> $> ps kill $(cat my.pid)
>>
>> I see that the process is killed on my machine but the spark appliation
>> is still running in spark master! It doesn't get killed.
>>
>> I noticed one more thing that, when I launch the spark job via shell
>> script and kill the application from spark master UI by clicking on "kill"
>> next to the running application, it gets killed in spark ui but I still see
>> the process running in my machine.
>>
>> In both cases, I would expect the remote spark app to be killed and my
>> local process to be killed.
>>
>> Why is this happening? and how can I kill a spark app from the terminal
>> launced via shell script w.o going to the spark master UI?
>>
>> I want to launch the spark app via script and log the pid so i can
>> monitor it remotely
>>
>> thanks for the help
>>
>>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Stéphane Verlet
I solved the first issue by adding a shutdown hook in my code. The shutdown
hook gets called when you exit your script (Ctrl-C, kill … but not kill -9).

val shutdownHook = scala.sys.addShutdownHook {
  try {
    sparkContext.stop()
    // Make sure to kill any other threads or thread pool you may be running
  } catch {
    case e: Exception =>
      ...
  }
}

For the other issue, kill from the UI: I also had that issue. It was caused
by a thread pool that I use.

So I surrounded my code with a try/finally block to guarantee that the
thread pool was shut down when Spark stopped.
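
A minimal sketch of that pattern, with a hypothetical pool; the point is
only the try/finally around the job body:

import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

val pool: ExecutorService = Executors.newFixedThreadPool(4)   // hypothetical pool
try {
  // ... submit tasks to the pool and run the Spark job ...
} finally {
  // Leftover non-daemon threads keep the JVM (and the local process) alive
  // even after the application is killed from the master UI.
  pool.shutdown()
  pool.awaitTermination(30, TimeUnit.SECONDS)
}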

I hope this helps.

Stephane
​

On Fri, Nov 20, 2015 at 7:46 PM, Vikram Kone  wrote:

> Hi,
> I'm seeing a strange problem. I have a spark cluster in standalone mode. I
> submit spark jobs from a remote node as follows from the terminal
>
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
>
> when the app is running , when I press ctrl-C on the console terminal,
> then the process is killed and so is the app in the spark master UI. When I
> go to spark master ui, i see that this app is in state Killed under
> Completed applications, which is what I expected to see.
>
> Now, I created a shell script as follows to do the same
>
> #!/bin/bash
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
> echo $! > my.pid
>
> When I execute the shell script from terminal, as follows
>
> $> bash myscript.sh
>
> The application is submitted correctly to spark master and I can see it as
> one of the running apps in teh spark master ui. But when I kill the process
> in my terminal as follows
>
> $> ps kill $(cat my.pid)
>
> I see that the process is killed on my machine but the spark appliation is
> still running in spark master! It doesn't get killed.
>
> I noticed one more thing that, when I launch the spark job via shell
> script and kill the application from spark master UI by clicking on "kill"
> next to the running application, it gets killed in spark ui but I still see
> the process running in my machine.
>
> In both cases, I would expect the remote spark app to be killed and my
> local process to be killed.
>
> Why is this happening? and how can I kill a spark app from the terminal
> launced via shell script w.o going to the spark master UI?
>
> I want to launch the spark app via script and log the pid so i can monitor
> it remotely
>
> thanks for the help
>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Spark 1.4.1

On Friday, November 20, 2015, Ted Yu  wrote:

> Which Spark release are you using ?
>
> Can you pastebin the stack trace of the process running on your machine ?
>
> Thanks
>
> On Nov 20, 2015, at 6:46 PM, Vikram Kone  > wrote:
>
> Hi,
> I'm seeing a strange problem. I have a spark cluster in standalone mode. I
> submit spark jobs from a remote node as follows from the terminal
>
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
>
> when the app is running , when I press ctrl-C on the console terminal,
> then the process is killed and so is the app in the spark master UI. When I
> go to spark master ui, i see that this app is in state Killed under
> Completed applications, which is what I expected to see.
>
> Now, I created a shell script as follows to do the same
>
> #!/bin/bash
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
> echo $! > my.pid
>
> When I execute the shell script from terminal, as follows
>
> $> bash myscript.sh
>
> The application is submitted correctly to spark master and I can see it as
> one of the running apps in teh spark master ui. But when I kill the process
> in my terminal as follows
>
> $> ps kill $(cat my.pid)
>
> I see that the process is killed on my machine but the spark appliation is
> still running in spark master! It doesn't get killed.
>
> I noticed one more thing that, when I launch the spark job via shell
> script and kill the application from spark master UI by clicking on "kill"
> next to the running application, it gets killed in spark ui but I still see
> the process running in my machine.
>
> In both cases, I would expect the remote spark app to be killed and my
> local process to be killed.
>
> Why is this happening? and how can I kill a spark app from the terminal
> launced via shell script w.o going to the spark master UI?
>
> I want to launch the spark app via script and log the pid so i can monitor
> it remotely
>
> thanks for the help
>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Ted Yu
Which Spark release are you using ?

Can you pastebin the stack trace of the process running on your machine ?

Thanks

> On Nov 20, 2015, at 6:46 PM, Vikram Kone  wrote:
> 
> Hi,
> I'm seeing a strange problem. I have a spark cluster in standalone mode. I 
> submit spark jobs from a remote node as follows from the terminal
> 
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping 
> spark-jobs.jar
> 
> when the app is running , when I press ctrl-C on the console terminal, then 
> the process is killed and so is the app in the spark master UI. When I go to 
> spark master ui, i see that this app is in state Killed under Completed 
> applications, which is what I expected to see.
> 
> Now, I created a shell script as follows to do the same
> 
> #!/bin/bash
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping 
> spark-jobs.jar
> echo $! > my.pid
> 
> When I execute the shell script from terminal, as follows
> 
> $> bash myscript.sh
> 
> The application is submitted correctly to spark master and I can see it as 
> one of the running apps in teh spark master ui. But when I kill the process 
> in my terminal as follows
> 
> $> ps kill $(cat my.pid)
> 
> I see that the process is killed on my machine but the spark appliation is 
> still running in spark master! It doesn't get killed.
> 
> I noticed one more thing that, when I launch the spark job via shell script 
> and kill the application from spark master UI by clicking on "kill" next to 
> the running application, it gets killed in spark ui but I still see the 
> process running in my machine. 
> 
> In both cases, I would expect the remote spark app to be killed and my local 
> process to be killed.
> 
> Why is this happening? and how can I kill a spark app from the terminal 
> launced via shell script w.o going to the spark master UI?
> 
> I want to launch the spark app via script and log the pid so i can monitor it 
> remotely
> 
> thanks for the help
> 


Initial State

2015-11-20 Thread Bryan
All,

Is there a way to introduce an initial RDD without doing updateStateByKey? I 
have an initial set of counts, and the algorithm I am using requires that I 
accumulate additional counts from streaming data, age off older counts, and 
make some calculations on them. The accumulation of counts uses 
reduceByKeyAndWindow. Is there another method to seed in the initial counts 
beyond updateStateByKey?
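
For context, a minimal sketch of the accumulate-and-age-off step described
above, with placeholder types, paths, and window durations; the inverse-reduce
form of reduceByKeyAndWindow requires checkpointing:

import org.apache.spark.streaming.Seconds

// counts is a DStream[(String, Long)] of per-batch counts (placeholder).
ssc.checkpoint("hdfs://namenode:9000/checkpoints")   // required for the inverse reduce
val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // counts entering the window
  (a: Long, b: Long) => a - b,   // counts aged off as they leave the window
  Seconds(600),                  // window length
  Seconds(60))                   // slide interval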

Regards,

Bryan Jeffrey

How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Hi,
I'm seeing a strange problem. I have a Spark cluster in standalone mode. I
submit Spark jobs from a remote node as follows from the terminal:

spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
spark-jobs.jar

When the app is running and I press Ctrl-C in the console terminal, the
process is killed and so is the app in the Spark master UI. When I go to the
Spark master UI, I see that this app is in state Killed under Completed
applications, which is what I expected to see.

Now, I created a shell script as follows to do the same

#!/bin/bash
spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
spark-jobs.jar
echo $! > my.pid

When I execute the shell script from terminal, as follows

$> bash myscript.sh

The application is submitted correctly to the Spark master and I can see it
as one of the running apps in the Spark master UI. But when I kill the
process in my terminal as follows

$> ps kill $(cat my.pid)

I see that the process is killed on my machine but the Spark application is
still running in the Spark master! It doesn't get killed.

One more thing I noticed: when I launch the Spark job via the shell script
and kill the application from the Spark master UI by clicking on "kill" next
to the running application, it gets killed in the Spark UI but I still see
the process running on my machine.

In both cases, I would expect the remote spark app to be killed and my
local process to be killed.

Why is this happening? And how can I kill a Spark app launched via a shell
script from the terminal, without going to the Spark master UI?

I want to launch the Spark app via a script and log the PID so I can monitor
it remotely.

thanks for the help
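
For what it's worth, a minimal sketch of the launch script with the PID
captured explicitly. As written above, spark-submit runs in the foreground,
so `$!` (the PID of the last backgrounded command) never points at it, and
`ps kill ...` is not a kill command; backgrounding the submit and using a
plain kill is the usual pattern:

#!/bin/bash
# Background spark-submit so $! actually refers to it, then record the PID.
spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar &
echo $! > my.pid
wait   # optional: keep the script attached to the job

# Later, from another shell:
#   kill $(cat my.pid)        # SIGTERM
#   kill -INT $(cat my.pid)   # SIGINT, same as Ctrl-C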


Re: updateStateByKey schedule time

2015-11-20 Thread Tathagata Das
For future readers of this thread, Spark 1.6 adds trackStateByKey that has
native support for timeouts.

On Tue, Jul 21, 2015 at 12:00 AM, Anand Nalya  wrote:

> I also ran into a similar use case. Is this possible?
>
> On 15 July 2015 at 18:12, Michel Hubert  wrote:
>
>> Hi,
>>
>>
>>
>>
>>
>> I want to implement a time-out mechanism in de updateStateByKey(…)
>> routine.
>>
>>
>>
>> But is there a way the retrieve the time of the start of the batch
>> corresponding to the call to my updateStateByKey routines?
>>
>>
>>
>> Suppose the streaming has build up some delay then a 
>> System.currentTimeMillis()
>> will not be the time of the time the batch was scheduled.
>>
>>
>>
>> I want to retrieve the job/task schedule time of the batch for which my 
>> updateStateByKey(..)
>> routine is called.
>>
>>
>>
>> Is this possible?
>>
>>
>>
>> With kind regards,
>>
>> Michel Hubert
>>
>>
>>
>>
>>
>
>


Re: Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread Tathagata Das
Good question.

Write-ahead logs are used to do both: write data and write metadata, when
needed. When the data WAL is enabled using the conf
spark.streaming.receiver.writeAheadLog.enable, data received by receivers is
written to the data WAL by the executors. In some cases, like Direct Kafka
and Kinesis, the record identifiers are written to another metadata WAL by
the driver. The metadata WAL is always enabled.

All WALs take care of their own cleanup. Spark Streaming knows when things
can be cleaned up (based on what window ops, etc. are used in your program),
so you don't have to worry about cleaning up.

On Fri, Nov 20, 2015 at 12:26 PM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> If write ahead logs are enabled in spark streaming does all the received
> data gets written to HDFS path   ? or it only writes the metadata.
> How does clean up works , does HDFS path gets bigger and bigger  up
> everyday
> do I need to write an clean up job to delete data from  write ahead logs
> folder ?
> what actually does write ahead log folder has ?
>
> Thanks
> Sri
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-streaming-write-ahead-log-writes-all-received-data-to-HDFS-tp25439.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Creating new Spark context when running in Secure YARN fails

2015-11-20 Thread Hari Shreedharan
Can you try this: https://github.com/apache/spark/pull/9875. I believe this
patch should fix the issue here.

Thanks,
Hari Shreedharan




> On Nov 11, 2015, at 1:59 PM, Ted Yu  wrote:
> 
> Please take a look at 
> yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala
>  where this config is described
> 
> Cheers
> 
> On Wed, Nov 11, 2015 at 1:45 PM, Michael V Le  > wrote:
> It looks like my config does not have "spark.yarn.credentials.file".
> 
> I executed:
> sc._conf.getAll()
> 
> [(u'spark.ssl.keyStore', u'xxx.keystore'), (u'spark.eventLog.enabled', 
> u'true'), (u'spark.ssl.keyStorePassword', u'XXX'), (u'spark.yarn.principal', 
> u'XXX'), (u'spark.master', u'yarn-client'), (u'spark.ssl.keyPassword', 
> u'XXX'), (u'spark.authenticate.sasl.serverAlwaysEncrypt', u'true'), 
> (u'spark.ssl.trustStorePassword', u'XXX'), (u'spark.ssl.protocol', 
> u'TLSv1.2'), (u'spark.authenticate.enableSaslEncryption', u'true'), 
> (u'spark.app.name ', u'PySparkShell'), 
> (u'spark.yarn.keytab', u'XXX.keytab'), (u'spark.yarn.historyServer.address', 
> u'xxx-001:18080'), (u'spark.rdd.compress', u'True'), (u'spark.eventLog.dir', 
> u'hdfs://xxx-001:9000/user/hadoop/sparklogs'), 
> (u'spark.ssl.enabledAlgorithms', 
> u'TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA'), 
> (u'spark.serializer.objectStreamReset', u'100'), 
> (u'spark.history.fs.logDirectory', 
> u'hdfs://xxx-001:9000/user/hadoop/sparklogs'), (u'spark.yarn.isPython', 
> u'true'), (u'spark.submit.deployMode', u'client'), (u'spark.ssl.enabled', 
> u'true'), (u'spark.authenticate', u'true'), (u'spark.ssl.trustStore', 
> u'xxx.truststore')]
> 
> I am not really familiar with "spark.yarn.credentials.file" and had thought 
> it was created automatically after communicating with YARN to get tokens.
> 
> Thanks,
> Mike
> 
> 
> Ted Yu ---11/11/2015 03:35:41 PM---I assume your config contains 
> "spark.yarn.credentials.file" - otherwise startExecutorDelegationToken
> 
> From: Ted Yu mailto:yuzhih...@gmail.com>>
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user mailto:user@spark.apache.org>>
> Date: 11/11/2015 03:35 PM
> Subject: Re: Creating new Spark context when running in Secure YARN fails
> 
> 
> 
> 
> I assume your config contains "spark.yarn.credentials.file" - otherwise 
> startExecutorDelegationTokenRenewer(conf) call would be skipped.
> 
> On Wed, Nov 11, 2015 at 12:16 PM, Michael V Le  > wrote:
> Hi Ted,
> 
> Thanks for reply.
> 
> I tried your patch but am having the same problem.
> 
> I ran:
> 
> ./bin/pyspark --master yarn-client
> 
> >> sc.stop()
> >> sc = SparkContext()
> 
> Same error dump as below.
> 
> Do I need to pass something to the new sparkcontext ?
> 
> Thanks,
> Mike
> 
> Ted Yu ---11/11/2015 01:55:02 PM---Looks like the delegation 
> token should be renewed. Mind trying the following ?
> 
> From: Ted Yu mailto:yuzhih...@gmail.com>>
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user mailto:user@spark.apache.org>>
> Date: 11/11/2015 01:55 PM
> Subject: Re: Creating new Spark context when running in Secure YARN fails
> 
> 
> 
> 
> Looks like the delegation token should be renewed.
> 
> Mind trying the following ?
> 
> Thanks
> 
> diff --git 
> a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
>  b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerB
> index 20771f6..e3c4a5a 100644
> --- 
> a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> +++ 
> b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> @@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
>  logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
>  val args = new ClientArguments(argsArrayBuf.toArray, conf)
>  totalExpectedExecutors = args.numExecutors
> +// SPARK-8851: In yarn-client mode, the AM still does the credentials 
> refresh. The driver
> +// reads the credentials from HDFS, just like the executors and updates 
> its own credentials
> +// cache.
> +if (conf.contains("spark.yarn.credentials.file")) {
> +  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
> +}
>  client = new Client(args, conf)
>  appId = client.submitApplication()
> 
> @@ -63,12 +69,6 @@ private[spark] class YarnClientSchedulerBackend(
> 
>  waitForApplication()
> 
> -// SPARK-8851: In yarn-client mode, the AM still does the credentials 
> refresh. The driver
> -// reads the credentials from HDFS, just like the executors and updates 
> its own credentials
> -// cache.
> -if (conf.contains("spark.yarn.credentials.file")) {
> -  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
> -}
>  monitorThread = asyncMonitorApplication()
>  monitorThread.start()
>}
> 
> On Wed, Nov 11, 2015 at 10

Re: How to run two operations on the same RDD simultaneously

2015-11-20 Thread Ali Tajeldin EDU
You can try to use an Accumulator 
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator)
 to keep count in map1.  Note that the final count may be higher than the 
number of records if there were some retries along the way.
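
A minimal sketch of that approach, assuming rdd is an RDD[String] and the
map body and output path are placeholders:

// May over-count if tasks are retried, as noted above.
val recordCount = sc.accumulator(0, "records seen in map1")

val mapped = rdd.map { record =>
  recordCount += 1         // counted as a side effect of map1, no extra pass
  record.toUpperCase       // placeholder for the real map1 logic
}

// Accumulator values are only populated after an action has run.
mapped.saveAsTextFile("hdfs://namenode:9000/out")   // placeholder output
println(s"records seen: ${recordCount.value}")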
--
Ali

On Nov 20, 2015, at 3:38 PM, jluan  wrote:

> As far as I understand, operations on rdd's usually come in the form
> 
> rdd => map1 => map2 => map2 => (maybe collect)
> 
> If I would like to also count my RDD, is there any way I could include this
> at map1? So that as spark runs through map1, it also does a count? Or would
> count need to be a separate operation such that I would have to run through
> my dataset again. My dataset is really memory intensive so I'd rather not
> cache() it if possible.
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-two-operations-on-the-same-RDD-simultaneously-tp25441.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



How to run two operations on the same RDD simultaneously

2015-11-20 Thread jluan
As far as I understand, operations on rdd's usually come in the form

rdd => map1 => map2 => map2 => (maybe collect)

If I would like to also count my RDD, is there any way I could include this
at map1? So that as spark runs through map1, it also does a count? Or would
count need to be a separate operation such that I would have to run through
my dataset again. My dataset is really memory intensive so I'd rather not
cache() it if possible.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-two-operations-on-the-same-RDD-simultaneously-tp25441.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



FW: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-11-20 Thread Mich Talebzadeh
From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 20 November 2015 21:14
To: u...@hive.apache.org
Subject: starting spark-shell throws /tmp/hive on HDFS should be writable
error

 

Hi,

 

Has this been resolved? I don't think this has anything to do with the
/tmp/hive directory permissions.

 

spark-shell

log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.

Using Spark's repl log4j profile:
org/apache/spark/log4j-defaults-repl.properties

To adjust logging level use sc.setLogLevel("INFO")

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

 

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_25)

Type in expressions to have them evaluated.

Type :help for more information.

java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
dir: /tmp/hive on HDFS should be writable. Current permissions are:
rwx--

 

 

<console>:10: error: not found: value sqlContext
       import sqlContext.implicits._
              ^
<console>:10: error: not found: value sqlContext
       import sqlContext.sql
              ^

 

scala>

 

Thanks,

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



question about combining small input splits

2015-11-20 Thread nezih
Hey everyone,
I have a Hive table that has a lot of small parquet files and I am creating
a data frame out of it to do some processing, but since I have a large
number of splits/files my job creates a lot of tasks, which I don't want.
Basically what I want is the same functionality that Hive provides, that is,
to combine these small input splits into larger ones by specifying a max
split size setting. Is this currently possible with Spark?

While exploring whether I can use coalesce I hit another issue. With
coalesce I can only control the number of output files not their sizes. And
since the total input dataset size can vary significantly in my case, I
cannot just use a fixed partition count as the size of each output can get
very large. I looked for getting the total input size from an rdd to come up
with some heuristic to set the partition count, but I couldn't find any ways
to do it. 

Any help is appreciated.

Thanks,

Nezih



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/question-about-combining-small-input-splits-tp25440.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Yarn Spark on EMR

2015-11-20 Thread Bozeman, Christopher
Suraj,

The Spark History Server is running on 18080
(http://spark.apache.org/docs/latest/monitoring.html), which is not going to
give you a real-time view of a running Spark application. Given this is Spark
on YARN, you will need to view the Spark UI from the Application Master URL,
which can be found in the YARN Resource Manager UI (master node:8088), and it
would be best to use a SOCKS proxy in order to resolve the URLs nicely.

Best regards,
Christopher


From: SURAJ SHETH [mailto:shet...@gmail.com]
Sent: Sunday, November 15, 2015 8:19 AM
To: user@spark.apache.org
Subject: Yarn Spark on EMR

Hi,
Yarn UI on 18080 stops receiving updates Spark jobs/tasks immediately after it 
starts. We see only one task completed in the UI while the other hasn't got any 
resources while in reality, more than 5 tasks would have completed.
Hadoop - Amazon 2.6
Spark - 1.5

Thanks and Regards,
Suraj Sheth


RE: Spark Expand Cluster

2015-11-20 Thread Bozeman, Christopher
Dan,

Even though you may be adding more nodes to the cluster, the Spark application
has to request additional executors in order to use the added resources.
Alternatively, the Spark application can use Dynamic Resource Allocation
(http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation),
which acquires and releases resources based on application need and
availability. For example, in EMR release 4.x
(http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#spark-dynamic-allocation)
you can request Spark Dynamic Resource Allocation as the default configuration
at cluster creation.
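
For reference, a minimal spark-submit sketch enabling dynamic allocation; the
external shuffle service must also be enabled on the nodes, and the class,
jar, and executor bounds below are placeholders:

spark-submit --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --class com.example.MyApp myapp.jar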

Best regards,
Christopher


From: Dinesh Ranganathan [mailto:dineshranganat...@gmail.com]
Sent: Monday, November 16, 2015 4:57 AM
To: Sabarish Sasidharan
Cc: user
Subject: Re: Spark Expand Cluster

Hi Sab,

I did not specify number of executors when I submitted the spark application. I 
was in the impression spark looks at the cluster and figures out the number of 
executors it can use based on the cluster size automatically, is this what you 
call dynamic allocation?. I am spark newbie, so apologies if I am missing the 
obvious. While the application was running I added more core nodes by resizing 
my EMR instance and I can see the new nodes on the resource manager but my 
running application did not pick up those machines I've just added.   Let me 
know If i am missing a step here.

Thanks,
Dan

On 16 November 2015 at 12:38, Sabarish Sasidharan 
mailto:sabarish.sasidha...@manthan.com>> wrote:
Spark will use the number of executors you specify in spark-submit. Are you 
saying that Spark is not able to use more executors after you modify it in 
spark-submit? Are you using dynamic allocation?

Regards
Sab

On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan 
mailto:dineshranganat...@gmail.com>> wrote:
I have my Spark application deployed on AWS EMR on yarn cluster mode.  When I
increase the capacity of my cluster by adding more Core instances on AWS, I
don't see Spark picking up the new instances dynamically. Is there anything
I can do to tell Spark to pick up the newly added boxes??

Dan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



--

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
India ICT)
+++



--
Dinesh Ranganathan


Corelation between 2 consecutive RDDs in Dstream

2015-11-20 Thread anshu shukla
1. Is there any way to make pairs of consecutive RDDs from a DStream, i.e.
DStream ---> DStream of (previous RDD, current RDD) pairs,

so that I can use the correlation function already defined in Spark?

*The aim is to find the auto-correlation value in Spark. (As far as I know,
Spark Streaming does not support this.)*


-- 
Thanks & Regards,
Anshu Shukla


Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread kali.tumm...@gmail.com
Hi All,

If write-ahead logs are enabled in Spark Streaming, does all the received
data get written to the HDFS path, or does it only write the metadata?
How does cleanup work? Does the HDFS path get bigger and bigger every day?
Do I need to write a cleanup job to delete data from the write-ahead logs
folder? What does the write-ahead log folder actually contain?

Thanks
Sri 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-streaming-write-ahead-log-writes-all-received-data-to-HDFS-tp25439.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Never mind. I had a library dependency that still pulled in the old Spark version.
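
For anyone hitting the same error, a minimal build.sbt sketch that keeps
every Spark artifact on a single version (the module list is a placeholder):

// build.sbt
val sparkVersion = "1.5.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive"      % sparkVersion % "provided"
)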

On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey 
wrote:

> The 1.5.2 Spark was compiled using the following options:  mvn
> -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive
> -Phive-thriftserver clean package
>
> Regards,
>
> Bryan Jeffrey
>
> On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to
>> 1.5.2.  Has anyone seen this issue?
>>
>> I'm invoking the following:
>>
>> new HiveContext(sc) // sc is a Spark Context
>>
>> I am seeing the following error:
>>
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/spark/spark-1.5.2/assembly/target/scala-2.11/spark-assembly-1.5.2-hadoop2.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/hadoop-2.6.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> Exception in thread "main" java.lang.NoSuchMethodException:
>> org.apache.hadoop.hive.conf.HiveConf.getTimeVar(org.apache.hadoop.hive.conf.HiveConf$ConfVars,
>> java.util.concurrent.TimeUnit)
>> at java.lang.Class.getMethod(Class.java:1786)
>> at
>> org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)
>> at
>> org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod$lzycompute(HiveShim.scala:415)
>> at
>> org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod(HiveShim.scala:414)
>> at
>> org.apache.spark.sql.hive.client.Shim_v0_14.getMetastoreClientConnectRetryDelayMillis(HiveShim.scala:459)
>> at
>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:198)
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>> at
>> org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
>> at
>> org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179)
>> at
>> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
>> at
>> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
>> at
>> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
>> at
>> org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
>> at
>> org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177)
>> at
>> Main.Factories.HiveContextSingleton$.createHiveContext(HiveContextSingleton.scala:21)
>> at
>> Main.Factories.HiveContextSingleton$.getHiveContext(HiveContextSingleton.scala:14)
>> at
>> Main.Factories.SparkStreamingContextFactory$.createSparkContext(SparkStreamingContextFactory.scala:35)
>> at Main.WriteModel$.main(WriteModel.scala:16)
>> at Main.WriteModel.main(WriteModel.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>> at
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>> at
>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>> at
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>
>


Re: getting different results from same line of code repeated

2015-11-20 Thread Ted Yu
Mind trying 1.5.2 release ?

Thanks

On Fri, Nov 20, 2015 at 10:56 AM, Walrus theCat 
wrote:

> I'm running into all kinds of problems with Spark 1.5.1 -- does anyone
> have a version that's working smoothly for them?
>
> On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler 
> wrote:
>
>> I didn't expect that to fail. I would call it a bug for sure, since it's
>> practically useless if this method doesn't work.
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Fri, Nov 20, 2015 at 12:45 PM, Walrus theCat 
>> wrote:
>>
>>> Dean,
>>>
>>> What's the point of Scala without magic? :-)
>>>
>>> Thanks for your help.  It's still giving me unreliable results.  There
>>> just has to be a way to do this in Spark.  It's a pretty fundamental thing.
>>>
>>> scala> targets.takeOrdered(1) // imported as implicit here
>>> res23: Array[(String, Int)] = Array()
>>>
>>> scala> targets.takeOrdered(1)(CountOrdering)
>>> res24: Array[(String, Int)] = Array((\bmurders?\b,717))
>>>
>>> scala> targets.takeOrdered(1)(CountOrdering)
>>> res25: Array[(String, Int)] = Array((\bmurders?\b,717))
>>>
>>> scala> targets.takeOrdered(1)(CountOrdering)
>>> res26: Array[(String, Int)] = Array((\bguns?\b,1253))
>>>
>>> scala> targets.takeOrdered(1)(CountOrdering)
>>> res27: Array[(String, Int)] = Array((\bmurders?\b,717))
>>>
>>>
>>>
>>> On Wed, Nov 18, 2015 at 6:20 PM, Dean Wampler 
>>> wrote:
>>>
 You don't have to use sortBy (although that would be better...). You
 have to define an Ordering object and pass it as the second argument list
 to takeOrdered()(), or declare it "implicitly". This is more fancy Scala
 than Spark should require here. Here's an example I've used:

   // schema with (String,Int). Order by the Int descending
   object CountOrdering extends Ordering[(String,Int)] {
 def compare(a:(String,Int), b:(String,Int)) =
   -(a._2 compare b._2)  // - so that it sorts descending
   }

   myRDD.takeOrdered(100)(CountOrdering)


 Or, if you add the keyword "implicit" before "object CountOrdering
 {...}", then you can omit the second argument list. That's more magic than
 is justified. ;)

 HTH,

 dean


 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
  (O'Reilly)
 Typesafe 
 @deanwampler 
 http://polyglotprogramming.com

 On Wed, Nov 18, 2015 at 6:37 PM, Walrus theCat 
 wrote:

> Dean,
>
> Thanks a lot.  Very helpful.  How would I use takeOrdered to order by
> the second member of the tuple, as I am attempting to do with
> rdd.sortBy(_._2).first?
>
> On Wed, Nov 18, 2015 at 4:24 PM, Dean Wampler 
> wrote:
>
>> Someone please correct me if I'm wrong, but I think the answer is
>> actually "it's not implemented that way" in the sort methods, and it 
>> should
>> either be documented more explicitly or fixed.
>>
>> Reading the Spark source code, it looks like each partition is sorted
>> internally, and each partition holds a contiguous range of keys in the 
>> RDD.
>> So, if you know which order the partitions should be in, you can produce 
>> a
>> total order and hence allow take(n) to do what you expect.
>>
>> The take(n) appears to walk the list of partitions in order, but it's
>> that list that's not deterministic. I can't find any evidence that the 
>> RDD
>> output by sortBy has this list of partitions in the correct order. So, 
>> each
>> time you ran your job, the "targets" RDD had sorted partitions, but the
>> list of partitions itself was not properly ordered globally. When you got
>> an exception, probably the first partition happened to be empty.
>>
>> Now, you could argue that take(n) is a "debug" method and the
>> performance implications of getting the RDD.partitions list in total 
>> order
>> is not justified. There is a takeOrdered(n) method that is both much more
>> efficient than sort().take(n), and it does the correct thing. Still, at 
>> the
>> very least, the documentation for take(n) should tell you what to expect.
>>
>> Hope I'm right and this helps!
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Nov 18, 2015 at 5:53 PM, Walrus theCat <
>> walrusthe...@gmail.com> wrote:
>>
>>> Dean,
>>>
>>> Thanks for the insight.  Shouldn'

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Cody Koeninger
You're confused about which parts of your code are running on the driver vs
the executor, which is why you're getting serialization errors.

Read

http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
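
Roughly, the pattern described there is to open the connection on the
executor inside foreachPartition instead of sharing a driver-side writer; a
minimal sketch with a placeholder host and port:

import java.io.PrintWriter
import java.net.Socket

union.flatMap(_.split(' ')).foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Created on the executor, so nothing non-serializable is captured from the driver.
    val socket = new Socket("consumer-host", 9998)   // placeholder endpoint
    val out = new PrintWriter(socket.getOutputStream)
    try {
      partition.foreach(line => out.println(line))
      out.flush()
    } finally {
      out.close()
      socket.close()
    }
  }
}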



On Fri, Nov 20, 2015 at 1:07 PM, Saiph Kappa  wrote:

> I think my problem persists whether I use Kafka or sockets. Or am I wrong?
> How would you use Kafka here?
>
> On Fri, Nov 20, 2015 at 7:12 PM, Christian  wrote:
>
>> Have you considered using Kafka?
>>
>> On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a basic spark streaming application like this:
>>>
>>> «
>>> ...
>>>
>>> val ssc = new StreamingContext(sparkConf, Duration(batchMillis))
>>> val rawStreams = (1 to numStreams).map(_ =>
>>>   ssc.rawSocketStream[String](host, port, 
>>> StorageLevel.MEMORY_ONLY_SER)).toArray
>>> val union = ssc.union(rawStreams)
>>>
>>> union.flatMap(line => line.split(' ')).foreachRDD(rdd => {
>>>
>>>   // TODO
>>>
>>> }
>>> ...
>>> »
>>>
>>>
>>> My question is: what is the best and fastest way to send the resulting rdds
>>> as input to be consumed by another spark streaming application?
>>>
>>> I tried to add this code in place of the "TODO" comment:
>>>
>>> «
>>> val serverSocket = new ServerSocket(9998)
>>> while (true) {
>>>   val socket = serverSocket.accept()
>>>   @transient val out = new PrintWriter(socket.getOutputStream)
>>>   try {
>>> rdd.foreach(out.write)
>>>   } catch {
>>> case e: IOException =>
>>>   socket.close()
>>>   }
>>> }
>>> »
>>>
>>>
>>> I also tried to create a thread in the driver application code to launch the
>>> socket server and then share state (the PrintWriter object) between the 
>>> driver program and tasks.
>>> But got an exception saying that task is not serializable - PrintWriter is 
>>> not serializable
>>> (despite the @trasient annotation). I know this is not a very elegant 
>>> solution, but what other
>>> directions should I explore?
>>>
>>> Thanks.
>>>
>>>
>


Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
The 1.5.2 Spark was compiled using the following options:  mvn
-Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive
-Phive-thriftserver clean package

Regards,

Bryan Jeffrey

On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey 
wrote:

> Hello.
>
> I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to
> 1.5.2.  Has anyone seen this issue?
>
> I'm invoking the following:
>
> new HiveContext(sc) // sc is a Spark Context
>
> I am seeing the following error:
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/spark/spark-1.5.2/assembly/target/scala-2.11/spark-assembly-1.5.2-hadoop2.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/hadoop-2.6.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" java.lang.NoSuchMethodException:
> org.apache.hadoop.hive.conf.HiveConf.getTimeVar(org.apache.hadoop.hive.conf.HiveConf$ConfVars,
> java.util.concurrent.TimeUnit)
> at java.lang.Class.getMethod(Class.java:1786)
> at
> org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)
> at
> org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod$lzycompute(HiveShim.scala:415)
> at
> org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod(HiveShim.scala:414)
> at
> org.apache.spark.sql.hive.client.Shim_v0_14.getMetastoreClientConnectRetryDelayMillis(HiveShim.scala:459)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:198)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
> at
> org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
> at
> org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177)
> at
> Main.Factories.HiveContextSingleton$.createHiveContext(HiveContextSingleton.scala:21)
> at
> Main.Factories.HiveContextSingleton$.getHiveContext(HiveContextSingleton.scala:14)
> at
> Main.Factories.SparkStreamingContextFactory$.createSparkContext(SparkStreamingContextFactory.scala:35)
> at Main.WriteModel$.main(WriteModel.scala:16)
> at Main.WriteModel.main(WriteModel.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>


Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Hello.

I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to
1.5.2.  Has anyone seen this issue?

I'm invoking the following:

new HiveContext(sc) // sc is a Spark Context

I am seeing the following error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/spark/spark-1.5.2/assembly/target/scala-2.11/spark-assembly-1.5.2-hadoop2.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/hadoop-2.6.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.NoSuchMethodException:
org.apache.hadoop.hive.conf.HiveConf.getTimeVar(org.apache.hadoop.hive.conf.HiveConf$ConfVars,
java.util.concurrent.TimeUnit)
at java.lang.Class.getMethod(Class.java:1786)
at
org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)
at
org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod$lzycompute(HiveShim.scala:415)
at
org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod(HiveShim.scala:414)
at
org.apache.spark.sql.hive.client.Shim_v0_14.getMetastoreClientConnectRetryDelayMillis(HiveShim.scala:459)
at
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:198)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
at
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
at
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
at
org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
at
org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
at
org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:177)
at
Main.Factories.HiveContextSingleton$.createHiveContext(HiveContextSingleton.scala:21)
at
Main.Factories.HiveContextSingleton$.getHiveContext(HiveContextSingleton.scala:14)
at
Main.Factories.SparkStreamingContextFactory$.createSparkContext(SparkStreamingContextFactory.scala:35)
at Main.WriteModel$.main(WriteModel.scala:16)
at Main.WriteModel.main(WriteModel.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
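
A hedged troubleshooting sketch (not from the thread): this NoSuchMethodException usually
points at a mismatch between the Hive jars on the classpath and the metastore client shim
Spark selected. In Spark 1.5.x the metastore client can be pinned explicitly; the version
below is an assumption and should match the Hive that is actually installed.

  // sketch: pin the Hive metastore client in Spark 1.5.x
  val conf = new org.apache.spark.SparkConf()
    .set("spark.sql.hive.metastore.version", "0.14.0")   // assumed Hive version
    .set("spark.sql.hive.metastore.jars", "maven")       // or a classpath of matching Hive jars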


Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
I think my problem persists whether I use Kafka or sockets. Or am I wrong?
How would you use Kafka here?

On Fri, Nov 20, 2015 at 7:12 PM, Christian  wrote:

> Have you considered using Kafka?
>
> On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa  wrote:
>
>> Hi,
>>
>> I have a basic spark streaming application like this:
>>
>> «
>> ...
>>
>> val ssc = new StreamingContext(sparkConf, Duration(batchMillis))
>> val rawStreams = (1 to numStreams).map(_ =>
>>   ssc.rawSocketStream[String](host, port, 
>> StorageLevel.MEMORY_ONLY_SER)).toArray
>> val union = ssc.union(rawStreams)
>>
>> union.flatMap(line => line.split(' ')).foreachRDD(rdd => {
>>
>>   // TODO
>>
>> }
>> ...
>> »
>>
>>
>> My question is: what is the best and fastest way to send the resulting rdds
>> as input to be consumed by another spark streaming application?
>>
>> I tried to add this code in place of the "TODO" comment:
>>
>> «
>> val serverSocket = new ServerSocket(9998)
>> while (true) {
>>   val socket = serverSocket.accept()
>>   @transient val out = new PrintWriter(socket.getOutputStream)
>>   try {
>> rdd.foreach(out.write)
>>   } catch {
>> case e: IOException =>
>>   socket.close()
>>   }
>> }
>> »
>>
>>
>> I also tried to create a thread in the driver application code to launch the
>> socket server and then share state (the PrintWriter object) between the 
>> driver program and tasks.
>> But got an exception saying that task is not serializable - PrintWriter is 
>> not serializable
>> (despite the @transient annotation). I know this is not a very elegant 
>> solution, but what other
>> directions should I explore?
>>
>> Thanks.
>>
>>
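
A hedged sketch of the Kafka route (broker address, topic name, and serializers are
assumptions, not from the thread): the producer is created inside foreachPartition on the
executors, so nothing non-serializable is captured from the driver.

  union.flatMap(line => line.split(' ')).foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      val props = new java.util.Properties()
      props.put("bootstrap.servers", "localhost:9092")   // assumed broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new org.apache.kafka.clients.producer.KafkaProducer[String, String](props)
      iter.foreach(word => producer.send(
        new org.apache.kafka.clients.producer.ProducerRecord[String, String]("words", word)))
      producer.close()
    }
  }

The second application would then consume the same topic, for example with
KafkaUtils.createDirectStream from the spark-streaming-kafka artifact.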


Re: getting different results from same line of code repeated

2015-11-20 Thread Walrus theCat
I'm running into all kinds of problems with Spark 1.5.1 -- does anyone have
a version that's working smoothly for them?

On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler 
wrote:

> I didn't expect that to fail. I would call it a bug for sure, since it's
> practically useless if this method doesn't work.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Fri, Nov 20, 2015 at 12:45 PM, Walrus theCat 
> wrote:
>
>> Dean,
>>
>> What's the point of Scala without magic? :-)
>>
>> Thanks for your help.  It's still giving me unreliable results.  There
>> just has to be a way to do this in Spark.  It's a pretty fundamental thing.
>>
>> scala> targets.takeOrdered(1) // imported as implicit here
>> res23: Array[(String, Int)] = Array()
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res24: Array[(String, Int)] = Array((\bmurders?\b,717))
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res25: Array[(String, Int)] = Array((\bmurders?\b,717))
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res26: Array[(String, Int)] = Array((\bguns?\b,1253))
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res27: Array[(String, Int)] = Array((\bmurders?\b,717))
>>
>>
>>
>> On Wed, Nov 18, 2015 at 6:20 PM, Dean Wampler 
>> wrote:
>>
>>> You don't have to use sortBy (although that would be better...). You
>>> have to define an Ordering object and pass it as the second argument list
>>> to takeOrdered()(), or declare it "implicitly". This is more fancy Scala
>>> than Spark should require here. Here's an example I've used:
>>>
>>>   // schema with (String,Int). Order by the Int descending
>>>   object CountOrdering extends Ordering[(String,Int)] {
>>> def compare(a:(String,Int), b:(String,Int)) =
>>>   -(a._2 compare b._2)  // - so that it sorts descending
>>>   }
>>>
>>>   myRDD.takeOrdered(100)(CountOrdering)
>>>
>>>
>>> Or, if you add the keyword "implicit" before "object CountOrdering
>>> {...}", then you can omit the second argument list. That's more magic than
>>> is justified. ;)
>>>
>>> HTH,
>>>
>>> dean
>>>
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Nov 18, 2015 at 6:37 PM, Walrus theCat 
>>> wrote:
>>>
 Dean,

 Thanks a lot.  Very helpful.  How would I use takeOrdered to order by
 the second member of the tuple, as I am attempting to do with
 rdd.sortBy(_._2).first?

 On Wed, Nov 18, 2015 at 4:24 PM, Dean Wampler 
 wrote:

> Someone please correct me if I'm wrong, but I think the answer is
> actually "it's not implemented that way" in the sort methods, and it 
> should
> either be documented more explicitly or fixed.
>
> Reading the Spark source code, it looks like each partition is sorted
> internally, and each partition holds a contiguous range of keys in the 
> RDD.
> So, if you know which order the partitions should be in, you can produce a
> total order and hence allow take(n) to do what you expect.
>
> The take(n) appears to walk the list of partitions in order, but it's
> that list that's not deterministic. I can't find any evidence that the RDD
> output by sortBy has this list of partitions in the correct order. So, 
> each
> time you ran your job, the "targets" RDD had sorted partitions, but the
> list of partitions itself was not properly ordered globally. When you got
> an exception, probably the first partition happened to be empty.
>
> Now, you could argue that take(n) is a "debug" method and the
> performance implications of getting the RDD.partitions list in total order
> is not justified. There is a takeOrdered(n) method that is both much more
> efficient than sort().take(n), and it does the correct thing. Still, at 
> the
> very least, the documentation for take(n) should tell you what to expect.
>
> Hope I'm right and this helps!
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Nov 18, 2015 at 5:53 PM, Walrus theCat wrote:
>
>> Dean,
>>
>> Thanks for the insight.  Shouldn't take(n) or first return the same
>> result, provided that the RDD is sorted?  If I specify that the RDD is
>> ordered, I need to have guarantees as I reason about it that the first 
>> item
>> is in fact the first, and the last is the last.
>>
>> On We
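
A side note on the takeOrdered discussion above: the Ordering can also be built inline
with Ordering.by instead of a named object (a minimal sketch, reusing the (String, Int)
count schema from the thread). If results still vary between runs on the real data, the
variation is likely upstream of the sort (the RDD being recomputed differently each time),
so caching "targets" before sorting may be worth trying.

  // descending by count: takeOrdered returns the smallest under the given Ordering,
  // so reverse the natural ascending order
  val targets = sc.parallelize(Seq(("\\bguns?\\b", 1253), ("\\bmurders?\\b", 717)))
  val top = targets.takeOrdered(1)(Ordering.by[(String, Int), Int](_._2).reverse)
  // Array((\bguns?\b,1253)) for this small example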

Re: Drop multiple columns in the DataFrame API

2015-11-20 Thread Ted Yu
Created PR:
https://github.com/apache/spark/pull/9862

On Fri, Nov 20, 2015 at 10:17 AM, BenFradet 
wrote:

> Hi everyone,
>
> I was wondering if there is a better way to drop mutliple columns from a
> dataframe or why there is no drop(cols: Column*) method in the dataframe
> API.
>
> Indeed, I tend to write code like this:
>
> val filteredDF = df.drop("colA")
>.drop("colB")
>.drop("colC")
> //etc
>
> which is a bit lengthy, or:
>
> val colsToRemove = Seq("colA", "colB", "colC", etc)
> val filteredDF = df.select(df.columns
>   .filter(colName => !colsToRemove.contains(colName))
>   .map(colName => new Column(colName)): _*)
>
> which is, I think, a bit ugly.
>
> Thanks,
> Ben.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Drop-multiple-columns-in-the-DataFrame-API-tp25438.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
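
Until a varargs drop exists, one compact workaround is to fold the existing single-column
drop over the list of names (a sketch using the column names from the question):

  val colsToRemove = Seq("colA", "colB", "colC")
  val filteredDF = colsToRemove.foldLeft(df)((current, colName) => current.drop(colName))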


Drop multiple columns in the DataFrame API

2015-11-20 Thread BenFradet
Hi everyone,

I was wondering if there is a better way to drop mutliple columns from a
dataframe or why there is no drop(cols: Column*) method in the dataframe
API.

Indeed, I tend to write code like this:

val filteredDF = df.drop("colA")
   .drop("colB")
   .drop("colC")
//etc

which is a bit lengthy, or:

val colsToRemove = Seq("colA", "colB", "colC", etc)
val filteredDF = df.select(df.columns
  .filter(colName => !colsToRemove.contains(colName))
  .map(colName => new Column(colName)): _*)

which is, I think, a bit ugly.

Thanks,
Ben.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Drop-multiple-columns-in-the-DataFrame-API-tp25438.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Christian
Have you considered using Kafka?
On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa  wrote:

> Hi,
>
> I have a basic spark streaming application like this:
>
> «
> ...
>
> val ssc = new StreamingContext(sparkConf, Duration(batchMillis))
> val rawStreams = (1 to numStreams).map(_ =>
>   ssc.rawSocketStream[String](host, port, 
> StorageLevel.MEMORY_ONLY_SER)).toArray
> val union = ssc.union(rawStreams)
>
> union.flatMap(line => line.split(' ')).foreachRDD(rdd => {
>
>   // TODO
>
> }
> ...
> »
>
>
> My question is: what is the best and fastest way to send the resulting rdds
> as input to be consumed by another spark streaming application?
>
> I tried to add this code in place of the "TODO" comment:
>
> «
> val serverSocket = new ServerSocket(9998)
> while (true) {
>   val socket = serverSocket.accept()
>   @transient val out = new PrintWriter(socket.getOutputStream)
>   try {
> rdd.foreach(out.write)
>   } catch {
> case e: IOException =>
>   socket.close()
>   }
> }
> »
>
>
> I also tried to create a thread in the driver application code to launch the
> socket server and then share state (the PrintWriter object) between the 
> driver program and tasks.
> But got an exception saying that task is not serializable - PrintWriter is 
> not serializable
> (despite the @transient annotation). I know this is not a very elegant 
> solution, but what other
> directions should I explore?
>
> Thanks.
>
>


Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Andy Davidson
Hi Igor

Thanks. The reason I am using cluster mode is that the stream app must
run forever. I am using client mode for my pyspark work.

Andy

From:  Igor Berman 
Date:  Friday, November 20, 2015 at 6:22 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: newbie: unable to use all my cores and memory

> u've asked total cores to be 2 + 1 for driver(since you are running in cluster
> mode, so it's running on one of the slaves)
> change total cores to be 3*2
> change submit mode to be client - you'll have full utilization
> (btw it's not advisable to use all cores of slave...since there is OS
> processes and other processes...)
> 
> On 20 November 2015 at 02:02, Andy Davidson 
> wrote:
>> I am having a heck of a time figuring out how to utilize my cluster
>> effectively. I am using the stand alone cluster manager. I have a master
>> and 3 slaves. Each machine has 2 cores.
>> 
>> I am trying to run a streaming app in cluster mode and pyspark at the same
>> time.
>> 
>> t1) On my console I see
>> 
>> * Alive Workers: 3
>> * Cores in use: 6 Total, 0 Used
>> * Memory in use: 18.8 GB Total, 0.0 B Used
>> * Applications: 0 Running, 15 Completed
>> * Drivers: 0 Running, 2 Completed
>> * Status: ALIVE
>> 
>> t2) I start my streaming app
>> 
>> $SPARK_ROOT/bin/spark-submit \
>> --class "com.pws.spark.streaming.IngestDriver" \
>> --master $MASTER_URL \
>> --total-executor-cores 2 \
>> --deploy-mode cluster \
>> $jarPath --clusterMode  $*
>> 
>> t3) on my console I see
>> 
>> * Alive Workers: 3
>> * Cores in use: 6 Total, 3 Used
>> * Memory in use: 18.8 GB Total, 13.0 GB Used
>> * Applications: 1 Running, 15 Completed
>> * Drivers: 1 Running, 2 Completed
>> * Status: ALIVE
>> 
>> Looks like pyspark should be able to use the 3 remaining cores and 5.8 GB
>> of memory
>> 
>> t4) I start pyspark
>> 
>> export PYSPARK_PYTHON=python3.4
>> export PYSPARK_DRIVER_PYTHON=python3.4
>> export IPYTHON_OPTS="notebook --no-browser --port=7000
>> --log-level=WARN"
>> 
>> $SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores 3
>> --executor-memory 2g
>> 
>> t5) on my console I see
>> 
>> * Alive Workers: 3
>> * Cores in use: 6 Total, 4 Used
>> * Memory in use: 18.8 GB Total, 15.0 GB Used
>> * Applications: 2 Running, 18 Completed
>> * Drivers: 1 Running, 2 Completed
>> * Status: ALIVE
>> 
>> 
>> I have 2 unused cores and a lot of memory left over. My pyspark
>> application is only getting 1 core. If the streaming app is not running,
>> pyspark would be assigned 2 cores each on a different worker. I have tried
>> using various combinations of --executor-cores and --total-executor-cores.
>> Any idea how to get pyspark to use more cores and memory?
>> 
>> 
>> Kind regards
>> 
>> Andy
>> 
>> P.s.  Using different values I have wound up with  pyspark status ==
>> "waiting". I think this is because there are not enough cores available?
>> 
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 




how to use sc.hadoopConfiguration from pyspark

2015-11-20 Thread Tamas Szuromi
Hello,

I just wanted to use sc._jsc.hadoopConfiguration().set('key','value') in
pyspark 1.5.2, but I got a "set method not exists" error.

Is there anyone who knows a workaround to set some HDFS-related properties
like dfs.blocksize?

Thanks in advance!

Tamas
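
One workaround worth noting (a sketch; the property value is illustrative): any
"spark.hadoop.*" entry on the SparkConf is copied into sc.hadoopConfiguration when the
context starts, so HDFS properties can also be passed as
--conf spark.hadoop.dfs.blocksize=134217728 to spark-submit, which works from pyspark too.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("hdfs-props-example")                // hypothetical app name
    .set("spark.hadoop.dfs.blocksize", "134217728")  // 128 MB, illustrative
  val sc = new SparkContext(conf)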


Data in one partition after reduceByKey

2015-11-20 Thread Patrick McGloin
Hi,

I have Spark application which contains the following segment:

val reparitioned = rdd.repartition(16)
val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned,
startDate, endDate)
val mapped: RDD[(DateTime, myData)] =
filtered.map(kv => (kv._1.processingTime, kv._2))
val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)

When I run this with some logging this is what I see:

reparitioned ==> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512,
2508, 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
filtered ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076,
2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
mapped ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076,
2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
reduced ==> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]

My logging is done using these two lines:

val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ==> [${sizes.collect.toList}]")

My question is why does my data end up in one partition after the
reduceByKey? After the filter it can be seen that the data is evenly
distributed, but the reduceByKey results in data in only one partition.

Thanks,

Patrick
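
A quick diagnostic sketch (not from the thread): check how the distinct keys map onto the
16 partitions under the default HashPartitioner. If most of them land on one partition
index, for example because there are only a handful of distinct processingTime values,
the skew comes from the keys themselves rather than from reduceByKey.

  import org.apache.spark.HashPartitioner

  val partitioner = new HashPartitioner(16)
  val spread = mapped.keys.distinct().map(k => partitioner.getPartition(k)).countByValue()
  println(spread)   // Map(partitionIndex -> number of distinct keys hashed there)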


Re: Save GraphX to disk

2015-11-20 Thread Ashish Rawat
Hi Todd,

Could you please provide an example of doing this. Mazerunner seems to be doing 
something similar with Neo4j but it goes via hdfs and updates only the graph 
properties. Is there a direct way to do this with Neo4j or Titan?

Regards,
Ashish

From: SLiZn Liu <sliznmail...@gmail.com>
Date: Saturday, 14 November 2015 7:44 am
To: Gaurav Kumar <gauravkuma...@gmail.com>,
"user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Save GraphX to disk

Hi Gaurav,

Your graph can be saved to graph databases like Neo4j or Titan through their 
drivers, that eventually saved to the disk.

BR,
Todd

Gaurav Kumar <gauravkuma...@gmail.com> wrote on Friday, November 13, 2015 at 22:08:
Hi,

I was wondering how to save a graph to disk and load it back again. I know how 
to save vertices and edges to disk and construct the graph from them, not sure 
if there's any method to save the graph itself to disk.

Best Regards,
Gaurav Kumar
Big Data * Data Science * Photography * Music
+91 9953294125
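
For reference, the vertices-and-edges route mentioned in the thread looks roughly like the
sketch below (assuming a Graph[String, Int] and hypothetical HDFS paths); the 1.x GraphX
API has no single save-the-graph call, so the two underlying RDDs are persisted and the
Graph is rebuilt on load.

  import org.apache.spark.graphx.{Edge, Graph, VertexId}

  graph.vertices.saveAsObjectFile("hdfs:///tmp/myGraph/vertices")
  graph.edges.saveAsObjectFile("hdfs:///tmp/myGraph/edges")

  val vertices = sc.objectFile[(VertexId, String)]("hdfs:///tmp/myGraph/vertices")
  val edges = sc.objectFile[Edge[Int]]("hdfs:///tmp/myGraph/edges")
  val reloaded: Graph[String, Int] = Graph(vertices, edges)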


Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Igor Berman
u've asked total cores to be 2 + 1 for driver(since you are running in
cluster mode, so it's running on one of the slaves)
change total cores to be 3*2
change submit mode to be client - you'll have full utilization
(btw it's not advisable to use all cores of slave...since there is OS
processes and other processes...)

On 20 November 2015 at 02:02, Andy Davidson 
wrote:

> I am having a heck of a time figuring out how to utilize my cluster
> effectively. I am using the stand alone cluster manager. I have a master
> and 3 slaves. Each machine has 2 cores.
>
> I am trying to run a streaming app in cluster mode and pyspark at the same
> time.
>
> t1) On my console I see
>
> * Alive Workers: 3
> * Cores in use: 6 Total, 0 Used
> * Memory in use: 18.8 GB Total, 0.0 B Used
> * Applications: 0 Running, 15 Completed
> * Drivers: 0 Running, 2 Completed
> * Status: ALIVE
>
> t2) I start my streaming app
>
> $SPARK_ROOT/bin/spark-submit \
> --class "com.pws.spark.streaming.IngestDriver" \
> --master $MASTER_URL \
> --total-executor-cores 2 \
> --deploy-mode cluster \
> $jarPath --clusterMode  $*
>
> t3) on my console I see
>
> * Alive Workers: 3
> * Cores in use: 6 Total, 3 Used
> * Memory in use: 18.8 GB Total, 13.0 GB Used
> * Applications: 1 Running, 15 Completed
> * Drivers: 1 Running, 2 Completed
> * Status: ALIVE
>
> Looks like pyspark should be able to use the 3 remaining cores and 5.8 GB
> of memory
>
> t4) I start pyspark
>
> export PYSPARK_PYTHON=python3.4
> export PYSPARK_DRIVER_PYTHON=python3.4
> export IPYTHON_OPTS="notebook --no-browser --port=7000
> --log-level=WARN"
>
> $SPARK_ROOT/bin/pyspark --master $MASTER_URL
> --total-executor-cores 3
> --executor-memory 2g
>
> t5) on my console I see
>
> * Alive Workers: 3
> * Cores in use: 6 Total, 4 Used
> * Memory in use: 18.8 GB Total, 15.0 GB Used
> * Applications: 2 Running, 18 Completed
> * Drivers: 1 Running, 2 Completed
> * Status: ALIVE
>
>
> I have 2 unused cores and a lot of memory left over. My pyspark
> application is only getting 1 core. If the streaming app is not running,
> pyspark would be assigned 2 cores each on a different worker. I have tried
> using various combinations of --executor-cores and --total-executor-cores.
> Any idea how to get pyspark to use more cores and memory?
>
>
> Kind regards
>
> Andy
>
> P.s.  Using different values I have wound up with  pyspark status ==
> "waiting". I think this is because there are not enough cores available?
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-20 Thread Igor Berman
Try to assemble log4j.xml or log4j.properties in your jar...probably you'll
get what you want; however, pay attention that when you move to a multinode
cluster there will be differences.

On 20 November 2015 at 05:10, Afshartous, Nick 
wrote:

>
> < log4j.properties file only exists on the master and not the slave nodes,
> so you are probably running into
> https://issues.apache.org/jira/browse/SPARK-11105, which has already been
> fixed in the not-yet-released Spark 1.6.0. EMR will upgrade to Spark 1.6.0
> once it is released.
>
> Thanks for the info, though this is a single-node cluster so that can't be
> the cause of the error (which is in the driver log).
> --
>   Nick
> 
> From: Jonathan Kelly [jonathaka...@gmail.com]
> Sent: Thursday, November 19, 2015 6:45 PM
> To: Afshartous, Nick
> Cc: user@spark.apache.org
> Subject: Re: Configuring Log4J (Spark 1.5 on EMR 4.1)
>
> This file only exists on the master and not the slave nodes, so you are
> probably running into https://issues.apache.org/jira/browse/SPARK-11105,
> which has already been fixed in the not-yet-released Spark 1.6.0. EMR will
> upgrade to Spark 1.6.0 once it is released.
>
> ~ Jonathan
>
On Thu, Nov 19, 2015 at 1:30 PM, Afshartous, Nick wrote:
>
> Hi,
>
> On Spark 1.5 on EMR 4.1 the message below appears in stderr in the Yarn UI.
>
>   ERROR StatusLogger No log4j2 configuration file found. Using default
> configuration: logging only errors to the console.
>
> I do see that there is
>
>/usr/lib/spark/conf/log4j.properties
>
> Can someone please advise on how to setup log4j properly.
>
> Thanks,
> --
>   Nick
>
> Notice: This communication is for the intended recipient(s) only and may
> contain confidential, proprietary, legally protected or privileged
> information of Turbine, Inc. If you are not the intended recipient(s),
> please notify the sender at once and delete this communication.
> Unauthorized use of the information in this communication is strictly
> prohibited and may be unlawful. For those recipients under contract with
> Turbine, Inc., the information in this communication is subject to the
> terms and conditions of any applicable contracts or agreements.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
> Notice: This communication is for the intended recipient(s) only and may
> contain confidential, proprietary, legally protected or privileged
> information of Turbine, Inc. If you are not the intended recipient(s),
> please notify the sender at once and delete this communication.
> Unauthorized use of the information in this communication is strictly
> prohibited and may be unlawful. For those recipients under contract with
> Turbine, Inc., the information in this communication is subject to the
> terms and conditions of any applicable contracts or agreements.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to control number of parquet files generated when using partitionBy

2015-11-20 Thread glennie
Turns out that calling repartition(numberOfParquetFilesPerPartition) just
before write will create exactly numberOfParquetFilesPerPartition files in
each folder. 

dataframe 
  .repartition(10)
  .write 
  .mode(SaveMode.Append) 
  .partitionBy("year", "month", "date", "country", "predicate") 
  .parquet(outputPath) 

I'm not sure why this works - I would have thought that repartition(10)
would partition the original dataframe into 10 partitions BEFORE partitionBy
does its magic, but apparently that is not the case... 
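
The likely reason it works: the writer's partitionBy does not reshuffle, it just splits
each task's output by the partition columns, so each of the 10 tasks writes at most one
file per output folder. A coalesce is a lighter alternative when only the number of output
files needs to shrink (a sketch; 10 is illustrative), though it also narrows the
parallelism of the upstream stage, so repartition may still be preferable when the
preceding computation is heavy.

  dataframe
    .coalesce(10)
    .write
    .mode(SaveMode.Append)
    .partitionBy("year", "month", "date", "country", "predicate")
    .parquet(outputPath)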



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-number-of-parquet-files-generated-when-using-partitionBy-tp25436p25437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming - stream between 2 applications

2015-11-20 Thread Saiph Kappa
Hi,

I have a basic spark streaming application like this:

«
...

val ssc = new StreamingContext(sparkConf, Duration(batchMillis))
val rawStreams = (1 to numStreams).map(_ =>
  ssc.rawSocketStream[String](host, port, StorageLevel.MEMORY_ONLY_SER)).toArray
val union = ssc.union(rawStreams)

union.flatMap(line => line.split(' ')).foreachRDD(rdd => {

  // TODO

}
...
»


My question is: what is the best and fastest way to send the resulting rdds
as input to be consumed by another spark streaming application?

I tried to add this code in place of the "TODO" comment:

«
val serverSocket = new ServerSocket(9998)
while (true) {
  val socket = serverSocket.accept()
  @transient val out = new PrintWriter(socket.getOutputStream)
  try {
rdd.foreach(out.write)
  } catch {
case e: IOException =>
  socket.close()
  }
}
»


I also tried to create a thread in the driver application code to launch the
socket server and then share state (the PrintWriter object) between
the driver program and tasks.
But got an exception saying that task is not serializable -
PrintWriter is not serializable
(despite the @transient annotation). I know this is not a very elegant
solution, but what other
directions should I explore?

Thanks.
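
One way around the NotSerializableException in the snippet above (a hedged sketch, and
still not an elegant pipeline): create the socket and writer inside foreachPartition so
they are constructed on the executors instead of being captured from the driver. The
downstream host and port are assumptions, and the direction is inverted compared with the
original snippet: the first application connects out to a listening consumer rather than
holding a ServerSocket open inside the transformation.

  union.flatMap(line => line.split(' ')).foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      val socket = new java.net.Socket("downstream-host", 9998)
      val out = new java.io.PrintWriter(socket.getOutputStream, true)
      try {
        iter.foreach(line => out.println(line))
      } finally {
        out.close()
        socket.close()
      }
    }
  }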


Re: RE: Error not found value sqlContext

2015-11-20 Thread satish chandra j
Hi All,
I am getting this error while generating the executable Jar file itself in
Eclipse, if the Spark application code has the "import sqlContext.implicits._"
line in it. The Spark application code works fine if the above mentioned line
does not exist, as I have tested by fetching data from an RDBMS by
implementing JDBCRDD.

I tried a couple of DataFrame-related methods, and most of them error out
stating that the method has been overloaded.

Please let me know if any further inputs are needed to analyze this.

Regards,
Satish Chandra

On Fri, Nov 20, 2015 at 5:46 PM, prosp4300  wrote:

>
> Looks like a classpath problem. If you can provide the command you used to
> run your application and the SPARK_HOME environment variable, it will help
> others to identify the root problem.
>
>
> On Nov 20, 2015 at 18:59, Satish wrote:
>
> Hi Michael,
> As my current Spark version is 1.4.0, then why does it error out as "error: not
> found: value sqlContext" when I have "import sqlContext.implicits._" in my
> Spark job?
>
> Regards
> Satish Chandra
> --
> From: Michael Armbrust 
> Sent: ‎20-‎11-‎2015 01:36
> To: satish chandra j 
> Cc: user ; hari krishna 
> Subject: Re: Error not found value sqlContext
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13
>
> On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> HI All,
>> we have recently migrated from Spark 1.2.1 to Spark 1.4.0, I am fetching
>> data from an RDBMS using JDBCRDD and register it as temp table to perform
>> SQL query
>>
>> Below approach is working fine in Spark 1.2.1:
>>
>> JDBCRDD --> apply map using Case Class --> apply createSchemaRDD -->
>> registerTempTable --> perform SQL Query
>>
>> but now as createSchemaRDD is not supported in Spark 1.4.0
>>
>> JDBCRDD --> apply map using Case Class with* .toDF()* -->
>> registerTempTable --> perform SQL query on temptable
>>
>>
>> JDBCRDD --> apply map using Case Class --> RDD*.toDF()*.registerTempTable
>> --> perform SQL query on temptable
>>
>> Only solution I get everywhere is to  use "import sqlContext.implicits._"
>> after val SQLContext = new org.apache.spark.sql.SQLContext(sc)
>>
>> But it errors with the two generic errors
>>
>> *1. error: not found: value sqlContext*
>>
>> *2. value toDF is not a member of org.apache.spark.rdd.RDD*
>>
>>
>>
>>
>>
>>
>
>
>
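
A minimal compiling sketch of the 1.4-style pattern under discussion (class, object, and
table names are illustrative): the implicits are imported from the sqlContext val itself,
so that val has to be declared before the import, and the case class used with toDF should
be defined outside the method.

  import org.apache.spark.{SparkConf, SparkContext}

  case class Person(name: String, age: Int)   // top level, so toDF can derive the schema

  object SqlContextExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("sqlContext-example"))
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext.implicits._           // imports from the val declared just above

      val df = sc.parallelize(Seq(("a", 30), ("b", 25)))
                 .map { case (n, a) => Person(n, a) }
                 .toDF()
      df.registerTempTable("people")
      sqlContext.sql("SELECT name FROM people WHERE age > 26").show()
    }
  }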


Re: RE: Error not found value sqlContext

2015-11-20 Thread prosp4300

Looks like a classpath problem. If you can provide the command you used to run
your application and the SPARK_HOME environment variable, it will help others to
identify the root problem.



On Nov 20, 2015 at 18:59, Satish wrote:
Hi Michael,
As my current Spark version is 1.4.0, then why does it error out as "error: not 
found: value sqlContext" when I have "import sqlContext.implicits._" in my 
Spark job?

Regards
Satish Chandra
From: Michael Armbrust
Sent: ‎20-‎11-‎2015 01:36
To: satish chandra j
Cc: user; hari krishna
Subject: Re: Error not found value sqlContext


http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13



On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j  
wrote:

HI All,
we have recently migrated from Spark 1.2.1 to Spark 1.4.0, I am fetching data 
from an RDBMS using JDBCRDD and register it as temp table to perform SQL query


Below approach is working fine in Spark 1.2.1:


JDBCRDD --> apply map using Case Class --> apply createSchemaRDD --> 
registerTempTable --> perform SQL Query


but now as createSchemaRDD is not supported in Spark 1.4.0



JDBCRDD --> apply map using Case Class with .toDF() --> registerTempTable --> 
perform SQL query on temptable




JDBCRDD --> apply map using Case Class --> RDD.toDF().registerTempTable --> 
perform SQL query on temptable



Only solution I get everywhere is to  use "import sqlContext.implicits._" after 
val SQLContext = new org.apache.spark.sql.SQLContext(sc)


But it errors with the two generic errors


1. error: not found: value sqlContext


2. value toDF is not a member of org.apache.spark.rdd.RDD













How to control number of parquet files generated when using partitionBy

2015-11-20 Thread glennie
I have a DataFrame that I need to write to S3 according to a specific
partitioning. The code looks like this:

dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)

The partitionBy splits the data into a fairly large number of folders (~400)
with just a little bit of data (~1GB) in each. And here comes the problem -
because the default value of spark.sql.shuffle.partitions is 200, the 1GB of
data in each folder is split into 200 small parquet files, resulting in
roughly 80,000 parquet files being written in total. This is not optimal for
a number of reasons and I would like to avoid this.

I could perhaps set the spark.sql.shuffle.partitions to a much smaller
number, say 10, but as I understand this setting also controls the number of
partitions for shuffles in joins and aggregation, so I don't really want to
change this.

Is there another way to control how many files are written?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-number-of-parquet-files-generated-when-using-partitionBy-tp25436.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Error not found value sqlContext

2015-11-20 Thread Satish
Hi Michael,
As my current Spark version is 1.4.0, then why does it error out as "error: not 
found: value sqlContext" when I have "import sqlContext.implicits._" in my 
Spark job?

Regards 
Satish Chandra

-Original Message-
From: "Michael Armbrust" 
Sent: ‎20-‎11-‎2015 01:36
To: "satish chandra j" 
Cc: "user" ; "hari krishna" 
Subject: Re: Error not found value sqlContext

http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13



On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j  
wrote:

HI All,
we have recently migrated from Spark 1.2.1 to Spark 1.4.0, I am fetching data 
from an RDBMS using JDBCRDD and register it as temp table to perform SQL query


Below approach is working fine in Spark 1.2.1:


JDBCRDD --> apply map using Case Class --> apply createSchemaRDD --> 
registerTempTable --> perform SQL Query


but now as createSchemaRDD is not supported in Spark 1.4.0



JDBCRDD --> apply map using Case Class with .toDF() --> registerTempTable --> 
perform SQL query on temptable




JDBCRDD --> apply map using Case Class --> RDD.toDF().registerTempTable --> 
perform SQL query on temptable



Only solution I get everywhere is to  use "import sqlContext.implicits._" after 
val SQLContext = new org.apache.spark.sql.SQLContext(sc)


But it errors with the two generic errors


1. error: not found: value sqlContext


2. value toDF is not a member of org.apache.spark.rdd.RDD

spark-shell issue Job in illegal state & sparkcontext not serializable

2015-11-20 Thread Balachandar R.A.
Hello users,

In one of my usecases, I need to launch a spark job from spark-shell. My
input file is in HDFS and I am using NewHadoopRDD to construct RDD out of
this input file as it uses custom input format.

  val hConf = sc.hadoopConfiguration
  var job = new Job(hConf)

  FileInputFormat.setInputPaths(job, new Path(path))

  var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
    classOf[IntWritable],
    classOf[BytesWritable],
    job.getConfiguration()
  )


However, when I run these commands, especially the 2nd line, in spark-shell,
I get the below exception

java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING


So, I put these lines inside a method in another class. In my driver code,
I am calling the method.


 // obtain RDD for the input file in HDFS

  var hRDD = cas.getRDD("/user/bala/RDF_STORE/MERGED_DAR.DAT_256")



The above line constructs the Hadoop RDD and returns the handle to the driver
code. This works fine now, which means that I could work around the
IllegalStateException issue. However, when I run the below command


  // runs the map and collect the results from distributed machines

  val result = hRDD.mapPartitionsWithInputSplit{ (split, iter) =>
cas.extractCAS(split, iter)}.collect()




I get an error "Caused by: java.io.NotSerializableException:
org.apache.spark.SparkContext"


Can someone help with how I can work around this problem? This code works
perfectly when I use spark-submit, but I cannot use that here. My ultimate
idea is to run the driver code from Zeppelin, and hence I am testing this
code in spark-shell.



with regards

Bala
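
A hedged sketch of one way around the NotSerializableException (the extraction logic below
is a placeholder, not the real extractCAS): keep the SparkContext out of the object whose
method is called inside the closure, so the closure only captures serializable state. The
hRDD can still be built by a helper that holds sc, as long as that helper is not referenced
inside mapPartitionsWithInputSplit.

  import org.apache.hadoop.io.{BytesWritable, IntWritable}
  import org.apache.hadoop.mapreduce.InputSplit

  object CasExtractor extends Serializable {
    // no SparkContext field here; only this object is referenced on the executors
    def extractCAS(split: InputSplit,
                   iter: Iterator[(IntWritable, BytesWritable)]): Iterator[String] =
      iter.map { case (k, v) => s"${k.get}:${v.getLength}" }   // placeholder logic
  }

  // driver code, with hRDD built as in the thread:
  // val result = hRDD.mapPartitionsWithInputSplit { (split, iter) =>
  //   CasExtractor.extractCAS(split, iter)
  // }.collect()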


RE: Re: has any spark write orc document

2015-11-20 Thread Ewan Leith
Looking in the code

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala

I don't think any of that advanced functionality is supported, sorry - there is
a parameters option, but I don't think it's used for much.

Ewan

From: zhangjp [mailto:592426...@qq.com]
Sent: 20 November 2015 07:47
To: Fengdong Yu 
Cc: user 
Subject: Re: has any spark write orc document

Thanks to Jeff Zhang and FengDong. When I use Hive I know how to set ORC table 
properties, but I don't know how to set ORC file properties (for example 
strip.size, index.rows) when I write an ORC file using the Spark API.
-- Original Message --
From: "Fengdong Yu" <fengdo...@everstring.com>
Sent: Friday, November 20, 2015 3:19 PM
To: "zhangjp" <592426...@qq.com>;
Cc: "user" <user@spark.apache.org>;
Subject: Re: has any spark write orc document
You can use DataFrame:

df.write.format("orc").save("")



On Nov 20, 2015, at 2:59 PM, zhangjp 
<592426...@qq.com> wrote:

Hi,
Is there any document for writing ORC from Spark, like the Parquet document?
 http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

Thanks



Question About Task Number In A spark Stage

2015-11-20 Thread Gerald-G
Hi:
  Recently we tried to submit our Spark apps in yarn-client mode
  and found that the number of tasks in a stage is 2810.

  But in Spark standalone mode, the same apps need far fewer tasks.

  Through debug info, we know that most of these 2810 tasks run on NULL
DATA.

   How can I tune this?

SUBMIT COMMAND IS :

 spark-submit --num-executors 200 --conf spark.default.parallelism=32
--conf spark.sql.shuffle.partitions=8 --jars
mysql-connector-java.jar,log4j-api-2.3.jar,log4j-core-2.3.jar --master
yarn-client

INFO:

[Stage 0:==>  (1 + 1) / 2][Stage 1:>(0 + 2) / 2][Stage 3:(125 + 9) /
2433]2]


Notice that in stage 3 there are 2433 tasks, but in standalone mode there
are far fewer.

Thanks
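
A hedged diagnostic sketch (the RDD name is hypothetical): the task count of a stage tracks
the number of partitions of whatever feeds it (input splits or a previous shuffle), so
inspecting and compacting the partition count is one place to start; 64 is illustrative.

  println(someInputRdd.partitions.length)    // how many tasks this RDD will produce
  val compacted = someInputRdd.coalesce(64)  // merge mostly-empty partitions without a shuffle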