date:20160510




[root@ES01 test]# jps
10409 Master
12578 CoarseGrainedExecutorBackend
24089 NameNode
17705 Jps
24184 DataNode
10603 Worker
12420 SparkSubmit






[root@ES01 test]# ps -awx | grep -i spark | grep java
10409 ?Sl 1:52 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 
ES01 --port 7077 --webui-port 8080
10603 ?Sl 6:50 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://ES01:7077
12420 ?Sl18:47 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit 
--master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 
--executor-memory 4G --num-executors 1 --total-executor-cores 1 
/opt/flowSpark/sparkStream/ForAsk01.py
12578 ?Sl38:18 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname 
10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url 
spark://Worker@10.79.148.184:52660





在 2016-05-11 13:18:10，"Mich Talebzadeh"  写道：

what does jps returning?


jps
16738 ResourceManager
14786 Worker
17059 JobHistoryServer
12421 QuorumPeerMain
9061 RunJar
9286 RunJar
5190 SparkSubmit
16806 NodeManager
16264 DataNode
16138 NameNode
16430 SecondaryNameNode
22036 SparkSubmit
9557 Jps
13240 Kafka
2522 Master


and


ps -awx | grep -i spark | grep java





Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 11 May 2016 at 03:01, 李明伟  wrote:

Hi Mich


From the ps command. I can find four process. 10409 is the master and 10603 is 
the worker. 12420 is the driver program and 12578 should be the executor 
(worker). Am I right? 
So you mean the 12420 is actually running both the driver and the worker role?


[root@ES01 ~]# ps -awx | grep spark | grep java
10409 ?Sl 1:40 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 
ES01 --port 7077 --webui-port 8080
10603 ?Sl 6:00 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://ES01:7077
12420 ?Sl 6:34 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit 
--master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 
--executor-memory 4G --num-executors 1 --total-executor-cores 1 
/opt/flowSpark/sparkStream/ForAsk01.py
12578 ?Sl13:16 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf

Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh

what does jps returning?

jps
16738 ResourceManager
14786 Worker
17059 JobHistoryServer
12421 QuorumPeerMain
9061 RunJar
9286 RunJar
5190 SparkSubmit
16806 NodeManager
16264 DataNode
16138 NameNode
16430 SecondaryNameNode
22036 SparkSubmit
9557 Jps
13240 Kafka
2522 Master

and

ps -awx | grep -i spark | grep java


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 11 May 2016 at 03:01, 李明伟  wrote:

> Hi Mich
>
> From the ps command. I can find four process. 10409 is the master and
> 10603 is the worker. 12420 is the driver program and 12578 should be the
> executor (worker). Am I right?
> So you mean the 12420 is actually running both the driver and the worker
> role?
>
> [root@ES01 ~]# ps -awx | grep spark | grep java
> 10409 ?Sl 1:40 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
> --ip ES01 --port 7077 --webui-port 8080
> 10603 ?Sl 6:00 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
> --webui-port 8081 spark://ES01:7077
> 12420 ?Sl 6:34 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
> --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
> --executor-memory 4G --num-executors 1 --total-executor-cores 1
> /opt/flowSpark/sparkStream/ForAsk01.py
> 12578 ?Sl13:16 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
> CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname
> 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url
> spark://Worker@10.79.148.184:52660
>
>
>
>
>
>
>
> At 2016-05-11 09:03:21, "Mich Talebzadeh" 
> wrote:
>
> hm,
>
> This is a standalone mode.
>
> When you are running Spark in Standalone mode, you only have one worker
> that lives within the driver JVM process that you start when you start
> spark-shell or spark-submit.
>
> However, since driver-memory setting encapsulates the JVM, you will need
> to set the amount of *driver memory *for any non-default value *before
> starting JVM by providing the new value:*
>
>
>
>
> ${SPARK_HOME}/bin/spark-submit --driver-memory 5g
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 May 2016 at 01:22, 李明伟  wrote:
>
>> I actually provided them in submit command here:
>>
>> nohup ./bin/spark-submit   --master spark://ES01:7077 --executor-memory
>> 4G --num-executors 1 --total-executor-cores 1 --conf
>> "spark.storage.memoryFraction=0.2"  ./mycode.py 1>a.log 2>b.log &
>>
>>
>>
>>
>>
>>
>>
>> At 2016-05-10 21:19:06, "Mich Talebzadeh" 
>> wrote:
>>
>> Hi Mingwei,
>>
>> In your Spark conf setting what are you providing for these parameters. *Are
>> you capping them?*
>>
>> For example
>>
>>   val conf = new SparkConf().
>>setAppName("AppName").
>>setMaster("local[2]").
>>set("spark.executor.memory", "4G").
>>set("spark.cores.max", "2").
>>set("spark.driver.allowMultipleContexts", "true")
>>   val sc = new SparkContext(conf)
>>
>> I assume you are running in standalone mode so each worker/aka
>> slave grabs all the available cores and allocates the rem

Re:Re: Will the HiveContext cause memory leak ?

Hi  Ted


Spark version :  spark-1.6.0-bin-hadoop2.6
I tried increase the memory of executor. Still have the same problem.
I can use jmap to capture some thing. But the output is too difficult to 
understand. 










在 2016-05-11 11:50:14，"Ted Yu"  写道：

Which Spark release are you using ?


I assume executor crashed due to OOME.


Did you have a chance to capture jmap on the executor before it crashed ?


Have you tried giving more memory to the executor ?


Thanks


On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com wrote:
I submit my code to a spark stand alone cluster. Find the memory usage
executor process keeps growing. Which cause the program to crash.

I modified the code and submit several times. Find below 4 line may causing
the issue

dataframe =
dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec =
Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret =
dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'],
rank.alias('rank')).filter("rank<=2")

It looks a little complicated but it is just some Window function on
dataframe. I use the HiveContext because SQLContext do not support window
function yet. Without the 4 line, my code can run all night. Adding them
will cause the memory leak. Program will crash in a few hours.

I will provided the whole code (50 lines)here.  ForAsk01.py

Please advice me if it is a bug..

Also here is the submit command

nohup ./bin/spark-submit  \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

How to resolve Scheduling delay in Spark streaming applications?

2016-05-10 Thread Hemalatha A

Hello,

We are facing large  Scheduling delay in our  Spark streaming application.
Not sure how to debug why the delay is happening. We have all the tuning
possible on Spark side.

Can someone advice how to debug the cause of the delay and some tips for
resolving it please?

-- 


Regards
Hemalatha

Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread Ted Yu

Which Spark release are you using ?

I assume executor crashed due to OOME.

Did you have a chance to capture jmap on the executor before it crashed ?

Have you tried giving more memory to the executor ?

Thanks

On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com 
wrote:

> I submit my code to a spark stand alone cluster. Find the memory usage
> executor process keeps growing. Which cause the program to crash.
>
> I modified the code and submit several times. Find below 4 line may causing
> the issue
>
> dataframe =
>
> dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
> windowSpec =
> Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
> rank = func.dense_rank().over(windowSpec)
> ret =
>
> dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'],
> rank.alias('rank')).filter("rank<=2")
>
> It looks a little complicated but it is just some Window function on
> dataframe. I use the HiveContext because SQLContext do not support window
> function yet. Without the 4 line, my code can run all night. Adding them
> will cause the memory leak. Program will crash in a few hours.
>
> I will provided the whole code (50 lines)here.  ForAsk01.py
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py
> >
> Please advice me if it is a bug..
>
> Also here is the submit command
>
> nohup ./bin/spark-submit  \
> --master spark://ES01:7077 \
> --executor-memory 4G \
> --num-executors 1 \
> --total-executor-cores 1 \
> --conf "spark.storage.memoryFraction=0.2"  \
> ./ForAsk.py 1>a.log 2>b.log &
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Will the HiveContext cause memory leak ?

2016-05-10 Thread kramer2...@126.com

I submit my code to a spark stand alone cluster. Find the memory usage
executor process keeps growing. Which cause the program to crash.

I modified the code and submit several times. Find below 4 line may causing
the issue

dataframe =
dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec =
Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret =
dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'],
rank.alias('rank')).filter("rank<=2")

It looks a little complicated but it is just some Window function on
dataframe. I use the HiveContext because SQLContext do not support window
function yet. Without the 4 line, my code can run all night. Adding them
will cause the memory leak. Program will crash in a few hours.

I will provided the whole code (50 lines)here.  ForAsk01.py
  
Please advice me if it is a bug..

Also here is the submit command 

nohup ./bin/spark-submit  \  
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

What does the spark stand alone cluster do?

2016-05-10 Thread kramer2...@126.com

Hello.

My question here is what the spark stand alone cluster do here. Because when
we submit program like below

./bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G
--num-executors 1 --total-executor-cores 1 --conf
"spark.storage.memoryFraction=0.2" 


We specified the resource allocation manually
We specified the config manually 

Then what the cluster do here?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-does-the-spark-stand-alone-cluster-do-tp26920.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re:Re: Re: Re: Re: Re: How big the spark stream window could be ?

Hi Mich


From the ps command. I can find four process. 10409 is the master and 10603 is 
the worker. 12420 is the driver program and 12578 should be the executor 
(worker). Am I right? 
So you mean the 12420 is actually running both the driver and the worker role?


[root@ES01 ~]# ps -awx | grep spark | grep java
10409 ?Sl 1:40 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 
ES01 --port 7077 --webui-port 8080
10603 ?Sl 6:00 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://ES01:7077
12420 ?Sl 6:34 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit 
--master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 
--executor-memory 4G --num-executors 1 --total-executor-cores 1 
/opt/flowSpark/sparkStream/ForAsk01.py
12578 ?Sl13:16 java -cp 
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
 -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname 
10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url 
spark://Worker@10.79.148.184:52660










At 2016-05-11 09:03:21, "Mich Talebzadeh"  wrote:

hm,


This is a standalone mode.


When you are running Spark in Standalone mode, you only have one worker that 
lives within the driver JVM process that you start when you start spark-shell 
or spark-submit.



However, since driver-memory setting encapsulates the JVM, you will need to set 
the amount of driver memory for any non-default value before starting JVM by 
providing the new value:










${SPARK_HOME}/bin/spark-submit --driver-memory 5g










 













Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 11 May 2016 at 01:22, 李明伟  wrote:

I actually provided them in submit command here:


nohup ./bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G 
--num-executors 1 --total-executor-cores 1 --conf 
"spark.storage.memoryFraction=0.2"  ./mycode.py1>a.log 2>b.log &










At 2016-05-10 21:19:06, "Mich Talebzadeh"  wrote:

Hi Mingwei,


In your Spark conf setting what are you providing for these parameters. Are you 
capping them?


For example


  val conf = new SparkConf().
   setAppName("AppName").
   setMaster("local[2]").
   set("spark.executor.memory", "4G").
   set("spark.cores.max", "2").
   set("spark.driver.allowMultipleContexts", "true")
  val sc = new SparkContext(conf)


I assume you are running in standalone mode so each worker/aka slave grabs all 
the available cores and allocates the remaining memory on each host. Do not 
provide these in


Do not provide new values for these parameter meaning overwrite them in

${SPARK_HOME}/bin/spark-submit  --




HTH

























Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 10 May 2016 at 03:12, 李明伟  wrote:

Hi Mich


I added some more infor (the spark-env.sh setting and top command output in 
that thread.) Can you help to check pleas?


Regards
Mingwei






At 2016-05-09 23:45:19, "Mich Talebzadeh"  wrote:

I had a look at the thread.


This is what you have which I gather a standalone box in other words one worker 
node


bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G 
--num-executors 1 --total-executor-cores 1

Unable to write stream record to cassandra table with multiple columns

2016-05-10 Thread Anand N Ilkal

I am trying to write incoming stream data to database. Following is the example 
program, this code creates a thread to listen to incoming stream of data which 
is csv data. this data needs to be split with delimiter and the array of data 
needs to be pushed to database as separate columns in the TABLE.

object dbwrite {
  case class record(id: Long, time: java.sql.Timestamp, rx: Int, tx: Int, 
total: Int, multi: Double)
  def main(args: Array[String]) {
if (args.length < 2) {
  System.err.println("Usage: CustomReceiver  ")
  System.exit(1)
}

// Create the context with a 1 second batch size
val sparkConf = new SparkConf()
.set(“spark.cassandra.connection.host", "localhost")
.setAppName("dbwrite")
.set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val sc = ssc.sparkContext

// Create a input stream with the custom receiver on target ip:port and 
count the
// words in input stream of \n delimited text (eg. generated by 'nc')
val lines = ssc.receiverStream(new CustomReceiver(args(0), args(1).toInt))
val splitRdd = lines.map(line => line.split(",") )
//val wordCounts = splitRdd.map(x => (x, 1)).reduceByKey(_ + _)
// RDD[Array[String]

val yourRdd = splitRdd.flatMap(arr => {
  val id = arr(0).toLong
  val rx = arr(2).toInt
  val tx = arr(3).toInt
  val total = arr(4).toInt
  val mul = arr(5).toInt
  val parsedDate = new java.util.Date()
  val timestamp = new java.sql.Timestamp(parsedDate.getTime());
  val reco = records(id, timestamp, rx, tx, total, mul);
  Seq(reco)
})

yourRdd.foreachRDD { rdd =>
for(item <- rdd.collect().toArray)
  print(item)
}
val rec = sc.parallelize(Seq(yourRdd))
rec.saveToCassandra("records", "record", SomeColumns(“id”, "time", "rx", 
"tx", "total”, "multi"))

ssc.start()
ssc.awaitTermination()
  }
}
but spark does gives following error -
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Columns not found in 
org.apache.spark.streaming.dstream.DStream[dbwrite.records]: [mdn, time, rx, 
tx, total, multi]
at scala.Predef$.require(Predef.scala:233)
at 
com.datastax.spark.connector.mapper.DefaultColumnMapper.columnMapForWriting(DefaultColumnMapper.scala:108)
at 
com.datastax.spark.connector.writer.MappedToGettableDataConverter$$anon$1.(MappedToGettableDataConverter.scala:29)
at 
com.datastax.spark.connector.writer.MappedToGettableDataConverter$.apply(MappedToGettableDataConverter.scala:20)
at 
com.datastax.spark.connector.writer.DefaultRowWriter.(DefaultRowWriter.scala:17)
at 
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:31)
at 
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:29)
at 
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:272)
at 
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
at dbwrite$.main(dbwrite.scala:63)
at dbwrite.main(dbwrite.scala)
i am using spark-1.6.1 and cassandra 3.5
the TABLE already created on cassandra has same column names. But the column 
display in alphabetical order, but all columns are avaialble.
help me with the error.

thanks.

RE: Accessing Cassandra data from Spark Shell

2016-05-10 Thread Mohammed Guller

Yes, it is very simple to access Cassandra data using Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package
$SPARK_HOME/bin/spark-shell --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
val dfCassTable = sqlContext.read
 
.format("org.apache.spark.sql.cassandra")
 .options(Map( "table" 
-> "your_column_family", "keyspace" -> "your_keyspace"))
 .load()

From this point onward, you have complete access to the DataFrame API. You can 
even register it as a temporary table, if you would prefer to use SQL/HiveQL.

Mohammed
Author: Big Data Analytics with 
Spark

From: Ben Slater [mailto:ben.sla...@instaclustr.com]
Sent: Monday, May 9, 2016 9:28 PM
To: u...@cassandra.apache.org; user
Subject: Re: Accessing Cassandra data from Spark Shell

You can use SparkShell to access Cassandra via the Spark Cassandra connector. 
The getting started article on our support page will probably give you a good 
steer to get started even if you’re not using Instaclustr: 
https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-

Cheers
Ben

On Tue, 10 May 2016 at 14:08 Cassa L 
mailto:lcas...@gmail.com>> wrote:
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do it? 
Can you use HiveContext for Cassandra data? I'm using community version of 
Cassandra-3.0

Thanks,
LCassa
--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798

RE: Reading table schema from Cassandra

2016-05-10 Thread Mohammed Guller

You can create a DataFrame directly from a Cassandra table using something like 
this:

val dfCassTable = 
sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map( "table" 
-> "your_column_family", "keyspace" -> "your_keyspace")).load()

Then, you can get schema:
val dfCassTableSchema = dfCassTable.schema

Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: justneeraj [mailto:justnee...@gmail.com] 
Sent: Tuesday, May 10, 2016 2:22 AM
To: user@spark.apache.org
Subject: Reading table schema from Cassandra

Hi,

We are using Spark Cassandra connector for our app. 

And I am trying to create higher level roll up tables. e.g. minutes table from 
seconds table. 

If my tables are already defined. How can I read the schema of table?
So that I can load them in the Dataframe and create the aggregates. 

Any help will be really thankful. 

Thanks,
Neeraj 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-table-schema-from-Cassandra-tp26915.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh

hm,

This is a standalone mode.

When you are running Spark in Standalone mode, you only have one worker
that lives within the driver JVM process that you start when you start
spark-shell or spark-submit.

However, since driver-memory setting encapsulates the JVM, you will need to
set the amount of *driver memory *for any non-default value *before
starting JVM by providing the new value:*




${SPARK_HOME}/bin/spark-submit --driver-memory 5g










Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 11 May 2016 at 01:22, 李明伟  wrote:

> I actually provided them in submit command here:
>
> nohup ./bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G
> --num-executors 1 --total-executor-cores 1 --conf
> "spark.storage.memoryFraction=0.2"  ./mycode.py 1>a.log 2>b.log &
>
>
>
>
>
>
>
> At 2016-05-10 21:19:06, "Mich Talebzadeh" 
> wrote:
>
> Hi Mingwei,
>
> In your Spark conf setting what are you providing for these parameters. *Are
> you capping them?*
>
> For example
>
>   val conf = new SparkConf().
>setAppName("AppName").
>setMaster("local[2]").
>set("spark.executor.memory", "4G").
>set("spark.cores.max", "2").
>set("spark.driver.allowMultipleContexts", "true")
>   val sc = new SparkContext(conf)
>
> I assume you are running in standalone mode so each worker/aka
> slave grabs all the available cores and allocates the remaining memory on
> each host. Do not provide these in
>
> Do not provide new values for these parameter meaning overwrite them in
>
> *${SPARK_HOME}/bin/spark-submit  --*
>
>
> HTH
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 May 2016 at 03:12, 李明伟  wrote:
>
>> Hi Mich
>>
>> I added some more infor (the spark-env.sh setting and top command output
>> in that thread.) Can you help to check pleas?
>>
>> Regards
>> Mingwei
>>
>>
>>
>>
>>
>> At 2016-05-09 23:45:19, "Mich Talebzadeh" 
>> wrote:
>>
>> I had a look at the thread.
>>
>> This is what you have which I gather a standalone box in other words one
>> worker node
>>
>> bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G
>> --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log
>>
>> But what I don't understand why is using 80% of your RAM as opposed to
>> 25% of it (4GB/16GB) right?
>>
>> Where else have you set up these parameters for example in
>> $SPARK_HOME/con/spark-env.sh?
>>
>> Can you send the output of /usr/bin/free and top
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 9 May 2016 at 16:19, 李明伟  wrote:
>>
>>> Thanks for all the information guys.
>>>
>>> I wrote some code to do the test. Not using window. So only calculating
>>> data for each batch interval. I set the interval to 30 seconds also reduce
>>> the size of data to about 30 000 lines of csv.
>>> Means my code should calculation on 30 000 lines of CSV in 30 seconds. I
>>> think it is not a very heavy workload. But my spark stream code still crash.
>>>
>>> I send another post to the user list here
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>>>
>>> Is it possible for you to have a look please? Very appreciate.
>>>
>>>
>>>
>>>
>>>
>>> At 2016-05-09 17:49:22, "Saisai Shao"  wrote:
>>>
>>> Pease see the inline comments.
>>>
>>>
>>> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar 
>>> wrote:
>>>
 Thank you.

 So If I create spark streaming then


1. The streams will always need to be cached? It cannot be stored
in persistent storage

 You don't need to cache the stream explicitly if you don't have
>>> specific requirement, Spark will do it for you depends on different
>>> streaming sources (Kafka or socket).
>>>

1. The stream data cached will be distributed among all nodes of
Spark among executors
2. As I understand each Spark worker node has one executor that
includes cache. So the streaming data is distributed among these work 
 node
caches. For example if I have 4 worker nodes each cache will have a 
 quarter
of data (this assumes that cache size among worker nodes is the same.)

 Ideally, it will distributed evenly across the executors, also this is
>>> target for tuning. Normally it depen

Spark 1.6 Catalyst optimizer

2016-05-10 Thread Telmo Rodrigues

Hello,

I have a question about the Catalyst optimizer in Spark 1.6.

initial logical plan:

!'Project [unresolvedalias(*)]
!+- 'Filter ('t.id = 1)
!   +- 'Join Inner, Some(('t.id = 'u.id))
!  :- 'UnresolvedRelation `t`, None
!  +- 'UnresolvedRelation `u`, None


logical plan after optimizer execution:

Project [id#0L,id#1L]
!+- Filter (id#0L = cast(1 as bigint))
!   +- Join Inner, Some((id#0L = id#1L))
!  :- Subquery t
!  :  +- Relation[id#0L] JSONRelation
!  +- Subquery u
!  +- Relation[id#1L] JSONRelation


Shouldn't the optimizer push down predicates to subquery t in order to the
filter be executed before join?

Thanks

Re:Re: Re: Re: Re: How big the spark stream window could be ?

I actually provided them in submit command here:

nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G
--num-executors 1 --total-executor-cores 1 --conf
"spark.storage.memoryFraction=0.2" ./mycode.py1>a.log 2>b.log &

At 2016-05-10 21:19:06, "Mich Talebzadeh" wrote:

Hi Mingwei,

In your Spark conf setting what are you providing for these parameters. Are you
capping them?

For example

val conf = new SparkConf().
setAppName("AppName").
setMaster("local[2]").
set("spark.executor.memory", "4G").
set("spark.cores.max", "2").
set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)

I assume you are running in standalone mode so each worker/aka slave grabs all
the available cores and allocates the remaining memory on each host. Do not
provide these in

Do not provide new values for these parameter meaning overwrite them in

${SPARK_HOME}/bin/spark-submit --

HTH

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 10 May 2016 at 03:12, 李明伟 wrote:

Hi Mich

I added some more infor (the spark-env.sh setting and top command output in
that thread.) Can you help to check pleas?

Regards
Mingwei

At 2016-05-09 23:45:19, "Mich Talebzadeh" wrote:

I had a look at the thread.

This is what you have which I gather a standalone box in other words one worker
node

bin/spark-submit --master spark://ES01:7077 --executor-memory 4G
--num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log

But what I don't understand why is using 80% of your RAM as opposed to 25% of
it (4GB/16GB) right?

Where else have you set up these parameters for example in
$SPARK_HOME/con/spark-env.sh?

Can you send the output of /usr/bin/free and top

HTH

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 9 May 2016 at 16:19, 李明伟 wrote:

Thanks for all the information guys.

I wrote some code to do the test. Not using window. So only calculating data
for each batch interval. I set the interval to 30 seconds also reduce the size
of data to about 30 000 lines of csv.
Means my code should calculation on 30 000 lines of CSV in 30 seconds. I think
it is not a very heavy workload. But my spark stream code still crash.

I send another post to the user list here
http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html

Is it possible for you to have a look please? Very appreciate.

At 2016-05-09 17:49:22, "Saisai Shao" wrote:

Pease see the inline comments.

On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar wrote:

Thank you.

So If I create spark streaming then

The streams will always need to be cached? It cannot be stored in persistent
storage
You don't need to cache the stream explicitly if you don't have specific
requirement, Spark will do it for you depends on different streaming sources
(Kafka or socket).
The stream data cached will be distributed among all nodes of Spark among
executors
As I understand each Spark worker node has one executor that includes cache. So
the streaming data is distributed among these work node caches. For example if
I have 4 worker nodes each cache will have a quarter of data (this assumes that
cache size among worker nodes is the same.)
Ideally, it will distributed evenly across the executors, also this is target
for tuning. Normally it depends on several conditions like receiver
distribution, partition distribution.

The issue raises if the amount of streaming data does not fit into these 4
caches? Will the job crash?

On Monday, 9 May 2016, 10:16, Saisai Shao wrote:

No, each executor only stores part of data in memory (it depends on how the
partition are distributed and how many receivers you have).

For WindowedDStream, it will obviously cache the data in memory, from my
understanding you don't need to call cache() again.

On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar wrote:

hi,

so if i have 10gb of streaming data coming in does it require 10gb of memory in
each node?

also in that case why do we need using

dstream.cache()

thanks

On Monday, 9 May 2016, 9:58, Saisai Shao wrote:

It depends on you to write the Spark application, normally if data is already
on the persistent storage, there's no need to be put into memory. The reason
why Spark Streaming has to be stored in memory is that streaming source is not
persistent source, so you need to have a place to store the data.

On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote:

Thanks.
What if I use batch calculation instead of stream computing? Do I still need
that much memory? For example, if the 24 hour data set is 100

Re: Cluster Migration

Never mind! I figured it out by saving it as hadoopfile and passing the
codec to it. Thank you!

On Tuesday, May 10, 2016, Ajay Chander  wrote:

> Hi, I have a folder temp1 in hdfs which have multiple format files
> test1.txt, test2.avsc (Avro file) in it. Now I want to compress these files
> together and store it under temp2 folder in hdfs. Expecting that temp2
> folder will have one file test_compress.gz which has test1.txt and
> test2.avsc under it. Is there any possible/effiencient way to achieve this?
>
> Thanks,
> Aj
>
> On Tuesday, May 10, 2016, Ajay Chander  > wrote:
>
>> I will try that out. Thank you!
>>
>> On Tuesday, May 10, 2016, Deepak Sharma  wrote:
>>
>>> Yes that's what I intended to say.
>>>
>>> Thanks
>>> Deepak
>>> On 10 May 2016 11:47 pm, "Ajay Chander"  wrote:
>>>
 Hi Deepak,
Thanks for your response. If I am correct, you suggest reading
 all of those files into an rdd on the cluster using wholeTextFiles then
 apply compression codec on it, save the rdd to another Hadoop cluster?

 Thank you,
 Ajay

 On Tuesday, May 10, 2016, Deepak Sharma  wrote:

> Hi Ajay
> You can look at wholeTextFiles method of rdd[string,string] and then
> map each of rdd  to saveAsTextFile .
> This will serve the purpose .
> I don't think if anything default like distcp exists in spark
>
> Thanks
> Deepak
> On 10 May 2016 11:27 pm, "Ajay Chander"  wrote:
>
>> Hi Everyone,
>>
>> we are planning to migrate the data between 2 clusters and I see
>> distcp doesn't support data compression. Is there any efficient way to
>> compress the data during the migration ? Can I implement any spark job to
>> do this ? Thanks.
>>
>

Re: SparkSQL with large result size

2016-05-10 Thread Buntu Dev

Thanks Chris for pointing out the issue. I think I was able to get over
this issue by:

- repartitioning to increase the number of partitions (about 6k partitions)
- apply sort() on the resulting dataframe to coalesce into single sorted
partition file
- read the sorted file and then adding just limit() to get the desired
number of rows seem to have worked

Thanks everyone for the input!

On Tue, May 10, 2016 at 1:20 AM, Christophe Préaud <
christophe.pre...@kelkoo.com> wrote:

> Hi,
>
> You may be hitting this bug: SPARK-9879
> 
>
> In other words: did you try without the LIMIT clause?
>
> Regards,
> Christophe.
>
>
> On 02/05/16 20:02, Gourav Sengupta wrote:
>
> Hi,
>
> I have worked on 300GB data by querying it  from CSV (using SPARK CSV)
>  and writing it to Parquet format and then querying parquet format to query
> it and partition the data and write out individual csv files without any
> issues on a single node SPARK cluster installation.
>
> Are you trying to cache in the entire data? What is that you are trying to
> achieve in your used case?
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 5:59 PM, Ted Yu  wrote:
>
>> That's my interpretation.
>>
>> On Mon, May 2, 2016 at 9:45 AM, Buntu Dev < 
>> buntu...@gmail.com> wrote:
>>
>>> Thanks Ted, I thought the avg. block size was already low and less than
>>> the usual 128mb. If I need to reduce it further via parquet.block.size, it
>>> would mean an increase in the number of blocks and that should increase the
>>> number of tasks/executors. Is that the correct way to interpret this?
>>>
>>> On Mon, May 2, 2016 at 6:21 AM, Ted Yu < 
>>> yuzhih...@gmail.com> wrote:
>>>
 Please consider decreasing block size.

 Thanks

 > On May 1, 2016, at 9:19 PM, Buntu Dev < 
 buntu...@gmail.com> wrote:
 >
 > I got a 10g limitation on the executors and operating on parquet
 dataset with block size 70M with 200 blocks. I keep hitting the memory
 limits when doing a 'select * from t1 order by c1 limit 100' (ie, 1M).
 It works if I limit to say 100k. What are the options to save a large
 dataset without running into memory issues?
 >
 > Thanks!

>>>
>>>
>>
>
>
> --
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 158 Ter Rue du Temple 75003 Paris
> 425 093 069 RCS Paris
>
> Ce message et les pièces jointes sont confidentiels et établis à
> l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
> destinataire de ce message, merci de le détruire et d'en avertir
> l'expéditeur.
>

Re: Cluster Migration

Hi, I have a folder temp1 in hdfs which have multiple format files
test1.txt, test2.avsc (Avro file) in it. Now I want to compress these files
together and store it under temp2 folder in hdfs. Expecting that temp2
folder will have one file test_compress.gz which has test1.txt and
test2.avsc under it. Is there any possible/effiencient way to achieve this?

Thanks,
Aj

On Tuesday, May 10, 2016, Ajay Chander  wrote:

> I will try that out. Thank you!
>
> On Tuesday, May 10, 2016, Deepak Sharma  > wrote:
>
>> Yes that's what I intended to say.
>>
>> Thanks
>> Deepak
>> On 10 May 2016 11:47 pm, "Ajay Chander"  wrote:
>>
>>> Hi Deepak,
>>>Thanks for your response. If I am correct, you suggest reading
>>> all of those files into an rdd on the cluster using wholeTextFiles then
>>> apply compression codec on it, save the rdd to another Hadoop cluster?
>>>
>>> Thank you,
>>> Ajay
>>>
>>> On Tuesday, May 10, 2016, Deepak Sharma  wrote:
>>>
 Hi Ajay
 You can look at wholeTextFiles method of rdd[string,string] and then
 map each of rdd  to saveAsTextFile .
 This will serve the purpose .
 I don't think if anything default like distcp exists in spark

 Thanks
 Deepak
 On 10 May 2016 11:27 pm, "Ajay Chander"  wrote:

> Hi Everyone,
>
> we are planning to migrate the data between 2 clusters and I see
> distcp doesn't support data compression. Is there any efficient way to
> compress the data during the migration ? Can I implement any spark job to
> do this ? Thanks.
>

Reliability of JMS Custom Receiver in Spark Streaming JMS

2016-05-10 Thread Sourav Mazumder

Hi,

Need to get bit more understanding of reliability aspects of the Custom
Receivers in the context of the code in spark-streaming-jms
https://github.com/mattf/spark-streaming-jms.

Based on the documentation in
http://spark.apache.org/docs/latest/streaming-custom-receivers.html#receiver-reliability,
I understand that if the store api is called with multiple records the
message is reliably stored as it is a blocking call. On the other hand if
the store api is called with a single record then it is not reliable as the
call is returned back to the calling program before the message is stored
appropriately.

Given that I have few questions

1. Which are the store APIs that relate to multiple records ? Are they the
ones which use scala.collection.mutable.ArrayBufferhttp://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/receiver/Receiver.html>>,

scala.collection.Iteratorhttp://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/receiver/Receiver.html>>
and
java.util.Iteratorhttp://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/receiver/Receiver.html>>
in the parameter signature?

2. Is there a sample code which can show how to create multiple records
like that and send the same to appropriate store API ?

3. If I take the example of spark-streaming-jms, the onMessage method of
JMSReceiver class calls store API with one JMSEvent. Does that mean that
this code does not guarantee the reliability of storage of the message
received even if storage level specified to MEMORY_AND_DISK_SER_2 ?

Regards,
Sourav

Re: Save DataFrame to HBase

2016-05-10 Thread Ted Yu

I think so.

Please refer to the table population tests in (master branch):
hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/DefaultSourceSuite.scala

Cheers

On Tue, May 10, 2016 at 2:53 PM, Benjamin Kim  wrote:

> Ted,
>
> Will the hbase-spark module allow for creating tables in Spark SQL that
> reference the hbase tables underneath? In this way, users can query using
> just SQL.
>
> Thanks,
> Ben
>
> On Apr 28, 2016, at 3:09 AM, Ted Yu  wrote:
>
> Hbase 2.0 release likely would come after Spark 2.0 release.
>
> There're other features being developed in hbase 2.0
> I am not sure when hbase 2.0 would be released.
>
> The refguide is incomplete.
> Zhan has assigned the doc JIRA to himself. The documentation would be done
> after fixing bugs in hbase-spark module.
>
> Cheers
>
> On Apr 27, 2016, at 10:31 PM, Benjamin Kim  wrote:
>
> Hi Ted,
>
> Do you know when the release will be? I also see some documentation for
> usage of the hbase-spark module at the hbase website. But, I don’t see an
> example on how to save data. There is only one for reading/querying data.
> Will this be added when the final version does get released?
>
> Thanks,
> Ben
>
> On Apr 21, 2016, at 6:56 AM, Ted Yu  wrote:
>
> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can
> do this.
>
> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim  wrote:
>
>> Has anyone found an easy way to save a DataFrame into HBase?
>>
>> Thanks,
>> Ben
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
>

Re: Save DataFrame to HBase

2016-05-10 Thread Benjamin Kim

Ted,

Will the hbase-spark module allow for creating tables in Spark SQL that 
reference the hbase tables underneath? In this way, users can query using just 
SQL.

Thanks,
Ben

> On Apr 28, 2016, at 3:09 AM, Ted Yu  wrote:
> 
> Hbase 2.0 release likely would come after Spark 2.0 release. 
> 
> There're other features being developed in hbase 2.0
> I am not sure when hbase 2.0 would be released. 
> 
> The refguide is incomplete. 
> Zhan has assigned the doc JIRA to himself. The documentation would be done 
> after fixing bugs in hbase-spark module. 
> 
> Cheers
> 
> On Apr 27, 2016, at 10:31 PM, Benjamin Kim  > wrote:
> 
>> Hi Ted,
>> 
>> Do you know when the release will be? I also see some documentation for 
>> usage of the hbase-spark module at the hbase website. But, I don’t see an 
>> example on how to save data. There is only one for reading/querying data. 
>> Will this be added when the final version does get released?
>> 
>> Thanks,
>> Ben
>> 
>>> On Apr 21, 2016, at 6:56 AM, Ted Yu >> > wrote:
>>> 
>>> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can 
>>> do this.
>>> 
>>> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim >> > wrote:
>>> Has anyone found an easy way to save a DataFrame into HBase?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>> 
>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>> 
>>> 
>>> 
>>

Re: Pyspark accumulator



On May 10, 2016 2:24:41 PM EDT, Abi  wrote:
>1. How come pyspark does not provide the localvalue function like scala
>?
>
>2. Why is pyspark more restrictive than scala ?

Re: Accumulator question



On May 9, 2016 8:24:06 PM EDT, Abi  wrote:
>I am splitting an integer array in 2 partitions and using an
>accumulator  to sum the array. problem is
>
>1.  I am not seeing execution time becoming half of a linear summing.
>
>2. The second node (from looking at timestamps) takes 3 times as long
>as the first node. This gives the impression it is "waiting" for the
>first node to finish.
>
>Hence,  I am given the impression using accumulator.sum () in the
>kernel and rdd.foreach (kernel) is making things sequential. 
>
>Any api/setting suggestions where I could make things parallel ?
>
>
>

Re: pyspark mappartions ()



On May 10, 2016 2:20:25 PM EDT, Abi  wrote:
>Is there any example of this ? I want to see how you write the the
>iterable example

Re: Evenly balance the number of items in each RDD partition

Hi Xinh,

Thanks! Custom partitioner with partitionBy() did the job.


On Tue, May 10, 2016 at 11:36 PM, Xinh Huynh  wrote:

> Hi Ayman,
>
> Have you looked at this:
> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
>
> It recommends defining a custom partitioner and (PairRDD) partitionBy
> method to accomplish this.
>
> Xinh
>
> On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil 
> wrote:
>
>> And btw, I'm using the Python API if this makes any difference.
>>
>> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
>> wrote:
>>
>>> Hi Don,
>>>
>>> This didn't help. My original rdd is already created using 10
>>> partitions. As a matter of fact, after trying with rdd.coalesce(10,
>>> shuffle = true) out of curiosity, the rdd partitions became even more
>>> imbalanced:
>>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096),
>>> (6, 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>>
>>>
>>> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>>>
 You can call rdd.coalesce(10, shuffle = true) and the returning rdd
 will be evenly balanced.  This obviously triggers a shuffle, so be advised
 it could be an expensive operation depending on your RDD size.

 -Don

 On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
 wrote:

> Hello,
>
> I have 50,000 items parallelized into an RDD with 10 partitions, I
> would like to evenly split the items over the partitions so:
> 50,000/10 = 5,000 in each RDD partition.
>
> What I get instead is the following (partition index, partition count):
> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
> 5120), (7, 5120), (8, 5120), (9, 4944)]
>
> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the
> partitions are imbalanced.
>
> Is there a way to do that?
>
> Thank you,
> Ayman
>



 --
 Donald Drake
 Drake Consulting
 http://www.drakeconsulting.com/
 https://twitter.com/dondrake 
 800-733-2143

>>>
>>>
>>
>

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake

Well, for Python, it should be rdd.coalesce(10, shuffle=True)

I have had good success with this using the Scala API in Spark 1.6.1.

-Don

On Tue, May 10, 2016 at 3:15 PM, Ayman Khalil  wrote:

> And btw, I'm using the Python API if this makes any difference.
>
> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
> wrote:
>
>> Hi Don,
>>
>> This didn't help. My original rdd is already created using 10 partitions.
>> As a matter of fact, after trying with rdd.coalesce(10, shuffle =
>> true) out of curiosity, the rdd partitions became even more imbalanced:
>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
>> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>
>>
>> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>>
>>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>>> be evenly balanced.  This obviously triggers a shuffle, so be advised it
>>> could be an expensive operation depending on your RDD size.
>>>
>>> -Don
>>>
>>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
>>> wrote:
>>>
 Hello,

 I have 50,000 items parallelized into an RDD with 10 partitions, I
 would like to evenly split the items over the partitions so:
 50,000/10 = 5,000 in each RDD partition.

 What I get instead is the following (partition index, partition count):
 [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
 5120), (7, 5120), (8, 5120), (9, 4944)]

 the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
 are imbalanced.

 Is there a way to do that?

 Thank you,
 Ayman

>>>
>>>
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake 
>>> 800-733-2143
>>>
>>
>>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143

Not able pass 3rd party jars to mesos executors

2016-05-10 Thread gpatcham

Hi All,

I'm using --jars option in spark-submit to send 3rd party jars . But I don't
see they are actually passed to mesos slaves. Getting Noclass found
exceptions.

This is how I'm using --jars option

--jars hdfs://namenode:8082/user/path/to/jar

Am I missing something here or what's the correct  way to do ?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Xinh Huynh

Hi Ayman,

Have you looked at this:
http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where

It recommends defining a custom partitioner and (PairRDD) partitionBy
method to accomplish this.

Xinh

On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil  wrote:

> And btw, I'm using the Python API if this makes any difference.
>
> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
> wrote:
>
>> Hi Don,
>>
>> This didn't help. My original rdd is already created using 10 partitions.
>> As a matter of fact, after trying with rdd.coalesce(10, shuffle =
>> true) out of curiosity, the rdd partitions became even more imbalanced:
>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
>> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>
>>
>> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>>
>>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>>> be evenly balanced.  This obviously triggers a shuffle, so be advised it
>>> could be an expensive operation depending on your RDD size.
>>>
>>> -Don
>>>
>>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
>>> wrote:
>>>
 Hello,

 I have 50,000 items parallelized into an RDD with 10 partitions, I
 would like to evenly split the items over the partitions so:
 50,000/10 = 5,000 in each RDD partition.

 What I get instead is the following (partition index, partition count):
 [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
 5120), (7, 5120), (8, 5120), (9, 4944)]

 the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
 are imbalanced.

 Is there a way to do that?

 Thank you,
 Ayman

>>>
>>>
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake 
>>> 800-733-2143
>>>
>>
>>
>

Re: Evenly balance the number of items in each RDD partition

And btw, I'm using the Python API if this makes any difference.

On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
wrote:

> Hi Don,
>
> This didn't help. My original rdd is already created using 10 partitions.
> As a matter of fact, after trying with rdd.coalesce(10, shuffle =
> true) out of curiosity, the rdd partitions became even more imbalanced:
> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>
>
> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>
>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>> be evenly balanced.  This obviously triggers a shuffle, so be advised it
>> could be an expensive operation depending on your RDD size.
>>
>> -Don
>>
>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
>> wrote:
>>
>>> Hello,
>>>
>>> I have 50,000 items parallelized into an RDD with 10 partitions, I would
>>> like to evenly split the items over the partitions so:
>>> 50,000/10 = 5,000 in each RDD partition.
>>>
>>> What I get instead is the following (partition index, partition count):
>>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
>>> 5120), (7, 5120), (8, 5120), (9, 4944)]
>>>
>>> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
>>> are imbalanced.
>>>
>>> Is there a way to do that?
>>>
>>> Thank you,
>>> Ayman
>>>
>>
>>
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake 
>> 800-733-2143
>>
>
>

Re: Evenly balance the number of items in each RDD partition

Hi Don,

This didn't help. My original rdd is already created using 10 partitions.
As a matter of fact, after trying with rdd.coalesce(10, shuffle = true) out
of curiosity, the rdd partitions became even more imbalanced:
[(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
5120), (7, 5120), (8, 5120), (9, *6144*)]


On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:

> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
> be evenly balanced.  This obviously triggers a shuffle, so be advised it
> could be an expensive operation depending on your RDD size.
>
> -Don
>
> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
> wrote:
>
>> Hello,
>>
>> I have 50,000 items parallelized into an RDD with 10 partitions, I would
>> like to evenly split the items over the partitions so:
>> 50,000/10 = 5,000 in each RDD partition.
>>
>> What I get instead is the following (partition index, partition count):
>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
>> 5120), (7, 5120), (8, 5120), (9, 4944)]
>>
>> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
>> are imbalanced.
>>
>> Is there a way to do that?
>>
>> Thank you,
>> Ayman
>>
>
>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake 
> 800-733-2143
>

Spark crashes with Filesystem recovery

2016-05-10 Thread Imran Akbar

I have some Python code that consistently ends up in this state:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
690, in start
self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
690, in start
self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Traceback (most recent call last):
  File "", line 2, in 
  File "/home/ubuntu/spark/python/pyspark/sql/dataframe.py", line 280, in
collect
port = self._jdf.collectToPython()
  File "/home/ubuntu/spark/python/pyspark/traceback_utils.py", line 78, in
__exit__
self._context._jsc.setCallSite(None)
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
811, in __call__
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
624, in send_command
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
579, in _get_connection
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
585, in _create_connection
  File
"/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
697, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect
to the Java server

Even though I start pyspark with these options:
./pyspark --master local[4] --executor-memory 14g --driver-memory 14g
--packages com.databricks:spark-csv_2.11:1.4.0
--spark.deploy.recoveryMode=FILESYSTEM

and this in my /conf/spark-env.sh file:
- SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM
-Dspark.deploy.recoveryDirectory=/user/recovery"

How can I get HA to work in Spark?

thanks,
imran

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake

You can call rdd.coalesce(10, shuffle = true) and the returning rdd will be
evenly balanced.  This obviously triggers a shuffle, so be advised it could
be an expensive operation depending on your RDD size.

-Don

On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
wrote:

> Hello,
>
> I have 50,000 items parallelized into an RDD with 10 partitions, I would
> like to evenly split the items over the partitions so:
> 50,000/10 = 5,000 in each RDD partition.
>
> What I get instead is the following (partition index, partition count):
> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
> 5120), (7, 5120), (8, 5120), (9, 4944)]
>
> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
> are imbalanced.
>
> Is there a way to do that?
>
> Thank you,
> Ayman
>



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143

Hi test

Hello test

Pyspark accumulator

1. How come pyspark does not provide the localvalue function like scala ?

2. Why is pyspark more restrictive than scala ?

Re: Cluster Migration

Hi Deepak,
   Thanks for your response. If I am correct, you suggest reading all
of those files into an rdd on the cluster using wholeTextFiles then apply
compression codec on it, save the rdd to another Hadoop cluster?

Thank you,
Ajay

On Tuesday, May 10, 2016, Deepak Sharma  wrote:

> Hi Ajay
> You can look at wholeTextFiles method of rdd[string,string] and then map
> each of rdd  to saveAsTextFile .
> This will serve the purpose .
> I don't think if anything default like distcp exists in spark
>
> Thanks
> Deepak
> On 10 May 2016 11:27 pm, "Ajay Chander"  > wrote:
>
>> Hi Everyone,
>>
>> we are planning to migrate the data between 2 clusters and I see distcp
>> doesn't support data compression. Is there any efficient way to compress
>> the data during the migration ? Can I implement any spark job to do this ?
>>  Thanks.
>>
>

Re: Cluster Migration

I will try that out. Thank you!

On Tuesday, May 10, 2016, Deepak Sharma  wrote:

> Yes that's what I intended to say.
>
> Thanks
> Deepak
> On 10 May 2016 11:47 pm, "Ajay Chander"  > wrote:
>
>> Hi Deepak,
>>Thanks for your response. If I am correct, you suggest reading all
>> of those files into an rdd on the cluster using wholeTextFiles then apply
>> compression codec on it, save the rdd to another Hadoop cluster?
>>
>> Thank you,
>> Ajay
>>
>> On Tuesday, May 10, 2016, Deepak Sharma > > wrote:
>>
>>> Hi Ajay
>>> You can look at wholeTextFiles method of rdd[string,string] and then map
>>> each of rdd  to saveAsTextFile .
>>> This will serve the purpose .
>>> I don't think if anything default like distcp exists in spark
>>>
>>> Thanks
>>> Deepak
>>> On 10 May 2016 11:27 pm, "Ajay Chander"  wrote:
>>>
 Hi Everyone,

 we are planning to migrate the data between 2 clusters and I see distcp
 doesn't support data compression. Is there any efficient way to compress
 the data during the migration ? Can I implement any spark job to do this ?
  Thanks.

>>>

pyspark mappartions ()

Is there any example of this ? I want to see how you write the the iterable 
example

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma

Yes that's what I intended to say.

Thanks
Deepak
On 10 May 2016 11:47 pm, "Ajay Chander"  wrote:

> Hi Deepak,
>Thanks for your response. If I am correct, you suggest reading all
> of those files into an rdd on the cluster using wholeTextFiles then apply
> compression codec on it, save the rdd to another Hadoop cluster?
>
> Thank you,
> Ajay
>
> On Tuesday, May 10, 2016, Deepak Sharma  wrote:
>
>> Hi Ajay
>> You can look at wholeTextFiles method of rdd[string,string] and then map
>> each of rdd  to saveAsTextFile .
>> This will serve the purpose .
>> I don't think if anything default like distcp exists in spark
>>
>> Thanks
>> Deepak
>> On 10 May 2016 11:27 pm, "Ajay Chander"  wrote:
>>
>>> Hi Everyone,
>>>
>>> we are planning to migrate the data between 2 clusters and I see distcp
>>> doesn't support data compression. Is there any efficient way to compress
>>> the data during the migration ? Can I implement any spark job to do this ?
>>>  Thanks.
>>>
>>

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma

Hi Ajay
You can look at wholeTextFiles method of rdd[string,string] and then map
each of rdd  to saveAsTextFile .
This will serve the purpose .
I don't think if anything default like distcp exists in spark

Thanks
Deepak
On 10 May 2016 11:27 pm, "Ajay Chander"  wrote:

> Hi Everyone,
>
> we are planning to migrate the data between 2 clusters and I see distcp
> doesn't support data compression. Is there any efficient way to compress
> the data during the migration ? Can I implement any spark job to do this ?
>  Thanks.
>

Cluster Migration

Hi Everyone,

we are planning to migrate the data between 2 clusters and I see distcp
doesn't support data compression. Is there any efficient way to compress
the data during the migration ? Can I implement any spark job to do this ?
 Thanks.

Evenly balance the number of items in each RDD partition