Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread François Pelletier
You should aggregate your files into larger chunks before doing anything
else. HDFS is not well suited to small files; they bloat the NameNode and
cause a lot of performance issues. Target a partition size of a few
hundred MB, save those files back to HDFS, and then delete the original
ones. You can read the files, coalesce, and then call saveAsXXX on the
result.

I had the same kind of problem once and solved it by bunching hundreds
of files together into larger ones. I used text files with bzip2
compression.
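
A minimal sketch of that approach, assuming hypothetical HDFS paths and a
coalesce count you would tune until each output file lands around a few
hundred MB (this needs a running Spark environment, so adjust to your
cluster):

```scala
import org.apache.hadoop.io.compress.BZip2Codec
import org.apache.spark.{SparkConf, SparkContext}

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-small-files"))

    // Read the many small files as a single RDD of lines.
    val lines = sc.textFile("hdfs:///path/to/small_files")

    // coalesce (no shuffle) down to a handful of partitions, then write
    // them back bzip2-compressed; once verified, delete the originals.
    lines
      .coalesce(16)
      .saveAsTextFile("hdfs:///path/to/compacted", classOf[BZip2Codec])

    sc.stop()
  }
}
```

The paths and the count of 16 are examples only; the point is that
coalesce merges partitions without a shuffle, so the write produces a few
large files instead of thousands of tiny ones.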



On 2015-10-20 08:42, Sean Owen wrote:
> Coalesce without a shuffle? It shouldn't be an action. It just treats
> many partitions as one.
>
> On Tue, Oct 20, 2015 at 1:00 PM, t3l wrote:
>
>
> I have a dataset consisting of 5 binary files (each between 500kb and
> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of
> the cluster are also the workers for Spark. I open the files as an
> RDD using sc.binaryFiles("hdfs:///path_to_directory"). When I run the
> first action that involves this RDD, Spark spawns an RDD with more
> than 30000 partitions, and it takes ages to process these partitions
> even if you simply run "count". Performing a "repartition" directly
> after loading does not help, because Spark seems to insist on
> materializing the RDD created by binaryFiles first.
>
> How can I get around this?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/A-Spark-creates-3-partitions-What-can-I-do-tp25140.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>



Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier

look at
spark.history.ui.port, if you use standalone
spark.yarn.historyServer.address, if you use YARN

in your Spark config file

Mine is located at
/etc/spark/conf/spark-defaults.conf

If you use Apache Ambari you can find this settings in the Spark /
Configs / Advanced spark-defaults tab
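
As a minimal sketch, the relevant lines in spark-defaults.conf might look
like this; the log directory is a hypothetical example, and the keys shown
assume the history server reads application event logs from HDFS:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-history
spark.history.fs.logDirectory    hdfs:///spark-history
spark.history.ui.port            18080
```

With event logging enabled, finished applications stay browsable on the
history server port (18080 by default) after the driver exits.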

François

On 2015-08-07 15:58, saif.a.ell...@wellsfargo.com wrote:
>
> Hello, thank you, but that port is unreachable for me. Can you please
> share where I can find the equivalent port setting in my environment?
>
>  
>
> Thank you
>
> Saif
>
>  
>
> *From:*François Pelletier [mailto:newslett...@francoispelletier.org]
> *Sent:* Friday, August 07, 2015 4:38 PM
> *To:* user@spark.apache.org
> *Subject:* Re: Spark master driver UI: How to keep it after process
> finished?
>
>  
>
> Hi, all spark processes are saved in the Spark History Server
>
> look at your host on port 18080 instead of 4040
>
> François
>
> On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
>
> Hi,
>
>  
>
> A silly question here. The Driver Web UI dies when the
> spark-submit program finishes. I would like some time to analyze
> after the program ends, as the page does not refresh itself, and
> when I hit F5 I lose all the info.
>
>  
>
> Thanks,
>
> Saif
>
>  
>
>  
>



Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier
Hi, all spark processes are saved in the Spark History Server

look at your host on port 18080 instead of 4040

François

On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
> Hi,
>  
> A silly question here. The Driver Web UI dies when the spark-submit
> program finishes. I would like some time to analyze after the program
> ends, as the page does not refresh itself, and when I hit F5 I lose
> all the info.
>  
> Thanks,
> Saif
>