Re: Spark running slow for small hadoop files of 10 mb size

2014-04-24 Thread neeravsalaria
Thanks for the reply. Repartitioning indeed increased the cluster utilization. There was
another issue we found: we had been broadcasting the Hadoop configuration by writing our
own wrapper class over it, but then found the proper way in the Spark code:

sc.broadcast(new SerializableWritable(conf))
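
For context, here is a minimal sketch of how such a broadcast can be used inside tasks. It
assumes the Spark version of this thread (0.9/1.0), where org.apache.spark.SerializableWritable
is still public (later releases restrict it); the master URL and the use of the configuration
inside the task are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.{SparkContext, SerializableWritable}

object BroadcastConfDemo {
  def main(args: Array[String]): Unit = {
    // "local[*]" and the app name stand in for whatever cluster setup is in use
    val sc = new SparkContext("local[*]", "broadcast-conf-demo")

    val hadoopConf = new Configuration()  // the Hadoop configuration to share with executors
    val confBc = sc.broadcast(new SerializableWritable(hadoopConf))

    sc.parallelize(1 to 4).foreach { _ =>
      // the outer .value reads the broadcast variable on the executor,
      // the inner .value unwraps the Configuration from the SerializableWritable
      val taskConf: Configuration = confBc.value.value
      val fs = FileSystem.get(taskConf)  // e.g. talk to HDFS from within the task
      println(fs.getUri)
    }

    sc.stop()
  }
}

Broadcasting the configuration once avoids re-serializing it with every task closure, which
is the point of the fix above.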







Re: Spark running slow for small hadoop files of 10 mb size

2014-04-22 Thread Andre Bois-Crettez

By default, the data partitioning is done *according to the number of HDFS
blocks* of the source.
You can change the partitioning with .repartition, either to increase or
decrease the level of parallelism:

val recordsRDD =
  sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256)
val recordsRDDInParallel = recordsRDD.repartition(4 * 32)
val infoRDD = recordsRDDInParallel.map(f => info_func())
val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48)  /* makes 48 partitions */
hdfs_RDD.saveAsNewAPIHadoopFile
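
A quick way to confirm the effect is to print the partition counts (a minimal sketch,
reusing the RDD names from the snippet above):

println(recordsRDD.partitions.length)            // whatever the input format actually produced
println(recordsRDDInParallel.partitions.length)  // 128, i.e. 4 * 32 after repartition
println(hdfs_RDD.partitions.length)              // 48, as requested in reduceByKey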



André
On 2014-04-21 13:21, neeravsalaria wrote:

Hi,

   I have been using MapReduce to analyze multiple files whose sizes range
from 10 MB to 200 MB per file. Recently I planned to move to Spark, but my
Spark job takes too much time to process a single file when the file size
is 10 MB and the HDFS block size is 64 MB. It runs on a single datanode and
on a single core (my cluster is a 4-node setup, each node having 32 cores).
Each file has 3 million rows, and I have to analyze every row (none are
ignored) and create a set of info from it.

Isn't there a way to parallelize the processing of the file, either on
other nodes or by using the remaining cores of the same node?



Demo code:

  val recordsRDD =
    sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256)  /* 256 = minimum partitions, to parallelize */

  val infoRDD = recordsRDD.map(f => info_func())  /* info_func() returns a (key, value) pair */

  val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48)  /* makes 48 partitions */

  hdfs_RDD.saveAsNewAPIHadoopFile  /* output path and format arguments omitted */







--
André Bois-Crettez

Software Architect
Big Data Developer
http://www.kelkoo.com/




Spark running slow for small hadoop files of 10 mb size

2014-04-21 Thread neeravsalaria
Hi, 

  I have been using MapReduce to analyze multiple files whose sizes range
from 10 MB to 200 MB per file. Recently I planned to move to Spark, but my
Spark job takes too much time to process a single file when the file size
is 10 MB and the HDFS block size is 64 MB. It runs on a single datanode and
on a single core (my cluster is a 4-node setup, each node having 32 cores).
Each file has 3 million rows, and I have to analyze every row (none are
ignored) and create a set of info from it.

Isn't there a way to parallelize the processing of the file, either on
other nodes or by using the remaining cores of the same node?
 


Demo code:

  val recordsRDD =
    sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256)  /* 256 = minimum partitions, to parallelize */

  val infoRDD = recordsRDD.map(f => info_func())  /* info_func() returns a (key, value) pair */

  val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48)  /* makes 48 partitions */

  hdfs_RDD.saveAsNewAPIHadoopFile  /* output path and format arguments omitted */
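
For reference, here is a fuller sketch of the same pipeline that compiles end to end. The
master URL, the input and output paths, the record parsing, and the Text/LongWritable
output types are hypothetical stand-ins for the actual info_func and output format,
assuming a Spark 1.x-style setup:

import org.apache.hadoop.io.{BytesWritable, LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey, saveAsNewAPIHadoopFile

object SmallFileDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "small-file-demo")  // placeholder master and app name

    // 256 is only a *minimum* number of partitions; a 10 MB file inside a single
    // HDFS block can still end up in one partition, hence the repartition below.
    val recordsRDD = sc.sequenceFile[NullWritable, BytesWritable]("hdfs:///path/to/input", 256)

    // Spread the records across the cluster: 4 nodes * 32 cores = 128 partitions.
    val recordsInParallel = recordsRDD.repartition(4 * 32)

    // Hypothetical stand-in for info_func: it must return a (key, value) pair
    // so that reduceByKey can aggregate per key.
    val infoRDD = recordsInParallel.map { case (_, bytes) =>
      val row = new String(bytes.copyBytes, "UTF-8")
      (row.takeWhile(_ != ','), 1L)  // e.g. count rows per first field
    }

    val hdfs_RDD = infoRDD.reduceByKey(_ + _, 48)  // 48 output partitions

    // saveAsNewAPIHadoopFile needs the output path plus key, value and
    // OutputFormat classes; Text/LongWritable are chosen only for this example.
    hdfs_RDD
      .map { case (k, v) => (new Text(k), new LongWritable(v)) }
      .saveAsNewAPIHadoopFile(
        "hdfs:///path/to/output",
        classOf[Text],
        classOf[LongWritable],
        classOf[SequenceFileOutputFormat[Text, LongWritable]])

    sc.stop()
  }
}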


