Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Deenar Toraskar
Daniel

Can you elaborate on why you are using a broadcast variable to concatenate
many Avro files into a single ORC file? Have a look at wholeTextFiles on
SparkContext.

SparkContext.wholeTextFiles lets you read a directory containing multiple
small text files, and returns each of them as (filename, content) pairs.
This is in contrast with textFile, which would return one record per line
in each file.

You can then process this RDD in parallel across the cluster, convert it to a
DataFrame and save it as an ORC file.
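A minimal sketch of that approach, assuming Spark 1.x with a HiveContext available; the input/output paths and the per-file parsing step are placeholders (wholeTextFiles yields the file contents as text, so binary Avro would actually need sc.binaryFiles plus an Avro DatumReader):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AvroDirToOrc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AvroDirToOrc"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // (filename, content) pairs -- one pair per file, read in parallel
    // across the cluster rather than on the driver
    val files = sc.wholeTextFiles("hdfs:///path/to/avro-dir")

    // Placeholder: decode each file's content into your record type here.
    val records = files.map { case (name, content) => (name, content.length) }

    // Convert to a DataFrame and write out a single ORC dataset
    records.toDF("filename", "size").write.format("orc")
      .save("hdfs:///path/to/orc-out")
  }
}
```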

Regards
Deenar


Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
I agree, but it's a constraint I have to deal with.
The idea is to load these files and merge them into ORC.
When using Hive on Tez it takes less than a minute.

Daniel

> On 22 Sep 2015, at 16:00, Jonathan Coveney  wrote:
> 
> having a file per record is pretty inefficient on almost any file system
> 
> El martes, 22 de septiembre de 2015, Daniel Haviv 
>  escribió:
>> Hi,
>> We are trying to load around 10k avro files (each file holds only one 
>> record) using spark-avro but it takes over 15 minutes to load.
>> It seems that most of the work is being done at the driver, where it creates 
>> a broadcast variable for each file.
>> 
>> Any idea why it is behaving that way?
>> Thank you.
>> Daniel


RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
Your performance problem sounds like it is in the driver, which is trying to 
broadcast 10k files by itself and becomes the bottleneck.
What you want is simply to transform the data per file from Avro to another 
format. In MR, most likely each mapper processes one file, so you utilize the 
whole cluster instead of just the driver.
Not sure exactly how to help you, but to do that in Spark:
1) Disable the broadcast from the driver and let each task in Spark process 
one file. Maybe use something like Hadoop's NLineInputFormat over a file that 
lists all the filenames of your data, so each Spark task receives the HDFS 
location of one file and then starts the transform logic. In this case, you 
concurrently transform all your small files using all the available cores of 
your executors.
2) If the above sounds too complex, you need to find a way to disable 
broadcasting small files from the Spark driver. That sounds like the normal 
way to handle small files, but I cannot find a configuration to force Spark 
to disable it.
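A rough sketch of idea 1), assuming the HDFS paths of the 10k files have been written one per line into a listing file; the listing path, partition count, and the per-file transform are all placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformPerFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TransformPerFile"))

    // One line per input file. With NLineInputFormat each task would get a
    // fixed number of lines; sc.textFile over the listing with an explicit
    // partition count gives a similar spread of work across executors.
    val fileList = sc.textFile("hdfs:///path/to/file-listing.txt", 200)

    // Each task opens and transforms its own files -- the driver never
    // touches the file contents, so there is nothing to broadcast.
    fileList.foreachPartition { paths =>
      paths.foreach { path =>
        // Placeholder: open `path` with an Avro reader and write ORC here.
        println(s"would transform $path")
      }
    }
  }
}
```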
Yong

From: daniel.ha...@veracity-group.com
Subject: Re: spark-avro takes a lot time to load thousands of files
Date: Tue, 22 Sep 2015 16:54:26 +0300
CC: user@spark.apache.org
To: jcove...@gmail.com

I agree, but it's a constraint I have to deal with. The idea is to load these 
files and merge them into ORC. When using Hive on Tez it takes less than a 
minute.
Daniel
On 22 Sep 2015, at 16:00, Jonathan Coveney  wrote:

having a file per record is pretty inefficient on almost any file system

El martes, 22 de septiembre de 2015, Daniel Haviv 
 escribió:
Hi,
We are trying to load around 10k Avro files (each file holds only one 
record) using spark-avro, but it takes over 15 minutes to load. It seems that 
most of the work is being done at the driver, where it creates a broadcast 
variable for each file.
Any idea why it is behaving that way? Thank you.
Daniel

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Jonathan Coveney
having a file per record is pretty inefficient on almost any file system

El martes, 22 de septiembre de 2015, Daniel Haviv <
daniel.ha...@veracity-group.com> escribió:

> Hi,
> We are trying to load around 10k avro files (each file holds only one
> record) using spark-avro but it takes over 15 minutes to load.
> It seems that most of the work is being done at the driver, where it
> creates a broadcast variable for each file.
>
> Any idea why it is behaving that way?
> Thank you.
> Daniel
>