Re: Splittable or not?

2022-09-19 Thread Jack Goodson
When reading in Gzip files, I’ve always read them into a data frame and then
written them out to Parquet/Delta more or less in their raw form, and then used
those files for my transformations, as the workloads become parallelisable once
the data is in split files. Because Gzip is not splittable, each Gzip file is
read by a single task, so you are limited by the memory available to that task;
you may need an initial iterative step if all your Gzips cannot be processed in
one go (this may require some experimentation)
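
For illustration, a minimal sketch of that convert-first approach (the paths,
input format and options here are hypothetical, assuming line-delimited JSON
inside the Gzips):

// Read the non-splittable Gzips (one partition per file) and stage them as
// Parquet so the downstream transformations can parallelise properly.
val raw = spark.read.json("s3://my-bucket/raw/*.json.gz")
raw.write.mode("overwrite").parquet("s3://my-bucket/staged/raw_parquet")

// Later transformations read the splittable Parquet copy instead.
val staged = spark.read.parquet("s3://my-bucket/staged/raw_parquet")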

If you don’t want the intermediate step of writing the files out, you can read
the Gzip contents on the driver and distribute the records with
SparkContext.parallelize, but note that this loads the data into driver memory
first.
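
A rough sketch of that alternative (the path is hypothetical; the whole file is
decompressed on the driver, so it has to fit in driver memory):

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// Materialise the decompressed lines on the driver...
val lines = Source.fromInputStream(
  new GZIPInputStream(new FileInputStream("/data/raw/file1.json.gz"))
).getLines().toList
// ...then distribute them to the executors.
val rdd = spark.sparkContext.parallelize(lines)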

Hope this helps 




Re: Splittable or not?

2022-09-19 Thread Sid
Cool. Thanks, everyone, for the reply.



Re: Splittable or not?

2022-09-17 Thread Enrico Minack
If with "won't affect the performance" you mean "parquet is splittable 
though it uses snappy", then yes. Splittable files allow for optimal 
parallelization, which "won't affect performance".


When Spark writes data, it already splits the data into multiple files (here,
parquet files). Even if each individual file were not splittable, your data
would already have been split. Splittable parquet files allow for more
granularity (more splitting of your data) in case those files are big.
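
A quick way to see both effects (a sketch with a hypothetical DataFrame and path):

// Spark writes one file per output partition, so a job typically already
// produces many part-*.snappy.parquet files.
df.write.parquet("/tmp/events_parquet")

// Reading them back, the number of input partitions reflects both the file
// count and any further splitting of large files at row-group boundaries.
val back = spark.read.parquet("/tmp/events_parquet")
println(back.rdd.getNumPartitions)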


Enrico





Re: Splittable or not?

2022-09-14 Thread Sid
Okay, so you mean to say that parquet compresses the denormalized data using
snappy, so it won't affect the performance.

Using snappy on its own (without a splittable container format like parquet)
would affect the performance.

Am I correct?



Re: Splittable or not?

2022-09-14 Thread Amit Joshi
Hi Sid,

Snappy itself is not splittable. But a container format that holds the actual
data, such as parquet (which is basically divided into row groups), can still
be compressed using snappy.
This works because the blocks (pages) of the parquet format inside the file are
compressed independently with snappy.
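
As a small sketch (the DataFrame and path are made up), the codec can also be
set explicitly on the write, and the result stays splittable at row-group
boundaries:

// snappy is the usual default parquet codec, but it can be set per write.
df.write
  .option("compression", "snappy")   // pages inside the parquet file are compressed
  .parquet("/tmp/sales_parquet")

// The data is still read in parallel across part files / row groups.
spark.read.parquet("/tmp/sales_parquet").count()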

Thanks
Amit



Splittable or not?

2022-09-14 Thread Sid
Hello experts,

I know that Gzip and snappy files are not splittable, i.e. the data won't be
distributed across multiple blocks; instead each file is loaded into a single
partition/block.

So, my question is: when I write parquet data via Spark, it gets stored at the
destination with names like part*.snappy.parquet.

So, when I read this data back, will it affect my performance?
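
One way to check this empirically (a sketch, with hypothetical paths):

// A gzipped CSV file is loaded as a single partition per file...
println(spark.read.csv("/data/big.csv.gz").rdd.getNumPartitions)       // usually 1

// ...whereas a sizeable snappy-compressed parquet dataset splits into many.
println(spark.read.parquet("/data/big_parquet").rdd.getNumPartitions)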

Please help me if there is any understanding gap.

Thanks,
Sid


Re: Processing a splittable file from a single executor

2017-11-16 Thread Jeroen Miller
On 16 Nov 2017, at 10:22, Michael Shtelma wrote:
> you call repartition(1) before starting processing your files. This
> will ensure that you end up with just one partition.

One question and one remark:

Q) val ds = sqlContext.read.parquet(path).repartition(1)

Can I be absolutely sure that my file here is read by a single executor and that
no data shuffling takes place afterwards to get that single partition?

R) This approach did not work for me.

val ds = sqlContext.read.parquet(path).repartition(1)

// ds on a single partition

ds.createOrReplaceTempView("ds")

val result = sqlContext.sql("... from ds")

// result on 166 partitions... How to force the processing on a
// single executor?

result.write.csv(...)

// 166 files :-/
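
For what it's worth, a sketch of two common ways to get a single output file
here, reusing the ds/result names from above:

// Option 1: coalesce only for the final write.
result.coalesce(1).write.csv(...)

// Option 2: make this query's shuffles produce a single partition
// (helps when the 166 partitions come from a shuffle in the SQL).
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
val singleResult = sqlContext.sql("... from ds")
singleResult.write.csv(...)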

Jeroen





Processing a splittable file from a single executor

2017-11-16 Thread Jeroen Miller
Dear Sparkers,

A while back, I asked how to process non-splittable files in parallel, one file 
per executor. Vadim's suggested "scheduling within an application" approach 
worked out beautifully.
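
For readers unfamiliar with that pattern, a minimal sketch (the file names and
per-file processing are hypothetical): each input file gets its own Spark job,
and the jobs are submitted concurrently from the driver so they run side by
side within a single application.

// "Scheduling within an application": one job per file, submitted in parallel.
// (.par is available directly on Scala 2.11/2.12 collections.)
val files = Seq("/in/a.json.gz", "/in/b.json.gz", "/in/c.json.gz")
files.par.foreach { path =>
  sqlContext.read.json(path)            // each non-splittable file -> one task
    .write.parquet(s"/out/${path.split('/').last}")
}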

I am now facing the 'opposite' problem:

 - I have a bunch of parquet files to process
 - Once processed I need to output a /single/ file for each input file
 - When I read a parquet file, it gets partitioned over several executors
 - If I want a single output file, I would need to coalesce(1) with potential
   performance issues.

Since my files are relatively small, a single file could be handled by a single 
executor, and several files could be read in parallel, one for each executor.

My question is: how to force my parquet file to be read by a single executor, 
without repartitioning or coalescing of course.

Regards,

Jeroen





Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-31 Thread Hyukjin Kwon
Hm... as I said here,
https://github.com/databricks/spark-csv/issues/245#issuecomment-177682354,
it sounds reasonable in a way, though to me this looks like a fairly narrow
use-case.

How about using csvRdd()?
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L143-L162

I think you can do it like below:


import com.databricks.spark.csv.CsvParser

// Build a splittable RDD of lines using the LZO-aware input format...
val rdd = sc.newAPIHadoopFile("/file.csv.lzo",
    classOf[com.hadoop.mapreduce.LzoTextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text])
  .map { case (_, line) => line.toString }   // csvRdd() expects an RDD[String]

// ...then let spark-csv parse it into a DataFrame.
val df = new CsvParser()
  .csvRdd(sqlContext, rdd)





Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Well, looking at the src it looks like it's not implemented:

https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36








Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Hello,

I have managed to speed up the read stage when loading CSV files using the
classic "newAPIHadoopFile" method. The issue is that I would like to use the
spark-csv package, and it seems that it is not taking the LZO index file /
splittable reads into consideration.

# Using the classic method the read is fully parallelized (splittable)
sc.newAPIHadoopFile("/user/sy/data.csv.lzo", ...).count

# When spark-csv is used the file is read only from one node (no splittable reads)
sqlContext.read.format("com.databricks.spark.csv").options(Map("path" ->
"/user/sy/data.csv.lzo", "header" -> "true", "inferSchema" ->
"false")).load().count()

Does anyone know if this is currently supported?




