Re: Spark-csv- partitionBy

2016-05-10 Thread Xinh Huynh
Hi Pradeep,

Here is a way to partition your data into different files by calling
repartition() on the DataFrame:
df.repartition(12, $"Month")
  .write
  .format(...)

This assumes you want to partition by a "Month" column that has 12 distinct
values. Each partition will be written as a separate file (but in the same
folder).
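
Fleshed out a bit, that approach might look roughly like the following (the
delimiter option and output path are only placeholders, and this assumes the
spark-csv package is on the classpath):

    import org.apache.spark.sql.functions.col

    // Co-locate all rows with the same Month value in the same partition,
    // then write each partition out as its own part file under one folder.
    df.repartition(12, col("Month"))
      .write
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")          // tab-delimited output (placeholder)
      .save("hdfs:///output/by_month/")   // illustrative path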

Xinh

On Tue, May 10, 2016 at 2:10 AM, Mail.com  wrote:

> Hi,
>
> I don't want to reduce partitions; the files should be written depending on
> the column value.
>
> I'm trying to understand how reducing the number of partitions will make this
> work.
>
> Regards,
> Pradeep
>
> On May 9, 2016, at 6:42 PM, Gourav Sengupta 
> wrote:
>
> Hi,
>
> It's supported; try to use coalesce(1) (the spelling may be wrong) and after
> that do the partitions.
>
> Regards,
> Gourav
>
> On Mon, May 9, 2016 at 7:12 PM, Mail.com  <
> pradeep.mi...@mail.com> wrote:
>
>> Hi,
>>
>> I have to write a tab-delimited file and need to have one directory for
>> each unique value of a column.
>>
>> I tried using spark-csv with partitionBy, and it seems it is not supported.
>> Is there any other option available for doing this?
>>
>> Regards,
>> Pradeep
>>
>>
>


Re: Spark-csv- partitionBy

2016-05-10 Thread Mail.com
Hi,

I don't want to reduce partitions; the files should be written depending on the
column value.

I'm trying to understand how reducing the number of partitions will make this work.

Regards,
Pradeep

> On May 9, 2016, at 6:42 PM, Gourav Sengupta  wrote:
> 
> Hi,
> 
> It's supported; try to use coalesce(1) (the spelling may be wrong) and after
> that do the partitions.
> 
> Regards,
> Gourav
> 
>> On Mon, May 9, 2016 at 7:12 PM, Mail.com  wrote:
>> Hi,
>> 
>> I have to write a tab-delimited file and need to have one directory for each
>> unique value of a column.
>>
>> I tried using spark-csv with partitionBy, and it seems it is not supported. Is
>> there any other option available for doing this?
>> 
>> Regards,
>> Pradeep
> 


Re: Spark-csv- partitionBy

2016-05-09 Thread Gourav Sengupta
Hi,

It's supported; try to use coalesce(1) (the spelling may be wrong) and after
that do the partitions.
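
A rough sketch of what I have in mind (df, the column name, and the output path
are only illustrative, and whether the csv source honors partitionBy may depend
on the version you are on):

    // Collapse to a single partition first, then ask the writer to lay out
    // one directory per distinct value of the partition column.
    df.coalesce(1)
      .write
      .partitionBy("col1")
      .format("com.databricks.spark.csv")
      .save("hdfs:///output/")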

Regards,
Gourav

On Mon, May 9, 2016 at 7:12 PM, Mail.com  wrote:

> Hi,
>
> I have to write a tab-delimited file and need to have one directory for each
> unique value of a column.
>
> I tried using spark-csv with partitionBy, and it seems it is not supported. Is
> there any other option available for doing this?
>
> Regards,
> Pradeep
>
>


Spark-csv- partitionBy

2016-05-09 Thread Mail.com
Hi,

I have to write a tab-delimited file and need to have one directory for each
unique value of a column.

I tried using spark-csv with partitionBy, and it seems it is not supported. Is
there any other option available for doing this?

Regards,
Pradeep



spark-csv partitionBy

2016-02-09 Thread Srikanth
Hello,

I want to save a Spark job's result as LZO-compressed CSV files partitioned by
one or more columns.
Given that partitionBy is not supported by spark-csv, is there any
recommendation for achieving this in user code?

One quick option is to
  i) cache the result DataFrame,
  ii) get the unique partition keys, and
  iii) iterate over the keys and filter the result for each key:

    // Cache the result so it is not recomputed for every partition key.
    rawDF.cache()

    // Collect the distinct partition keys to the driver.
    val idList = rawDF.select($"ID").distinct.collect().toList.map(_.getLong(0))

    // Write one output directory per key by filtering the cached DataFrame.
    idList.foreach { id =>
      val rows = rawDF.filter($"ID" === id)
      rows.write
        .format("com.databricks.spark.csv")
        .save(s"hdfs:///output/id=$id/")
    }

This approach doesn't scale well, especially since the number of unique IDs can
be between 500 and 700, and adding a second partition column will make it even
worse.

Wondering if anyone has an efficient workaround?

Srikanth