Re: How to write a custom file system?

2016-11-21 Thread Samy Dindane
about side effects to any application including spark (memory consumption etc). On 21 Nov 2016, at 18:26, Samy Dindane <s...@dindane.com> wrote: Hi, I'd like to extend the file:// file system and add some custom logic to the API that lists files. I think I need to extend File

How to write a custom file system?

2016-11-21 Thread Samy Dindane
Hi, I'd like to extend the file:// file system and add some custom logic to the API that lists files. I think I need to extend FileSystem or LocalFileSystem from org.apache.hadoop.fs, but I am not sure how to go about it exactly. How to write a custom file system and make it usable by Spark?
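Hadoop maps a URI scheme to a `FileSystem` implementation through the `fs.<scheme>.impl` configuration key, and Spark forwards any `spark.hadoop.*`-prefixed setting to its Hadoop configuration. A minimal sketch of the wiring side, assuming a hypothetical implementation class (the class itself would extend `org.apache.hadoop.fs.LocalFileSystem` in Scala or Java):

```python
def custom_fs_conf(scheme, impl_class):
    """Build the spark.hadoop.* entries that map a URI scheme to a
    custom Hadoop FileSystem class. The class name is hypothetical."""
    return {
        f"spark.hadoop.fs.{scheme}.impl": impl_class,
        # Disable Hadoop's FileSystem cache for this scheme while iterating
        # on the class, so a stale instance is not reused across sessions.
        f"spark.hadoop.fs.{scheme}.impl.disable.cache": "true",
    }

conf = custom_fs_conf("myfs", "com.example.fs.FilteredListingFileSystem")
```

These entries would then be set on the `SparkConf` before creating the session, after which `myfs://` paths resolve through the custom class.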

Re: How to load only the data of the last partition

2016-11-18 Thread Samy Dindane
, you'll always get the latest version. Daniel On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane <s...@dindane.com <mailto:s...@dindane.com>> wrote: Hi, I have some data partitioned this way: /data/year=2016/month=9/version=0 /data/year=2016/month=10/version=0 /data/year=2

How to load only the data of the last partition

2016-11-17 Thread Samy Dindane
Hi, I have some data partitioned this way: /data/year=2016/month=9/version=0 /data/year=2016/month=10/version=0 /data/year=2016/month=10/version=1 /data/year=2016/month=10/version=2 /data/year=2016/month=10/version=3 /data/year=2016/month=11/version=0 /data/year=2016/month=11/version=1 When
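The selection logic for the layout above — keep, for each `(year, month)` partition, only the path with the highest `version` — can be sketched independently of Spark (one would then pass the surviving paths to the reader):

```python
import re

def latest_version_paths(paths):
    """For each (year, month) partition, keep only the path with the
    highest version number."""
    pat = re.compile(r"year=(\d+)/month=(\d+)/version=(\d+)")
    best = {}
    for p in paths:
        m = pat.search(p)
        if not m:
            continue
        year, month, version = (int(g) for g in m.groups())
        key = (year, month)
        if key not in best or version > best[key][0]:
            best[key] = (version, p)
    return [p for _, p in best.values()]
```

On the sample layout from the message this keeps `month=9/version=0`, `month=10/version=3`, and `month=11/version=1`.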

How to impersonate a user from a Spark program

2016-11-09 Thread Samy Dindane
Hi, In order to impersonate a user when submitting a job with `spark-submit`, the `proxy-user` option is used. Is there a similar feature when running a job inside a Scala program? Maybe by specifying some configuration value? Thanks. Samy
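One route that stays close to what the message already describes is to shell out to `spark-submit` with its documented `--proxy-user` flag from the driver program; a sketch of building that invocation (host, jar, and class names are placeholders):

```python
import subprocess

def submit_as(user, app_jar, main_class, master="spark://host:7077"):
    """Build a spark-submit invocation that impersonates `user`
    via the documented --proxy-user flag."""
    cmd = [
        "spark-submit",
        "--proxy-user", user,
        "--class", main_class,
        "--master", master,
        app_jar,
    ]
    return cmd  # e.g. subprocess.run(cmd, check=True)
```

For a fully in-process solution on the JVM side, Hadoop's `UserGroupInformation.doAs` is the usual mechanism, but whether it covers a given Spark setup depends on the cluster's security configuration.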

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
number does not change. BTW, how many kafka partitions are you using, and how many actually have data for a given batch? 3 partitions. All of them have more than maxRatePerPartition records (my topic has hundreds of millions of records). On Thu, Oct 13, 2016 at 4:33 AM, Samy Dindane <

Re: DataFrame API: how to partition by a "virtual" column, or by a nested column?

2016-10-13 Thread Samy Dindane
This partially answers the question: http://stackoverflow.com/a/35449563/604041 On 10/04/2016 03:10 PM, Samy Dindane wrote: Hi, I have the following schema: -root |-timestamp |-date |-year |-month |-day |-some_column |-some_other_column I'd like to achieve either of these: 1
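The linked Stack Overflow answer boils down to materializing top-level columns derived from the nested or timestamp field before calling `partitionBy`. The derivation itself, sketched without Spark (field names follow the schema in the message):

```python
from datetime import datetime, timezone

def partition_path(record, base="/data"):
    """Derive year/month/day partition values from a record's timestamp,
    mimicking withColumn("year", ...) etc. followed by
    partitionBy("year", "month", "day")."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return f"{base}/year={ts.year}/month={ts.month}/day={ts.day}"
```

In actual Spark code the same three expressions would be added with `withColumn` on the DataFrame, after which `partitionBy("year", "month", "day")` writes the layout shown above.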

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
fe.org/e730492453.png notice the cutover point On Wed, Oct 12, 2016 at 11:00 AM, Samy Dindane <s...@dindane.com> wrote: I am 100% sure. println(conf.get("spark.streaming.backpressure.enabled")) prints true. On 10/12/2016 05:48 PM, Cody Koeninger wrote: Just to make 100% sure, d

How to prevent having more than one instance of a specific job running on the cluster

2016-10-12 Thread Samy Dindane
Hi, I'd like a specific job to fail if there's another instance of it already running on the cluster (Spark Standalone in my case). How to achieve this? Thank you.
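Spark Standalone has no built-in job mutex, so one common approach is a driver-side lock acquired at startup; the job fails fast if the lock is already held. A sketch using an advisory file lock (single-machine only — a cluster-wide guarantee would need a shared coordination service such as ZooKeeper):

```python
import fcntl
import os

def acquire_single_instance_lock(lock_path):
    """Fail fast if another instance of the job already holds the lock file.
    The returned fd must stay open for the lifetime of the job."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError(f"another instance holds {lock_path}")
    return fd
```

The lock is released automatically when the process exits, so a crashed run does not block the next one.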

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
I am 100% sure. println(conf.get("spark.streaming.backpressure.enabled")) prints true. On 10/12/2016 05:48 PM, Cody Koeninger wrote: Just to make 100% sure, did you set spark.streaming.backpressure.enabled to true? On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane <s...@dinda

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
Koeninger wrote: http://spark.apache.org/docs/latest/configuration.html "This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane <s...@

Re: What happens when an executor crashes?

2016-10-12 Thread Samy Dindane
rite fileA and fileB, because they already have correct data and offsets. You just write fileC. Then once you've recovered you go on about your job as normal, starting at topic-0 offsets 60, topic-1 offsets 66 Clear as mud? On Mon, Oct 10, 2016 at 5:36 PM, Samy Dindane <s...@dindane.com> wr
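The recovery rule Cody describes — rewrite nothing that was already committed, produce only the missing file, then resume from the offsets it ends at — can be sketched as a small plan step (file names and offset ranges are illustrative, ending at the topic-0/60, topic-1/66 resume point mentioned above):

```python
def resume_plan(planned, completed):
    """Given planned output files mapped to the Kafka offset ranges they
    cover, and the set of files already written before the crash, return
    only the work still to do: already-committed files are never rewritten."""
    return {f: rng for f, rng in planned.items() if f not in completed}

planned = {
    "fileA": {"topic-0": (0, 20), "topic-1": (0, 22)},
    "fileB": {"topic-0": (20, 40), "topic-1": (22, 44)},
    "fileC": {"topic-0": (40, 60), "topic-1": (44, 66)},
}
todo = resume_plan(planned, completed={"fileA", "fileB"})
```

After `fileC` is written, normal processing resumes at the upper bounds of its ranges.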

Limit Kafka batches size with Spark Streaming

2016-10-11 Thread Samy Dindane
Hi, Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them. Thank you. Samy
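The cap discussed later in the thread, `spark.streaming.kafka.maxRatePerPartition`, is expressed in records per second per partition, so the per-batch bound is a simple product. A sketch of the arithmetic, using the 3 partitions mentioned in the thread and otherwise illustrative numbers:

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_s):
    """Upper bound on records in one micro-batch when
    spark.streaming.kafka.maxRatePerPartition is set (direct stream):
    rate (records/s/partition) x partitions x batch interval (s)."""
    return max_rate_per_partition * num_partitions * batch_interval_s

# 3 partitions, a 10 s batch interval, maxRatePerPartition=10000:
# each batch is capped at 300000 records.
cap = max_records_per_batch(10000, 3, 10)
```

Working backwards from a target batch size to a `maxRatePerPartition` value is the same formula rearranged.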

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
preciate your help. Thanks a lot. On Mon, Oct 10, 2016 at 12:12 PM, Samy Dindane <s...@dindane.com> wrote: I just noticed that you're the author of the code I linked in my previous email. :) It's helpful. When using `foreachPartition` or `mapPartitions`, I noticed I can't ask Spark to write t

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
you. Samy On 10/10/2016 04:58 PM, Samy Dindane wrote: Hi Cody, I am writing a spark job that reads records from a Kafka topic and writes them on the file system. This would be straightforward if it weren't for the custom checkpointing logic I want to have; Spark's checkpointing doesn't suit us