unsubscribe

2016-10-02 Thread Nikos Viorres

How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-28 Thread Nikos Viorres
Hi, I am using KafkaUtils.createRDD to retrieve data from Kafka for batch processing. When invoking KafkaUtils.createRDD with an OffsetRange where OffsetRange.fromOffset == OffsetRange.untilOffset for a particular partition, I get an empty RDD. The documentation is clear that until is exclusive and
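The behaviour follows from the exclusive upper bound. A minimal sketch of the arithmetic (OffsetRangeSketch here is an illustrative stand-in, not the real Spark class, though the real OffsetRange exposes the same count = until - from semantics):

```scala
// Illustrative model: Spark's OffsetRange treats untilOffset as
// exclusive, so a range covers untilOffset - fromOffset messages.
// fromOffset == untilOffset therefore covers zero messages, which is
// why createRDD returns an empty RDD for such a range.
case class OffsetRangeSketch(topic: String, partition: Int,
                             fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset // exclusive upper bound
}

// To read exactly the single message at offset n, request [n, n + 1):
def singleMessage(topic: String, partition: Int, n: Long): OffsetRangeSketch =
  OffsetRangeSketch(topic, partition, n, n + 1)

// With the real spark-streaming-kafka API this corresponds roughly to:
//   val range = OffsetRange(topic, partition, n, n + 1)
//   KafkaUtils.createRDD[K, V, KD, VD](sc, kafkaParams, Array(range))
```

So a "single available message" is fetched with untilOffset = fromOffset + 1, never fromOffset == untilOffset.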

Re: DataFrame.write().partitionBy(some_column).parquet(path) produces OutOfMemory with very few items

2015-07-16 Thread Nikos Viorres
Repartition the data by partition columns first. Cheng On 7/15/15 7:05 PM, Nikos Viorres wrote: Hi, I am trying to test partitioning for DataFrames with Parquet, so I attempted to do df.write().partitionBy(some_column).parquet(path) on a small dataset of 20,000 records which when saved
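The suggested fix works because without a prior repartition, every write task may hold an open Parquet writer (with its column buffers) for every distinct value of the partition column at once; repartitioning by that column first routes each value to a single task. A plain-Scala simulation of that mechanism (task counts and values are illustrative; the real fix is a one-liner like df.repartition(col("some_column")), available as a Column-based overload in later Spark releases):

```scala
// Simulate how many concurrent writers (distinct partition values) a
// task must hold, with and without repartitioning by the column.
val rows = (1 to 20000).map(i => i % 50) // 50 distinct partition values
val numTasks = 8

// Round-robin placement (roughly what partitionBy sees with no prior
// repartition): each task receives a mix of many partition values.
val roundRobin = rows.zipWithIndex.groupBy { case (_, idx) => idx % numTasks }
val writersRoundRobin =
  roundRobin.values.map(_.map(_._1).distinct.size).max

// Hash-partitioned by the column (df.repartition(col("some_column"))):
// each distinct value lands wholly on one task.
val byColumn = rows.groupBy(v => math.floorMod(v.hashCode, numTasks))
val writersByColumn = byColumn.values.map(_.distinct.size).max

// writersByColumn is far smaller than writersRoundRobin, so each task
// buffers far fewer open Parquet writers and memory pressure drops.
```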

DataFrame.write().partitionBy(some_column).parquet(path) produces OutOfMemory with very few items

2015-07-15 Thread Nikos Viorres
Hi, I am trying to test partitioning for DataFrames with Parquet, so I attempted to do df.write().partitionBy(some_column).parquet(path) on a small dataset of 20,000 records which when saved as Parquet locally with gzip takes 4 MB of disk space. However, on my dev machine with

Re: updateStateByKey performance / API

2015-03-18 Thread Nikos Viorres
Hi Akhil, Yes, that's what we are planning on doing at the end of the day. At the moment I am doing performance testing before the job hits production, testing on 4 cores to get baseline figures, and deduced that in order to grow to 10-15 million keys we'll need a batch interval of ~20 secs

updateStateByKey performance / API

2015-03-18 Thread Nikos Viorres
Hi all, We are having a few issues with the performance of updateStateByKey operation in Spark Streaming (1.2.1 at the moment) and any advice would be greatly appreciated. Specifically, on each tick of the system (which is set at 10 secs) we need to update a state tuple where the key is the

updateStateByKey performance

2015-03-17 Thread Nikos Viorres
Hi all, We are having a few issues with the performance of updateStateByKey operation in Spark Streaming (1.2.1 at the moment) and any advice would be greatly appreciated. Specifically, on each tick of the system (which is set at 10 secs) we need to update a state tuple where the key is the