Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-07 Thread shyla deshpande
Thanks TD. On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das wrote: > I don't think there is any easier way. > > On Mon, Aug 7, 2017 at 7:32 PM, shyla deshpande > wrote: > >> Thanks TD for the response. I forgot to mention that I am not using

Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-07 Thread Tathagata Das
I don't think there is any easier way. On Mon, Aug 7, 2017 at 7:32 PM, shyla deshpande wrote: > Thanks TD for the response. I forgot to mention that I am not using > structured streaming. > > I was looking into KafkaUtils.createRDD, and it looks like I need to get the >

Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-07 Thread shyla deshpande
Thanks TD for the response. I forgot to mention that I am not using structured streaming. I was looking into KafkaUtils.createRDD, and it looks like I need to get the earliest and the latest offset for each partition to build the Array[OffsetRange]. I wanted to know if there was an easier way. 1
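A minimal sketch of that approach, assuming the spark-streaming-kafka-0-10 connector and a plain KafkaConsumer to look up the per-partition offsets; the broker address, topic name and group id below are placeholders:

    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "batch-reader",                       // placeholder group id
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topic = "my-topic"                                 // placeholder topic

    // Use a plain consumer to list the partitions and their earliest/latest offsets.
    val consumer = new KafkaConsumer[String, String](kafkaParams.asJava)
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(topic, p.partition))
    val earliest = consumer.beginningOffsets(partitions.asJava).asScala
    val latest = consumer.endOffsets(partitions.asJava).asScala
    consumer.close()

    // One OffsetRange per partition, covering everything currently in the topic.
    val offsetRanges = partitions.map(tp => OffsetRange(tp, earliest(tp), latest(tp))).toArray

    // sc is the SparkContext of the batch job.
    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)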

Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-07 Thread Tathagata Das
It's best to use DataFrames. You can read from Kafka either as a stream or as a batch. More details here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
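From that page, a batch read of everything currently in a topic with the DataFrame API looks roughly like this (broker address and topic name are placeholders; spark is a SparkSession):

    // Batch query: reads whatever is in the topic now; no streaming query needed.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("subscribe", "my-topic")                       // placeholder topic
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // key and value arrive as binary; cast them as needed.
    val parsed = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")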

KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-07 Thread shyla deshpande
Hi all, What is the easiest way to read all the data from Kafka in a batch program for a given topic? I have 10 Kafka partitions, but the data is not much. I would like to read from the earliest offset on all the partitions of a topic. I appreciate any help. Thanks

Re: [SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-07 Thread Michael Armbrust
I think there is really no good reason for this limitation. On Mon, Aug 7, 2017 at 2:58 AM, Jacek Laskowski wrote: > Hi, > > While exploring checkpointing with kafka source and console sink I've > got the exception: > > // today's build from the master > scala> spark.version >

[spark-core] Choosing the correct number of partitions while joining two RDDs with partitioner set on one

2017-08-07 Thread Piyush Narang
hi folks, I was debugging a Spark job that ended up with too few partitions during the join step and thought I'd reach out to understand if this is the right behavior / what typical workarounds are. I have two RDDs that I'm joining. One with a lot of partitions (5K+) and one with much lesser
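One common workaround is to pass join an explicit partitioner (or partition count) instead of relying on the default partitioner choice; a rough sketch, where big and small are placeholder pair RDDs:

    import org.apache.spark.HashPartitioner

    // Passing a partitioner (or a number of partitions) to join overrides the default
    // choice, so the output keeps ~5000 partitions instead of the smaller side's count.
    val joined = big.join(small, new HashPartitioner(5000))

    // Equivalent shorthand using just a partition count:
    val joinedAlt = big.join(small, 5000)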

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread shyla deshpande
I am doing that already for all known messy data. Thanks Cody for all your time and input. On Mon, Aug 7, 2017 at 11:58 AM, Cody Koeninger wrote: > Yes > > On Mon, Aug 7, 2017 at 12:32 PM, shyla deshpande > wrote: > > Thanks Cody again. > > > > No.

Spark sample submitted with cluster deploy-mode does not work in Standalone

2017-08-07 Thread ctang
Spark sample submitted with cluster deploy-mode (./bin/run-example --master spark://localhost:6066 --deploy-mode=cluster SparkPi 10) throws the following error; does anyone know what the problem is? == 17/08/07 16:27:39 ERROR RestSubmissionClient: Exception from the cluster:

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread Cody Koeninger
Yes. On Mon, Aug 7, 2017 at 12:32 PM, shyla deshpande wrote: > Thanks Cody again. > > No. I am mapping the Kafka ConsumerRecord to be able to save it in > the Cassandra table; saveToCassandra is an action and my data does get > saved into Cassandra. It is

Re: tuning - Spark data serialization for cache() ?

2017-08-07 Thread Ofir Manor
Thanks a lot for the quick pointer! So, is the advice I linked to in the official Spark 2.2 documentation misleading? You are saying that Spark 2.2 does not use Java serialization? And the tip to switch to Kryo is also outdated? Ofir Manor Co-Founder & CTO | Equalum Mobile: +972-54-7801286 |

Re: tuning - Spark data serialization for cache() ?

2017-08-07 Thread Kazuaki Ishizaki
For DataFrame (and Dataset), cache() already uses fast serialization/deserialization with data compression schemes. We have already identified some performance issues regarding cache(). We are working on alleviating these issues in https://issues.apache.org/jira/browse/SPARK-14098. We expect that

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread shyla deshpande
Thanks Cody again. No. I am mapping the Kafka ConsumerRecord to be able to save it in the Cassandra table; saveToCassandra is an action and my data does get saved into Cassandra. It is working as expected 99% of the time, except that when there is an exception, I did not want the

tuning - Spark data serialization for cache() ?

2017-08-07 Thread Ofir Manor
Hi, I'm using Spark 2.2, and have a big batch job using DataFrames (with built-in, basic types). It references the same intermediate DataFrame multiple times, so I wanted to try to cache() that and see if it helps, both in memory footprint and performance. Now, the Spark 2.2 tuning page (
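For reference, the pattern in question, caching one intermediate DataFrame and reusing it downstream, sketched with made-up column names and paths:

    import spark.implicits._

    // Cache the intermediate result once, reuse it in several computations, then
    // release the memory. For Datasets/DataFrames, cache() defaults to MEMORY_AND_DISK.
    val intermediate = spark.read.parquet("/path/to/input")   // placeholder input
      .filter($"amount" > 0)                                  // placeholder column
      .cache()

    val byDay  = intermediate.groupBy($"day").count()
    val byUser = intermediate.groupBy($"user").count()

    byDay.write.parquet("/path/to/by_day")                    // placeholder outputs
    byUser.write.parquet("/path/to/by_user")

    intermediate.unpersist()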

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread Cody Koeninger
If literally all you are doing is rdd.map I wouldn't expect saveToCassandra to happen at all, since map is not an action. Filtering for unsuccessful attempts and collecting those back to the driver would be one way for the driver to know whether it was safe to commit. On Mon, Aug 7, 2017 at
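For reference, the commit-after-output shape being discussed, roughly as in the spark-streaming-kafka-0-10 integration guide; stream is assumed to be the direct Kafka DStream from the thread, and toRow, the keyspace and the table name are placeholders:

    import com.datastax.spark.connector._   // spark-cassandra-connector, as used in the thread
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // Capture the offset ranges before transforming the RDD.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // The save is the action, so this is where the batch is actually written out.
      rdd.map(record => toRow(record.value)).saveToCassandra("my_keyspace", "my_table")

      // Commit the offsets only after the write above completed without throwing.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }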

Spark 2.1 table loaded from Hive Metastore has null values

2017-08-07 Thread Shmuel Blitz
(Also asked on SO at https://stackoverflow.com/q/45543140/416300) I am trying to migrate table definitions from one Hive metastore to another. The source cluster has: - Spark 1.6.0 - Hive 1.1.0 (cdh) - HDFS The destination cluster is an EMR cluster with: - Spark 2.1.1 - Hive

[SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-07 Thread Jacek Laskowski
Hi, While exploring checkpointing with a Kafka source and console sink I got this exception: // today's build from the master scala> spark.version res8: String = 2.3.0-SNAPSHOT scala> val q = records. | writeStream. | format("console"). | option("truncate", false). |
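For context, a sketch of the kind of query in question, a console sink started with an explicit checkpointLocation; records is assumed to be a streaming Dataset (e.g. from a Kafka source) and the path is a placeholder. Restarting such a query against the existing checkpoint directory is what triggers the behavior reported above:

    val q = records.writeStream
      .format("console")
      .option("truncate", false)
      .option("checkpointLocation", "/tmp/console-checkpoint")  // placeholder path
      .start()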
