Challenges with Datasource V2 API

2019-06-25 Thread Sunita Arvind
to surface the problem. Can someone review the code and tell me if I am doing something wrong? regards Sunita

Re: a way to allow spark job to continue despite task failures?

2018-01-24 Thread Sunita Arvind
for failed tasks were done, other tasks completed. You can set it to a higher or lower value depending on how many more tasks you have and how long they take to complete. regards Sunita On Fri, Nov 13, 2015 at 4:50 PM, Ted Yu <yuzhih...@gmail.com> wrote: > I searched the code base and
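
The snippet does not name the setting; a minimal sketch, assuming the knob under discussion is spark.task.maxFailures (the number of attempts per task before the job is failed):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tolerate-task-failures")
      .set("spark.task.maxFailures", "8") // default is 4; a higher value tolerates more retries
    val sc = new SparkContext(conf)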

Re: Chaining Spark Streaming Jobs

2017-11-02 Thread Sunita Arvind
(UnsupportedOperationChecker.scala:297) regards Sunita On Mon, Sep 18, 2017 at 10:15 AM, Michael Armbrust <mich...@databricks.com> wrote: > You specify the schema when loading a dataframe by calling > spark.read.schema(...)... > > On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind <
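
A minimal sketch of the schema call Michael describes (column names here are made up; file-based streaming sources require an explicit schema):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("schema-example").getOrCreate()
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("payload", StringType, nullable = true)))
    val df = spark.readStream.schema(schema).parquet("/tmp/input") // placeholder path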

Change the owner of hdfs file being saved

2017-11-02 Thread Sunita Arvind
use case. Is there a way to change the owner of files written by Spark? regards Sunita
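
Spark itself does not expose an owner setting; one workaround (an assumption, not from the thread) is to chown the output after the write via the Hadoop FileSystem API, which generally requires HDFS superuser rights:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val out = new Path("/data/output") // placeholder path
    fs.listStatus(out).foreach(s => fs.setOwner(s.getPath, "targetUser", "targetGroup"))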

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread Sunita Arvind
> On 13 Sept 2017 at 01:51, "Sunita Arvind" <sunitarv...@gmail.com> wrote: > > Hi Michael, > > I am wondering what I am doing wrong. I get an error like: > > Exception in thread "main" java.lang.IllegalArgumentException: Schema > must be specified when

Re: Chaining Spark Streaming Jobs

2017-09-12 Thread Sunita Arvind
e.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:222) While running on the EMR cluster all paths poi

Re: Chaining Spark Streaming Jobs

2017-09-08 Thread Sunita Arvind
Thanks for your response, Praneeth. We did consider Kafka; however, cost was the only holdback factor, as we might need a larger cluster, and the existing cluster is on premise while my app is in the cloud. So the same cluster cannot be used. But I agree it does sound like a good alternative. Regards Sunita

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Sunita Arvind
Thanks for your response Michael Will try it out. Regards Sunita On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust <mich...@databricks.com> wrote: > If you use structured streaming and the file sink, you can have a > subsequent stream read using the file source. This will main
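
A minimal sketch of the suggestion, assuming placeholder paths and a Kafka-backed first stage: job 1 writes with the file sink, and job 2 (possibly a separate application) reads the same directory with the file source:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("chained-streams").getOrCreate()
    val input = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092") // placeholder
      .option("subscribe", "events")                  // placeholder topic
      .load()

    // Job 1: file sink
    val q1 = input.writeStream
      .format("parquet")
      .option("path", "/data/stage1")
      .option("checkpointLocation", "/data/stage1-ckpt")
      .start()

    // Job 2: file source over the same directory (needs an explicit schema)
    val stage2 = spark.readStream
      .schema(input.schema)
      .parquet("/data/stage1")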

Chaining Spark Streaming Jobs

2017-08-21 Thread Sunita Arvind
to be error prone. When either of the jobs gets delayed due to bursts or any error/exception, this could lead to huge data losses and non-deterministic behavior. What are the other alternatives to this? Appreciate any guidance in this regard. regards Sunita Koppar

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
as parquet with null in the numeric fields. Is there a workaround to it? I need to be able to allow null values for numeric fields. Thanks in advance. regards Sunita
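
One possible workaround (an assumption, not confirmed in the thread) is to make the numeric columns explicitly nullable in the schema before writing Parquet; on the Avro side the equivalent is a union type such as ["null", "double"]:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("nullable-parquet").getOrCreate()
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("score", DoubleType, nullable = true))) // nullable numeric field
    val rows = spark.sparkContext.parallelize(Seq(Row("a", 1.5), Row("b", null)))
    spark.createDataFrame(rows, schema).write.parquet("/tmp/out") // placeholder path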

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
sure I am not doing overkill or overlooking a potential issue. regards Sunita On Tue, Oct 25, 2016 at 2:38 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > The error in the file I just shared is here: > > val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/"

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Thanks for the response Sean. I have seen the NPE on similar issues very consistently and assumed that could be the reason :) Thanks for clarifying. regards Sunita On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen <so...@cloudera.com> wrote: > This usage is fine, because you are o

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
create the dataframe in main, you can register it as a table and run the queries in the main method itself. You don't need to coalesce or run the method within foreach. Regards Sunita On Tuesday, October 25, 2016, Ajay Chander <itsche...@gmail.com> wrote: > > Jeff, Thanks for your response.
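
A minimal sketch of this advice in the 1.x API the thread is using (source path and query are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object Main {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("query-in-main"))
        val hiveContext = new HiveContext(sc)
        val df = hiveContext.read.json("/tmp/input.json")
        df.registerTempTable("records")
        // Query in main itself; no coalesce or foreach needed.
        hiveContext.sql("SELECT COUNT(*) FROM records").show()
      }
    }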

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
eeper") df.saveAsParquetFile(conf.getString("ParquetOutputPath")+offsetSaved) LogHandler.log.info("Created the parquet file") } Thanks Sunita On Tue, Oct 25, 2016 at 2:11 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Attached is the edi

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Sunita On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Thanks for confirming Cody. > To get to use the library, I had to do: > > val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"), > "/consumers/topics/"+ t
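
A rough sketch of the pattern being discussed, not the thread's actual ZooKeeperOffsetsStore, written against Apache Curator (connection string and paths are placeholders): persist the offset per topic/partition under a fixed znode and read it back on restart.

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry

    val zk = CuratorFrameworkFactory.newClient("zkhost:2181",
      new ExponentialBackoffRetry(1000, 3))
    zk.start()

    def saveOffset(topic: String, partition: Int, offset: Long): Unit = {
      val path = s"/consumers/topics/$topic/$partition"
      if (zk.checkExists().forPath(path) == null)
        zk.create().creatingParentsIfNeeded().forPath(path)
      zk.setData().forPath(path, offset.toString.getBytes("UTF-8"))
    }

    def readOffset(topic: String, partition: Int): Option[Long] = {
      val path = s"/consumers/topics/$topic/$partition"
      if (zk.checkExists().forPath(path) == null) None
      else Some(new String(zk.getData.forPath(path), "UTF-8").toLong)
    }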

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
I want the library to pick up all the partitions for a topic without me specifying the path. Is that possible out of the box, or do I need to tweak it? regards Sunita On Tue, Oct 25, 2016 at 12:08 PM, Cody Koeninger <c...@koeninger.org> wrote: > You are correct that you shouldn't have to worry about broker id.

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Just re-read the Kafka architecture. Something that slipped my mind is that it is leader based, so the topic/partitionId pair will be the same on all the brokers, and we do not need to consider the brokerId while storing offsets. Still exploring the rest of the items. regards Sunita On Tue, Oct 25, 2016 at 11:09 AM

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
not considering brokerIds while storing offsets, and the OffsetRanges probably does not have them either. It can only provide topic, partition, and from/until offsets. I am probably missing something very basic. Probably the library works well by itself. Can someone / Cody explain? Cody, thanks a lot for

Spark writing to elasticsearch asynchronously

2016-09-21 Thread Sunita Arvind
Hello Experts, Is there a way to get Spark to write to Elasticsearch asynchronously? Below are the details: http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously regards Sunita
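
One way to overlap the writes with processing (a sketch under assumptions: saveToEs itself is synchronous, and sendBulk is a hypothetical helper that would POST one batch to the Elasticsearch _bulk endpoint):

    import org.apache.spark.rdd.RDD
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    def sendBulk(batch: Seq[String]): Unit = ??? // hypothetical bulk POST

    def writeAsync(rdd: RDD[String]): Unit = rdd.foreachPartition { docs =>
      val futures = docs.grouped(500).map(b => Future(sendBulk(b))).toList
      // Block before the task finishes so no in-flight writes are dropped.
      Await.result(Future.sequence(futures), 10.minutes)
    }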

Increasing spark.yarn.executor.memoryOverhead degrades performance

2016-07-18 Thread Sunita Arvind
interesting observation is that bringing down the executor memory to 5 GB with executor memoryOverhead set to 768 showed significant performance gains. What are the other associated settings? regards Sunita
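
For reference, a sketch of the combination described, equivalent to the --executor-memory and --conf flags on spark-submit:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "5g")
      .set("spark.yarn.executor.memoryOverhead", "768") // value is in MB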

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread Sunita Arvind
Thank you for your inputs. Will test it out and share my findings On Thursday, July 14, 2016, CosminC wrote: > Didn't have the time to investigate much further, but the one thing that > popped out is that partitioning was no longer working on 1.6.1. This would > definitely

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Sunita
I am facing the same issue. Upgrading to Spark 1.6 is causing a huge performance loss. Could you solve this issue? I am also attempting the memory settings mentioned at http://spark.apache.org/docs/latest/configuration.html#memory-management but it's not making a lot of difference. Appreciate your inputs
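
Two knobs worth trying on 1.6 (an assumption, not confirmed in the thread; the values shown are the 1.6 defaults):

    import org.apache.spark.SparkConf

    // Option A: revert to the pre-1.6 static memory manager
    val legacy = new SparkConf().set("spark.memory.useLegacyMode", "true")

    // Option B: keep the 1.6 unified manager and retune its fractions
    val tuned = new SparkConf()
      .set("spark.memory.fraction", "0.75")
      .set("spark.memory.storageFraction", "0.5")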

Maintain complete state for updateStateByKey

2016-07-06 Thread Sunita Arvind
trying to figure out if I can use the (iterator: Iterator[(K, Seq[V], Option[S])]) signature but haven't figured it out yet. Appreciate any suggestions in this regard. regards Sunita P.S.: I am aware of mapWithState but am not on the latest version as of now.
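
For contrast, a minimal sketch of the simpler per-key updateStateByKey signature (types and names are placeholders), keeping the complete history of values per key; note the state grows without bound:

    import org.apache.spark.streaming.dstream.DStream

    def updateFunc(newValues: Seq[Long], state: Option[Seq[Long]]): Option[Seq[Long]] =
      Some(state.getOrElse(Seq.empty) ++ newValues) // retain every value seen so far

    // Note: updateStateByKey requires a checkpoint directory (ssc.checkpoint(...)).
    def track(pairs: DStream[(String, Long)]): DStream[(String, Seq[Long])] =
      pairs.updateStateByKey(updateFunc)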

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
distribution data sets. Mentioning it here for the benefit of anyone else stumbling upon the same issue. regards Sunita On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Hello Experts, > > I am getting this error repeatedly: > > 16

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
r.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 38 more 16/06/23 11:09:53 INFO SparkContext: Invoking stop() from shutdown hook I've tried kafka version 0.8.2.0, 0.8.2.2, 0.9.0.0. With 0.9.0.0 the processing hangs much sooner. Can someone help with this error? rega

NullPointerException when starting StreamingContext

2016-06-22 Thread Sunita Arvind
ssc.awaitTermination() } } } I also tried putting all the initialization directly in main (not using method calls for initializeSpark and createDataStreamFromKafka) and also not putting in foreach and creating a single spark and streaming context. However, the error persists. Appreciate any help. regards Sunita

Seeking advice on realtime querying over JDBC

2016-06-02 Thread Sunita Arvind
or do I need to have a HiveContext in order to see the tables registered via the Spark application through JDBC? regards Sunita
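
One way to expose tables registered inside a Spark application over JDBC (a sketch, assuming a HiveContext and an existing SparkContext sc; table and path are placeholders) is to start the thrift server inside the same application and point a JDBC client such as beeline at it:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    hiveContext.read.parquet("/data/events").registerTempTable("events")
    HiveThriftServer2.startWithContext(hiveContext)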

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Thanks for the clarification, Michael, and good luck with Spark 2.0. It really looks promising. I am especially interested in the adhoc queries aspect. Probably that is what is being referred to as Continuous SQL in the slides. What is the timeframe for availability of this functionality? regards Sunita

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
in 2.1 or later only regards Sunita On Fri, May 6, 2016 at 1:06 PM, Michael Malak <michaelma...@yahoo.com> wrote: > At first glance, it looks like the only streaming data sources available > out of the box from the github master branch are > https://github.com/apache/spark/blob/mast

Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
to ensure it works for our use cases. Can someone point me to relevant material for this? regards Sunita

Spark SQL - Registerfunction throwing MissingRequirementError in JavaMirror with primordial classloader

2015-04-26 Thread Sunita Arvind
of years is 10 Within 10 years is true () Appreciate any direction from the community. regards Sunita Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C
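
For reference, a minimal registration sketch (names are made up; in Spark 1.3+ the call is sqlContext.udf.register, earlier releases used registerFunction, and sc is an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.udf.register("withinYears", (years: Int) => years <= 10)
    sqlContext.sql("SELECT withinYears(numOfYears) FROM policies") // placeholder table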

Unable to broadcast dimension tables with Spark SQL

2015-02-16 Thread Sunita Arvind
Exchange (HashPartitioning [education#18], 200) ParquetTableScan [education#18,education_desc#19], (ParquetRelation C:/Sunita/eclipse/workspace/branch/trial/plsresources/plsbuyer/cg_pq_cdw_education, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn
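
The plan above shows a shuffle Exchange rather than a broadcast; the usual knob to check (an assumption, not confirmed in the thread) is the broadcast threshold, which must exceed the dimension table's size in bytes:

    // sqlContext is an existing SQLContext/HiveContext (assumed)
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (50 * 1024 * 1024).toString) // 50 MB, up from the default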

Is pair rdd join more efficient than regular rdd

2015-02-01 Thread Sunita Arvind
of effort for us to try this approach and weigh the performance, as we need to register the output as tables to proceed using them. Hence would appreciate inputs from the community before proceeding. Regards Sunita Koppar
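
For reference, a minimal sketch of the pair-RDD join being weighed; joins are only defined on key/value RDDs, so regular RDDs must first be keyed (data here is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pair-join"))
    val orders = sc.parallelize(Seq((1, "order-a"), (2, "order-b"))) // (customerId, order)
    val names  = sc.parallelize(Seq((1, "alice"), (2, "bob")))       // (customerId, name)
    val joined = orders.join(names) // RDD[(Int, (String, String))], shuffles by key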

Re: Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-21 Thread Sunita Arvind
I was able to resolve this by adding rdd.collect() after every stage. This enforced RDD evaluation and helped avoid the choke point. regards Sunita Koppar On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind sunitarv...@gmail.com wrote: Hi, My spark jobs suddenly started getting hung and here

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Sunita Arvind
names. The Spark SQL wiki has good examples for this. Looks easier to manage to me than your solution below. Agree with you on the fact that when there are a lot of columns, even one row.getString() call is not convenient. Regards Sunita On Tuesday, January 20, 2015, Night Wolf nightwolf...@gmail.com
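
A sketch of the column-name style being recommended over bare ordinals (getAs-by-name exists from Spark 1.3 on; names and types here are made up, and sqlContext is assumed):

    val people = sqlContext.sql("SELECT name, age FROM people")
    val pairs = people.map { row =>
      (row.getAs[String]("name"), row.getAs[Int]("age"))
    }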

Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-17 Thread Sunita Arvind
will edit it and post. regards Sunita

Transform SchemaRDDs into new SchemaRDDs

2014-12-08 Thread Sunita Arvind
(SQLContext.scala:94) at croevss.StageJoin$.vsswf(StageJoin.scala:162) at croevss.StageJoin$.main(StageJoin.scala:41) at croevss.StageJoin.main(StageJoin.scala) regards Sunita Koppar

Re: Spark setup on local windows machine

2014-12-02 Thread Sunita Arvind
) regards Sunita On Tue, Nov 25, 2014 at 11:47 PM, Sameer Farooqui same...@databricks.com wrote: Hi Sunita, This gitbook may also be useful for you to get Spark running in local mode on your Windows machine: http://blueplastic.gitbooks.io/how-to-light-your-spark-on-a-stick/content/ On Tue, Nov 25

Spark setup on local windows machine

2014-11-25 Thread Sunita Arvind
your help. regards Sunita

GraphX usecases

2014-08-25 Thread Sunita Arvind
-tolerance. mean that GraphX makes the typical RDBMS operations possible even when the data is persisted in a GDBMS, and not vice versa? regards Sunita

Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-07-03 Thread Sunita Arvind
-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ Romain On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind sunitarv...@gmail.com wrote: Hello Experts, I am attempting to integrate Spark Editor with Hue