Challenges with Datasource V2 API

2019-06-25 Thread Sunita Arvind
Hello Spark Experts, I am having challenges using the DataSource V2 API. I created a mock source. The input partitions seem to be created correctly. The output below confirms that: 19/06/23 16:00:21 INFO root: createInputPartitions 19/06/23 16:00:21 INFO root: Create a partition for abc The
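Since the snippet cuts off before the failing part, here is a minimal sketch of a Spark 2.4-era DataSource V2 mock source that would produce that kind of log output; MockSource, MockPartition, and the single "abc" partition are hypothetical stand-ins, not the poster's actual code:

    import java.util.{List => JList}
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.unsafe.types.UTF8String

    class MockSource extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader = new DataSourceReader {
        override def readSchema(): StructType = StructType(Seq(StructField("value", StringType)))
        // This is the hook behind the "createInputPartitions" log line above.
        override def planInputPartitions(): JList[InputPartition[InternalRow]] =
          Seq[InputPartition[InternalRow]](new MockPartition("abc")).asJava
      }
    }

    // One partition that emits a single row; real sources return one of these per split.
    class MockPartition(key: String) extends InputPartition[InternalRow] {
      override def createPartitionReader(): InputPartitionReader[InternalRow] =
        new InputPartitionReader[InternalRow] {
          private var consumed = false
          override def next(): Boolean = { val hasNext = !consumed; consumed = true; hasNext }
          override def get(): InternalRow = InternalRow(UTF8String.fromString(key))
          override def close(): Unit = ()
        }
    }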

Re: a way to allow spark job to continue despite task failures?

2018-01-24 Thread Sunita Arvind
Had a similar situation and landed on this question. I was finally able to make it do what I needed by cheating the Spark driver :) i.e. by setting a very high value for "--conf spark.task.maxFailures=800". I deliberately set it to 800; the default is typically 4. So by the time 800 attempts for failed
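For reference, the workaround amounts to a single flag (the class and jar names below are placeholders):

    spark-submit \
      --conf spark.task.maxFailures=800 \
      --class com.example.MyJob myjob.jar

Note this only papers over the failures; every retried task still consumes scheduler time.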

Re: Chaining Spark Streaming Jobs

2017-11-02 Thread Sunita Arvind
(UnsupportedOperationChecker.scala:297) regards Sunita On Mon, Sep 18, 2017 at 10:15 AM, Michael Armbrust <mich...@databricks.com> wrote: > You specify the schema when loading a dataframe by calling > spark.read.schema(...)... > > On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind <
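The suggested fix looks like this in code (the field names and path are invented for illustration):

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))
    // Streaming file sources require an explicit schema, hence the exception above.
    val df = spark.readStream.schema(schema).parquet("s3a://bucket/stage1")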

Change the owner of hdfs file being saved

2017-11-02 Thread Sunita Arvind
Hello Experts, I am required to use a specific user id to save files on a remote hdfs cluster. Remote in the sense that the Spark jobs run on EMR and write to a CDH cluster, so I cannot change hdfs-site.xml etc. to point to the destination cluster. As a result I am using webhdfs to save the files
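A hedged sketch of what writing as a specific user over webhdfs can look like with the Hadoop FileSystem API; the namenode host, port, path, and user below are placeholders:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // The third argument makes Hadoop act as this user against the remote cluster.
    val fs = FileSystem.get(new URI("webhdfs://cdh-namenode:50070"), new Configuration(), "etl_user")
    val out = fs.create(new Path("/data/landing/part-00000"))
    out.write("payload".getBytes("UTF-8"))
    out.close()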

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread Sunita Arvind
> Le 13 sept. 2017 01:51, "Sunita Arvind" <sunitarv...@gmail.com> a écrit : > > Hi Michael, > > I am wondering what I am doing wrong. I get error like: > > Exception in thread "main" java.lang.IllegalArgumentException: Schema > must be specified when

Re: Chaining Spark Streaming Jobs

2017-09-12 Thread Sunita Arvind
point to S3. On my laptop, they all point to the local filesystem. I am using Spark 2.2.0. Appreciate your help. regards Sunita On Wed, Aug 23, 2017 at 2:30 PM, Michael Armbrust <mich...@databricks.com> wrote: > If you use structured streaming and the file sink, you can have a > subsequent stream rea
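Michael's file-sink suggestion, sketched (the paths, the input DataFrame, and stage1Schema are placeholders):

    // Job 1: write results to a directory with the file sink.
    val q1 = input.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/stage1")
      .option("checkpointLocation", "s3a://bucket/chk1")
      .start()

    // Job 2: stream from that directory; the schema must be given explicitly.
    val stage2Input = spark.readStream
      .schema(stage1Schema)
      .parquet("s3a://bucket/stage1")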

Re: Chaining Spark Streaming Jobs

2017-09-08 Thread Sunita Arvind
second job in the chain is only a > few mins, you may have to end up creating a new file every few mins > > You may want to consider Kafka as your intermediary store for building a > chain/DAG of streaming jobs > > On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind <sunitarv...@gmail.
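The Kafka-as-intermediary alternative, sketched (broker, topic, checkpoint names, and the df DataFrame are placeholders):

    // Job 1 publishes its output to a topic instead of files...
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "stage1-output")
      .option("checkpointLocation", "s3a://bucket/chk-kafka")
      .start()

    // ...and job 2 subscribes to it, so file-rotation frequency no longer matters.
    val stage2 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "stage1-output")
      .load()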

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Sunita Arvind
maintain exactly > once processing even if there are hiccups or failures. > > On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarv...@gmail.com> > wrote: > >> Hello Spark Experts, >> >> I have a design question w.r.t Spark Streaming. I have a streaming job >>

Chaining Spark Streaming Jobs

2017-08-21 Thread Sunita Arvind
Hello Spark Experts, I have a design question w.r.t Spark Streaming. I have a streaming job that consumes protocol-buffer-encoded real-time logs from an on-premise Kafka cluster. My Spark application runs on EMR (AWS) and persists data onto S3. Before I persist, I need to strip the header and convert

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
Hello Experts, I am trying to allow null values in numeric fields. Here are the details of the issue I have: http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields I also tried making all columns nullable by using the below function (from one of
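The referenced helper is cut off; a hedged reconstruction of the usual shape of such a function (rebuild the schema with every field marked nullable, then re-apply it to the same rows) is:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    def setAllNullable(df: DataFrame): DataFrame = {
      // Copy each field with nullable = true, then rebuild the DataFrame over the same RDD.
      val schema = StructType(df.schema.map(_.copy(nullable = true)))
      df.sqlContext.createDataFrame(df.rdd, schema)
    }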

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
sure I am not doing overkill or overlooking a potential issue. regards Sunita On Tue, Oct 25, 2016 at 2:38 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > The error in the file I just shared is here: > > val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/"

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Thanks for the response Sean. I have seen the NPE on similar issues very consistently and assumed that could be the reason :) Thanks for clarifying. regards Sunita On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen wrote: > This usage is fine, because you are only using the

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Ajay, AFAIK these contexts generally cannot be accessed within loops. The SQL query itself runs on distributed datasets, so it is already a parallel execution; putting it inside a foreach would nest one distributed operation within another, which is where serialization becomes hard. Not sure I explained it right. If you can
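A sketch of the distinction being made (hiveContext, the rdd, and the table are placeholders):

    // Broken: the context lives on the driver, but this closure runs on executors.
    // rdd.foreach { row => hiveContext.sql(s"... ${row.getString(0)} ...") }

    // Workable: bring a small result back to the driver and loop there instead.
    val keys = rdd.map(_.getString(0)).distinct().collect()
    keys.foreach { k => hiveContext.sql(s"SELECT count(*) FROM events WHERE k = '$k'") }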

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
eeper") df.saveAsParquetFile(conf.getString("ParquetOutputPath")+offsetSaved) LogHandler.log.info("Created the parquet file") } Thanks Sunita On Tue, Oct 25, 2016 at 2:11 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Attached is the edi

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Sunita On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Thanks for confirming Cody. > To get to use the library, I had to do: > > val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"), > "/consumers/topics/"+ t

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
> > I'm honestly not sure specifically what else you are asking at this point. > > On Tue, Oct 25, 2016 at 1:39 PM, Sunita Arvind <sunitarv...@gmail.com> > wrote: > > Just re-read the kafka architecture. Something that slipped my mind is, > it > > is leader based

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
, Sunita Arvind <sunitarv...@gmail.com> wrote: > Hello Experts, > > I am trying to use the saving to ZK design. Just saw Sudhir's comments > that it is old approach. Any reasons for that? Any issues observed with > saving to ZK. The way we are planning to use it is

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Hello Experts, I am trying to use the saving-to-ZK design. I just saw Sudhir's comments that it is an old approach. Any reasons for that? Any issues observed with saving to ZK? The way we are planning to use it is: 1. Following http://aseigneurin.github.io/2016/05/07/spark-kafka-
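The overall shape of the pattern from that post, heavily hedged: ZooKeeperOffsetsStore is the blog's helper (its constructor appears verbatim in the follow-ups above, but the method names below are approximations), and the stream, zkHosts, and topic names are placeholders:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val offsetsStore = new ZooKeeperOffsetsStore(zkHosts, "/consumers/topics/" + topic)
    // 1. On startup, read the last committed offsets from ZK and resume from there.
    // 2. After each batch is fully persisted, write the new offsets back.
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process and persist the batch first ...
      offsetsStore.saveOffsets(ranges)   // record progress only after a successful write
    }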

Spark writing to elasticsearch asynchronously

2016-09-21 Thread Sunita Arvind
Hello Experts, Is there a way to get spark to write to elasticsearch asynchronously? Below are the details http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously regards Sunita
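One hedged way to decouple the write, assuming the elasticsearch-hadoop connector: submit the save from a separate thread so the driver can move on (the index name and df are placeholders):

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.elasticsearch.spark.sql._

    // saveToEs blocks until the Spark job finishes, so wrap it in a Future;
    // Spark's job scheduler is thread-safe, so submitting from another thread is legitimate.
    val pending = Future { df.saveToEs("logs/doc") }
    // ... do other work, then check/await `pending` before shutting down.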

Increasing spark.yarn.executor.memoryOverhead degrades performance

2016-07-18 Thread Sunita Arvind
Hello Experts, For one of our streaming applications, we intermittently saw: WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. Based on what I found on the internet and the error
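The setting in question, for reference (sizes are illustrative; the overhead is in MB and sits on top of the executor heap, so the YARN container must fit both):

    spark-submit \
      --executor-memory 10g \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.example.StreamingJob myjob.jar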

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread Sunita Arvind
Thank you for your inputs. Will test it out and share my findings On Thursday, July 14, 2016, CosminC wrote: > Didn't have the time to investigate much further, but the one thing that > popped out is that partitioning was no longer working on 1.6.1. This would > definitely

Maintain complete state for updateStateByKey

2016-07-06 Thread Sunita Arvind
Hello Experts, I have a requirement to maintain a list of ids for every customer for all of time, and I should be able to provide a count of distinct ids on demand. All the examples I have seen so far indicate I need to maintain counts directly. My concern is, I will not be able to identify cumulative
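A sketch of keeping the full id set (rather than a count) per customer with updateStateByKey; events is assumed to be a DStream[(String, String)] of (customerId, id) pairs, and the checkpoint path is a placeholder:

    def updateIds(newIds: Seq[String], state: Option[Set[String]]): Option[Set[String]] =
      Some(state.getOrElse(Set.empty[String]) ++ newIds)

    ssc.checkpoint("hdfs:///checkpoints/ids")              // required by updateStateByKey
    val idsPerCustomer = events.updateStateByKey(updateIds)
    val distinctCounts = idsPerCustomer.mapValues(_.size)  // distinct count on demand

The trade-off is exactly the one implied above: the state grows without bound, since every id ever seen is retained.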

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
distribution data sets. Mentioning it here for the benefit of anyone else stumbling upon the same issue. regards Sunita On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Hello Experts, > > I am getting this error repeatedly: > > 16

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
regards Sunita On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind <sunitarv...@gmail.com> wrote: > Hello Experts, > > I am getting this error repeatedly: > > 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the > context, marking it as stopped

NullPointerException when starting StreamingContext

2016-06-22 Thread Sunita Arvind
Hello Experts, I am getting this error repeatedly: 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.lang.NullPointerException at com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
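The trace points at Java serialization of a Typesafe Config value captured in a closure. A common workaround (hedged; the key name and stream are placeholders) is to extract plain values on the driver and close over those instead of the Config object:

    val conf = com.typesafe.config.ConfigFactory.load()
    val outputPath = conf.getString("output.path")   // a plain String serializes safely
    stream.foreachRDD { rdd =>
      rdd.saveAsTextFile(outputPath)                 // no Config object in the closure
    }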

Seeking advice on realtime querying over JDBC

2016-06-02 Thread Sunita Arvind
Hi Experts, We are trying to get a Kafka stream ingested in Spark and expose the registered table over JDBC for querying. Here are some questions: 1. Spark Streaming supports a single context per application, right? If I have multiple customers and would like to create a kafka topic for each of them
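On the exposing-over-JDBC part, one hedged sketch for the Spark 1.6 era: run the Hive thrift server inside the streaming application and re-register each batch (hiveContext is assumed to be a HiveContext; the stream and table names are placeholders):

    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    HiveThriftServer2.startWithContext(hiveContext)  // JDBC clients connect on port 10000 by default
    stream.foreachRDD { rdd =>
      val df = hiveContext.createDataFrame(rdd)      // assumes an RDD of case-class records
      df.registerTempTable("customer_events")
    }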

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
https://issues.apache.org/jira/browse/SPARK-10815 "API >>> design: data sources and sinks" is relevant here. >>> >>> In short, it would seem the code is not there yet to create a Kafka-fed >>> Dataframe/Dataset that can be queried with Structured Streaming; or

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Dataframe/Dataset that can be queried with Structured Streaming; or if it > is, it's not obvious how to write such code. > > > -- > *From:* Anthony May <anthony...@gmail.com> > *To:* Deepak Sharma <deepakmc...@gmail.com>; Sunita Arvind < > sunitarv...

Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Hi All, We are evaluating a few real-time streaming query engines and Spark is my personal choice. The addition of ad hoc queries is what excites me most about it; however, the talks I have heard so far only mention it without providing details. I need to build a prototype to

Spark SQL - Registerfunction throwing MissingRequirementError in JavaMirror with primordial classloader

2015-04-26 Thread Sunita Arvind
Hi All, I am trying to use a function within spark sql which accepts 2 - 4 arguments. I got past the compilation errors; however, I see the attached runtime exception when calling it from Spark SQL (refer to the attachment for the complete stacktrace - StackTraceFor_runTestInSQL). The function
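For reference, registering a multi-argument function in the Spark 1.3-era API looks like this (the function body and names are invented):

    sqlContext.udf.register("inRange",
      (v: Int, lo: Int, hi: Int, pad: Int) => v >= lo - pad && v <= hi + pad)
    sqlContext.sql("SELECT inRange(col1, 10, 20, 2) FROM my_table")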

Unable to broadcast dimension tables with Spark SQL

2015-02-16 Thread Sunita Arvind
Hi Experts, I have a large table with 54 million records (fact table) being joined with 6 small tables (dimension tables). The on-disk size of each small table is under 5 KB and the record counts are in the range of 4 - 200. All worker nodes have 32 GB of RAM allocated to Spark. I have tried the
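For tables this small, one era-appropriate knob is the auto-broadcast threshold (the value and query are illustrative; note this only kicks in when Spark knows the table sizes, e.g. from Hive statistics):

    // Any table whose estimated size is below the threshold is broadcast to all workers.
    sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")  // 10 MB
    val joined = sqlContext.sql("SELECT f.k, d.name FROM fact f JOIN dim d ON f.k = d.k")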

Is pair rdd join more efficient than regular rdd

2015-02-01 Thread Sunita Arvind
Hi All, We are joining large tables using Spark SQL and running into shuffle issues. We have explored multiple options - using coalesce to reduce the number of partitions, tuning various parameters like disk buffer, reducing data in chunks, etc. - all of which seem to help, by the way. What I would like to know is,
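The join API is the same either way; what actually saves shuffles is co-partitioning the two sides up front. A sketch (the partition count, the input RDDs, and the .id field are placeholders):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(200)
    val left  = bigRdd.keyBy(_.id).partitionBy(p).persist()
    val right = otherRdd.keyBy(_.id).partitionBy(p).persist()
    // Both sides share a partitioner, so the join is a narrow dependency: no re-shuffle.
    val joined = left.join(right)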

Re: Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-21 Thread Sunita Arvind
I was able to resolve this by adding rdd.collect() after every stage. This forced RDD evaluation and helped avoid the choke point. regards Sunita Koppar On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind sunitarv...@gmail.com wrote: Hi, My spark jobs suddenly started getting hung and here

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Sunita Arvind
The below is not exactly a solution to your question, but it is what we are doing. The first time around we do end up calling row.getString(), but we immediately pass it through a map function which aligns it to either a case class or a StructType. Then we register it as a table and use just column
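The pattern described, sketched against the Spark 1.2-era API (the case class, column positions, and query are invented):

    case class Rec(name: String, age: Int)

    val recs = rowRdd.map(r => Rec(r.getString(0), r.getString(1).toInt))
    val schemaRdd = sqlContext.createSchemaRDD(recs)   // reflection derives the schema
    schemaRdd.registerTempTable("recs")
    sqlContext.sql("SELECT name FROM recs WHERE age > 30")   // columns by name from here on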

Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-17 Thread Sunita Arvind
Hi, My spark jobs suddenly started getting hung, and here is the debug leading up to it: following the program, it seems to be stuck whenever I do any collect(), count, or rdd.saveAsParquetFile. AFAIK, any operation that requires data to flow back to the master causes this. I increased the memory to 5 MB.

Transform SchemaRDDs into new SchemaRDDs

2014-12-08 Thread Sunita Arvind
Hi, I need to generate some flags based on certain columns and add them back to the SchemaRDD for further operations. Do I have to use a case class (via reflection) or build the schema programmatically? I am using parquet files, so the schema is being automatically derived. This is a great feature - thanks to Spark
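One era-appropriate route, hedged (the column and flag names are invented; a HiveContext gives the fuller SQL dialect): derive the flags in SQL and re-register the result, which stays a SchemaRDD throughout, so no case class is needed:

    parquetRdd.registerTempTable("src")
    val withFlags = sqlContext.sql(
      "SELECT *, CASE WHEN amount > 100 THEN 1 ELSE 0 END AS high_value_flag FROM src")
    withFlags.registerTempTable("src_flagged")   // ready for further operations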

Re: Spark setup on local windows machine

2014-12-02 Thread Sunita Arvind
, 2014 at 11:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You could try following this guidelines http://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows Thanks Best Regards On Wed, Nov 26, 2014 at 12:24 PM, Sunita Arvind sunitarv...@gmail.com wrote: Hi All, I just

Spark setup on local windows machine

2014-11-25 Thread Sunita Arvind
Hi All, I just installed spark on my laptop and am trying to get spark-shell to work. Here is the error I see: C:\spark\bin> spark-shell Exception in thread main java.util.NoSuchElementException: key not found: CLASSPATH at scala.collection.MapLike$class.default(MapLike.scala:228)

GraphX usecases

2014-08-25 Thread Sunita Arvind
Hi, I am exploring the GraphX library and trying to determine which use cases make the most sense for it. My initial thought was that GraphX could be applied to data stored in RDBMSs, since Spark could translate the relational data into a graph representation. However, there seems to

Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-07-03 Thread Sunita Arvind
-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ Romain On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind sunitarv...@gmail.com wrote: Hello Experts, I am attempting to integrate Spark Editor with Hue