Hello Spark Experts,
I am having challenges using the DataSource V2 API. I created a mock
The input partitions seem to be created correctly. The below output
confirms that:
19/06/23 16:00:21 INFO root: createInputPartitions
19/06/23 16:00:21 INFO root: Create a partition for abc
The
Had a similar situation and landed on this question.
Finally I was able to make it do what I needed by cheating the spark driver
:)
i.e. by setting a very high value for "--conf spark.task.maxFailures=800".
I deliberately made it 800; the default is typically 4. So by the time 800
attempts have failed
(UnsupportedOperationChecker.scala:297)
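For reference, a minimal submit command carrying that setting; the class and jar names below are placeholders, and only the maxFailures flag is the actual workaround:

```shell
# spark.task.maxFailures defaults to 4; raising it keeps the job alive
# through many task retry attempts (the "cheat" described above).
spark-submit \
  --conf spark.task.maxFailures=800 \
  --class com.example.MyApp \
  my-app.jar
```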
regards
Sunita
On Mon, Sep 18, 2017 at 10:15 AM, Michael Armbrust <mich...@databricks.com>
wrote:
> You specify the schema when loading a dataframe by calling
> spark.read.schema(...)...
>
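A minimal sketch of what that looks like, assuming a JSON file source and an illustrative two-field schema (this needs a Spark runtime, so it is shown untested):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("schema-example").getOrCreate()

// Declaring the schema up front avoids the "Schema must be specified when ..."
// error that sources raise when inference is unavailable.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = spark.read.schema(schema).json("/path/to/input")
```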
> On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind <
Hello Experts,
I am required to use a specific user id to save files on a remote HDFS
cluster. Remote in the sense that the Spark jobs run on EMR and write to a CDH
cluster. Hence I cannot change hdfs-site.xml etc. to point to the
destination cluster. As a result I am using webhdfs to save the files
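One way to sketch this (host, port, and user name are placeholders, not from the thread): Hadoop's FileSystem API accepts a webhdfs:// URI directly, and UserGroupInformation can act as the target user without touching hdfs-site.xml.

```scala
import java.net.URI
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// "etl_user" and the NameNode host/port below are illustrative.
val ugi = UserGroupInformation.createRemoteUser("etl_user")
ugi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // The webhdfs:// scheme routes writes over HTTP, so no client-side
    // hdfs-site.xml changes for the destination cluster are needed.
    val fs = FileSystem.get(new URI("webhdfs://cdh-namenode:14000"), new Configuration())
    val out = fs.create(new Path("/user/etl_user/output/part-00000"))
    out.write("example".getBytes("UTF-8"))
    out.close()
  }
})
```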
> Le 13 sept. 2017 01:51, "Sunita Arvind" <sunitarv...@gmail.com> a écrit :
>
> Hi Michael,
>
> I am wondering what I am doing wrong. I get error like:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Schema
> must be specified when
nt to S3. In my laptop,
they all point to local filesystem.
I am using Spark 2.2.0.
Appreciate your help.
regards
Sunita
On Wed, Aug 23, 2017 at 2:30 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> If you use structured streaming and the file sink, you can have a
> subsequent stream rea
cond job in the chain is only a
> few mins, you may have to end up creating a new file every few mins
>
> You may want to consider Kafka as your intermediary store for building a
> chain/DAG of streaming jobs
>
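The Kafka-as-intermediary suggestion above can be sketched as follows in Structured Streaming (broker address, topic names, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("chain-stage").getOrCreate()

// Stage N reads what stage N-1 published to Kafka...
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "stage1-output")
  .load()

// ...transforms it, and publishes for stage N+1, instead of exchanging files.
input.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "stage2-output")
  .option("checkpointLocation", "/tmp/chk/stage2")
  .start()
```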
> On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind <sunitarv...@gmail.
tain exactly
> once processing even if there are hiccups or failures.
>
> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarv...@gmail.com>
> wrote:
>
>> Hello Spark Experts,
>>
>> I have a design question w.r.t Spark Streaming. I have a streaming job
>>
Hello Spark Experts,
I have a design question w.r.t. Spark Streaming. I have a streaming job that
consumes protocol-buffer-encoded real-time logs from an on-premise Kafka
cluster. My Spark application runs on EMR (AWS) and persists data onto S3.
Before I persist, I need to strip the header and convert
Hello Experts,
I am trying to allow null values in numeric fields. Here are the details of
the issue I have:
http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields
I also tried making all columns nullable by using the below function (from
one of
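For reference, the commonly circulated helper for this is a schema rewrite; this is a sketch of that pattern, not necessarily the exact function referenced above:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Rebuilds every StructField with the desired nullability, keeping names,
// types, and metadata, then recreates the DataFrame against the new schema.
def setNullableForAllColumns(df: DataFrame, nullable: Boolean): DataFrame = {
  val schema = StructType(df.schema.map {
    case StructField(name, dataType, _, metadata) =>
      StructField(name, dataType, nullable, metadata)
  })
  df.sqlContext.createDataFrame(df.rdd, schema)
}
```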
sure I am not doing an overkill or overseeing a
potential issue.
regards
Sunita
On Tue, Oct 25, 2016 at 2:38 PM, Sunita Arvind <sunitarv...@gmail.com>
wrote:
> The error in the file I just shared is here:
>
> val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/&q
Thanks for the response Sean. I have seen the NPE on similar issues very
consistently and assumed that could be the reason :) Thanks for clarifying.
regards
Sunita
On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen wrote:
> This usage is fine, because you are only using the
Ajay,
AFAIK, these contexts generally cannot be accessed within loops. The SQL
query itself runs on distributed datasets, so it is already a parallel
execution. Putting it inside a foreach would nest parallelism within
parallelism, and serializing the context out to the executors becomes a
problem. I'm not sure I explained it right.
If you can
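To make the point concrete, a sketch (names are illustrative, assuming the 1.x SQLContext era of this thread): keep the loop on the driver rather than inside an RDD closure.

```scala
// Wrong: referencing sqlContext inside an RDD closure would force Spark to
// serialize the driver-side context to the executors.
// rdd.foreach { name => sqlContext.sql(s"SELECT * FROM $name") }

// OK: loop over a small driver-side collection; each sql() call is itself
// a distributed, parallel job.
val tableNames = Seq("orders", "customers") // illustrative
tableNames.foreach { name =>
  sqlContext.sql(s"SELECT COUNT(*) FROM $name").show()
}
```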
eeper")
df.saveAsParquetFile(conf.getString("ParquetOutputPath")+offsetSaved)
LogHandler.log.info("Created the parquet file")
}
Thanks
Sunita
On Tue, Oct 25, 2016 at 2:11 PM, Sunita Arvind <sunitarv...@gmail.com>
wrote:
> Attached is the edi
Sunita
On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.com>
wrote:
> Thanks for confirming Cody.
> To get to use the library, I had to do:
>
> val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"),
> "/consumers/topics/"+ t
>
> I'm honestly not sure specifically what else you are asking at this point.
>
> On Tue, Oct 25, 2016 at 1:39 PM, Sunita Arvind <sunitarv...@gmail.com>
> wrote:
> > Just re-read the kafka architecture. Something that slipped my mind is, it
> > is leader based
, Sunita Arvind <sunitarv...@gmail.com>
wrote:
> Hello Experts,
>
> I am trying to use the saving to ZK design. Just saw Sudhir's comments
> that it is old approach. Any reasons for that? Any issues observed with
> saving to ZK. The way we are planning to use it is
Hello Experts,
I am trying to use the saving-to-ZK design. I just saw Sudhir's comments that
it is an old approach. Any reasons for that? Any issues observed with saving
to ZK? The way we are planning to use it is:
1. Following http://aseigneurin.github.io/2016/05/07/spark-kafka-
Hello Experts,
Is there a way to get Spark to write to Elasticsearch asynchronously?
Below are the details
http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously
regards
Sunita
Hello Experts,
For one of our streaming applications, we intermittently saw:
WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory
limits. 12.0 GB of 12 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
Based on what I found on internet and the error
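For reference, the overhead the warning points at is set per application at submit time; the values below are examples only, not tuned recommendations:

```shell
# memoryOverhead is off-heap headroom (in MB) on top of --executor-memory;
# YARN kills the container when heap + overhead exceeds its limit, so the
# usual fix is raising the overhead rather than the heap.
spark-submit \
  --executor-memory 10g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my-app.jar
```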
Thank you for your inputs. Will test it out and share my findings
On Thursday, July 14, 2016, CosminC wrote:
> Didn't have the time to investigate much further, but the one thing that
> popped out is that partitioning was no longer working on 1.6.1. This would
> definitely
Hello Experts,
I have a requirement of maintaining a list of ids for every customer for
all of time. I should be able to provide a distinct-id count on demand. All
the examples I have seen so far indicate I need to maintain counts
directly. My concern is, I will not be able to identify cumulative
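If approximate answers are acceptable, one option (my suggestion, not something established in this thread) is HyperLogLog via approx_count_distinct, which keeps a small mergeable sketch per customer instead of the full id list; column names here are illustrative:

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// 0.01 is the target relative error; the underlying HLL sketch is mergeable,
// so counts can be maintained incrementally across batches.
val counts = events
  .groupBy("customer_id")
  .agg(approx_count_distinct("user_id", 0.01).as("distinct_user_ids"))
```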
distribution data sets. Mentioning it here for
benefit of anyone else stumbling upon the same issue.
regards
Sunita
On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind <sunitarv...@gmail.com>
wrote:
> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16
rds
Sunita
On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind <sunitarv...@gmail.com>
wrote:
> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the
> context, marking it as stoppe
Hello Experts,
I am getting this error repeatedly:
16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the
context, marking it as stopped
java.lang.NullPointerException
at
com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
Hi Experts,
We are trying to get a kafka stream ingested in Spark and expose the
registered table over JDBC for querying. Here are some questions:
1. Spark Streaming supports a single context per application, right? If I
have multiple customers and would like to create a Kafka topic for each of them
s://issues.apache.org/jira/browse/SPARK-10815 "API
>>> design: data sources and sinks" is relevant here.
>>>
>>> In short, it would seem the code is not there yet to create a Kafka-fed
>>> Dataframe/Dataset that can be queried with Structured Streaming; or
e/Dataset that can be queried with Structured Streaming; or if it
> is, it's not obvious how to write such code.
>
>
> --
> *From:* Anthony May <anthony...@gmail.com>
> *To:* Deepak Sharma <deepakmc...@gmail.com>; Sunita Arvind <
> sunitarv...
Hi All,
We are evaluating a few real-time streaming query engines and Spark is my
personal choice. The addition of ad-hoc queries is what gets me further
excited about it; however, the talks I have heard so far only mention it
without providing details. I need to build a prototype to
Hi All,
I am trying to use a function within Spark SQL which accepts 2-4
arguments. I was able to get through the compilation errors; however, I see
the attached runtime exception when trying it from Spark SQL.
(Refer to the attachment for the complete stack trace: StackTraceFor_runTestInSQL.)
The function
Hi Experts,
I have a large table with 54 million records (fact table), being joined
with 6 small tables (dimension tables). The size on disk of the small tables
is within 5 KB and the record counts are in the range of 4-200.
All the worker nodes have 32 GB of RAM allocated for Spark. I have tried the
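One standard option for a large fact table joined to tiny dimension tables (my suggestion, not from the thread; names are placeholders) is to broadcast the small side so the 54M-row table is never shuffled for the join:

```scala
import org.apache.spark.sql.functions.broadcast

// Ships each small dimension table to every executor; only the fact table
// stays partitioned, so the join needs no shuffle of the large side.
val joined = factDf.join(broadcast(dimDf), Seq("dim_id"))
```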
Hi All
We are joining large tables using Spark SQL and running into shuffle
issues. We have explored multiple options: using coalesce to reduce the
number of partitions, tuning various parameters like the disk buffer,
reducing data in chunks, etc., all of which seem to help, btw. What I would
like to know is,
I was able to resolve this by adding rdd.collect() after every stage. This
enforced RDD evaluation and helped avoid the choke point.
regards
Sunita Koppar
On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind sunitarv...@gmail.com
wrote:
Hi,
My spark jobs suddenly started getting hung and here
The below is not exactly a solution to your question, but this is what we
are doing. The first time, we do end up doing row.getString(), and we
immediately parse it through a map function which aligns it to either a
case class or a StructType. Then we register it as a table and use just
column
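A sketch of that pattern with illustrative field names (assuming the 1.x-era SQLContext of this thread, and an RDD[Row] as input):

```scala
import sqlContext.implicits._

case class LogRecord(id: Long, msg: String)

// Parse the untyped rows once; everything downstream works with named,
// typed columns instead of positional getString calls.
val parsed = rowRdd.map { row =>
  LogRecord(row.getString(0).toLong, row.getString(1))
}

val df = parsed.toDF()
df.registerTempTable("logs")
```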
Hi,
My spark jobs suddenly started getting hung and here is the debug leading
to it:
Following the program, it seems to be stuck whenever I do any collect(),
count(), or rdd.saveAsParquetFile(). AFAIK, any operation that requires data
to flow back to the master causes this. I increased the memory to 5 MB.
Hi,
I need to generate some flags based on certain columns and add them back to
the schemaRDD for further operations. Do I have to use a case class
(reflection or programmatically)? I am using Parquet files, so the schema is
being automatically derived. This is a great feature, thanks to Spark
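If you are on a Spark version with the DataFrame API (1.3+), derived flag columns can be added without redefining the schema by hand; column names below are placeholders:

```scala
import org.apache.spark.sql.functions.{col, when}

// Adds a 0/1 flag computed from an existing column; the schema picks up
// the new column automatically, so no case class is required.
val flagged = df.withColumn("is_large_order",
  when(col("amount") > 1000, 1).otherwise(0))
```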
, 2014 at 11:09 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You could try following this guidelines
http://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows
Thanks
Best Regards
On Wed, Nov 26, 2014 at 12:24 PM, Sunita Arvind sunitarv...@gmail.com
wrote:
Hi All,
I just
Hi All,
I just installed a spark on my laptop and trying to get spark-shell to
work. Here is the error I see:
C:\spark\bin>spark-shell
Exception in thread "main" java.util.NoSuchElementException: key not found: CLASSPATH
at scala.collection.MapLike$class.default(MapLike.scala:228)
Hi,
I am exploring the GraphX library and trying to determine which use cases
make the most sense for it. From what I initially thought, it looked like
GraphX could be applied to data stored in RDBMSs, as Spark could translate
the relational data into a graph representation. However, there seems to
-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
Romain
On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind sunitarv...@gmail.com wrote:
Hello Experts,
I am attempting to integrate Spark Editor with Hue