Hi,
I am facing an issue with fixed-length byte arrays when reading a Spark
DataFrame from Parquet. Setting spark.sql.parquet.enableVectorizedReader = false
solves my issue, but it causes a significant performance hit. Any resolution for
this?
Thanks,
Asmath
Hi,
I am running a Spark job on an EC2 machine which has 40 cores. The driver
and executor memory is 16 GB. I am using local[*] but I still get only one
executor (the driver). Is there a way to get more executors with this config?
I am not using YARN or Mesos in this case. Only one machine which is en
fsets. I assume you were using
> file-based instead of Kafka-based data sources. Are the incoming data
> generated in mini-batch files or in a single large file? Have you had this
> type of problem before?
>
>> On 7/21/22 1:02 PM, KhajaAsmath Mohammed wrote:
>> Hi,
>
Hi,
I am seeing weird behavior in our Spark Structured Streaming application
where the offsets are not getting picked up by the streaming job.
If I delete the checkpoint directory and run the job again, I can see the
data for the first batch, but it is not picking up new offsets again from
the next
Hi,
I am trying to read data from Confluent Kafka using the Avro schema registry.
The messages are always empty and the stream always shows empty records. Any
suggestions on this, please?
Thanks,
Asmath
Hi,
What is the equivalent of this using the DataFrame API in Spark? I was able to
make it work with Spark SQL but I am looking to use DataFrames instead.
df11=self.spark.sql("""SELECT transaction_card_bin,(CASE WHEN
transaction_card_bin REGEXP '^5[1-5][\d]*' THEN "MC"
WHEN transaction_card_bin REGEXP '^4[
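For anyone looking for the DataFrame equivalent, here is a minimal sketch in Scala, assuming a DataFrame named df with the transaction_card_bin column (the same when/rlike functions exist in pyspark.sql.functions); the Visa branch and the "OTHER" fallback are assumptions, since the original query is truncated.

import org.apache.spark.sql.functions.{col, when}

// Chained when/otherwise mirrors the SQL CASE WHEN; rlike mirrors REGEXP.
val labeled = df.withColumn("card_network",
  when(col("transaction_card_bin").rlike("^5[1-5]\\d*"), "MC")
    .when(col("transaction_card_bin").rlike("^4\\d*"), "VISA")   // assumed branch
    .otherwise("OTHER"))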
Hi,
I have written a sample Spark job that reads data residing in HBase. I
keep getting the error below; any suggestions to resolve this, please?
Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified by setting the fs.s3.awsAccessKeyId and
fs.s3
Thanks everyone. I was able to resolve this.
Here is what I did: I just passed the conf file using the --files option.
My mistake was reading the JSON conf file before creating the Spark session;
reading it after creating the Spark session fixed it. Thanks once again for your
valuable suggestions.
Tha
7g
--executor-memory 7g /appl/common/ftp/ftp_event_data.py
/appl/common/ftp/conf.json 2021-05-10 7
On Fri, May 14, 2021 at 6:19 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> Sorry my bad, it did not resolve the issue. I still have the same issue.
> can anyone please gui
Sorry, my bad, it did not resolve the issue. I still have the same issue.
Can anyone please guide me? I was still running as a client instead of a
cluster.
On Fri, May 14, 2021 at 5:05 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> You are right. It worked but I st
You are right. It worked but I still don't understand why I need to pass
that to all executors.
On Fri, May 14, 2021 at 5:03 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> I am using json only to read properties before calling spark session. I
> don't know wh
nt local FS)
> /appl/common/ftp/conf.json
>
>
>
>
>
> *From: *KhajaAsmath Mohammed
> *Date: *Friday, May 14, 2021 at 4:50 PM
> *To: *"user @spark"
> *Subject: *[EXTERNAL] Urgent Help - Py Spark submit error
>
>
>
> /appl/common/ftp/conf.json
>
Hi,
I am having a weird situation where the command below works when the
deploy mode is client and fails when it is cluster.
spark-submit --master yarn --deploy-mode client --files
/etc/hive/conf/hive-site.xml,/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
--driver-memory 70g --n
Hi,
I have tried multiple ways to use hbase-spark and none of them works as
expected. Both SHC and the hbase-spark library load all the data onto the
executors and the job runs forever.
https://ramottamado.dev/how-to-use-hbase-fuzzyrowfilter-in-spark/
The link above has the solution that I am looking for.
Hi,
Spark submit is connecting to localhost instead of the ZooKeeper quorum mentioned
in hbase-site.xml. The same program works in the IDE, where it picks up
hbase-site.xml. What am I missing in spark-submit?
>
>
> spark-submit --driver-class-path
> C:\Users\mdkha\bitbucket\clx-spark-scripts\src\te
Hi,
I am having a similar issue to the one mentioned in the link below, but I was
not able to resolve it. Any other solutions, please?
https://stackoverflow.com/questions/57329060/exclusion-of-dependency-of-spark-core-in-cdh
Thanks,
Asmath
Hi,
I am having an issue when running custom applications on Spark 2.4. I was able
to run them successfully in my Windows IDE but cannot run them on EMR Spark 2.4;
I get a JsonMethods not found error.
I have included json4s in the uber jar but I still get this error. Any solution
to resolve this?
Thanks,
Asmath
---
I was able to resolve this by changing hdfs-site.xml, as I mentioned in my
initial thread.
Thanks,
Asmath
> On Apr 12, 2021, at 8:35 PM, Peng Lei wrote:
>
>
> Hi KhajaAsmath Mohammed
> Please check the configuration of "spark.speculation.interval", ju
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:936)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession
Thanks,
Asmath
On Mon, Apr 12, 2021 at 2:20 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> I a
but give it "30s".
> Is it a time property somewhere that really just wants MS or something? But
> most time properties (all?) in Spark should accept that type of input anyway.
> Really depends on what property has a problem and what is setting it.
>
>> On Mon,
Hi,
I am getting a weird error when running a Spark job on an EMR cluster. The same
program runs on my local machine. Is there anything that I need to do to
resolve this?
21/04/12 18:48:45 ERROR SparkContext: Error initializing SparkContext.
java.lang.NumberFormatException: For input string: "30s"
I tried
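For reference, "30s" usually comes from a duration-valued Hadoop property that an older client library tries to parse as a plain number. A hedged sketch, assuming the culprit is dfs.client.datanode-restart.timeout (a common one; the actual property is not shown in the stack trace here): pass a plain numeric value through the spark.hadoop.* prefix, or set the same value in hdfs-site.xml as described later in this thread.

import org.apache.spark.sql.SparkSession

// Hypothetical property name; "30" (plain seconds) instead of the "30s" duration string.
val spark = SparkSession.builder()
  .appName("emr-job")
  .config("spark.hadoop.dfs.client.datanode-restart.timeout", "30")
  .getOrCreate()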
Hi,
I am trying to connect to an HBase table that is exposed in Hive as an external
table. I am getting the exception below. Am I missing anything to pass here?
21/04/09 18:08:11 INFO ZooKeeper: Client environment:user.dir=/
21/04/09 18:08:11 INFO ZooKeeper: Initiating client connection,
connectString=localhost:
>> On Wed, 24 Mar 2021 at 01:19, KhajaAsmath Mohammed
>> wrote:
>> Hi,
>>
>> I have 10gb file that should be loaded into spark dataframe. This file is
>> csv with h
Thanks Mich. I understood what I am supposed to do now and will try these
options.
I still don't understand how Spark will split the large file. I have a
10 GB file which I want to split automatically after reading. I can split
and load the file before reading, but it is a very big requirement chan
format(“avro”).save(...)
>
>
>> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed
>> wrote:
>>
>> Hi,
>>
>> I have 10gb file that should be loaded into spark dataframe. This file is
>> csv with header and we were using rdd.zipwithindex to get colu
Hi,
I have a 10 GB file that should be loaded into a Spark DataFrame. The file is a
CSV with a header, and we were using rdd.zipWithIndex to get the column names
and convert it to Avro accordingly.
I am assuming this is why it takes a long time: only one executor runs and the
job never achieves parallelism. Is there an easy w
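A minimal sketch of what the replies suggest, assuming the CSV is uncompressed (and therefore splittable) and the paths are placeholders: the DataFrame reader already parallelises the read across input splits and takes column names from the header, so no zipWithIndex pass is needed before converting to Avro.

// Placeholder input and output paths.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/incoming/large_file.csv")

// Avro output; built in from Spark 2.4, earlier versions need the spark-avro
// package and format("com.databricks.spark.avro").
df.write.format("avro").save("hdfs:///data/output/large_file_avro")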
Thanks Sean. I just realized it. Let me try that.
On Mon, Mar 22, 2021 at 12:31 PM Sean Owen wrote:
> You need to do something with the result of repartition. You haven't
> changed textDF
>
> On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com&g
Hi,
I have a use case where there are large files in HDFS; the size of each file is
3 GB.
It is existing code in production and I am trying to improve the
performance of the job.
Sample Code:
textDF=dataframe ( This is dataframe that got created from hdfs path)
logging.info("Number of partitions"
flume
> transaction batch size.
>
>
> KhajaAsmath Mohammed 于2020年10月21日周三 下午12:19写道:
>> Yes. Changing back to latest worked but I still see the slowness compared to
>> flume.
>>
>> Sent from my iPhone
>>
>>>> On Oct 20, 2020, at 10:21 PM, lec ssmi
2:19写道:
>> Are you getting any output? Streaming jobs typically run forever, and keep
>> processing data as it comes in the input. If a streaming job is working
>> well, it will typically generate output at a certain cadence
>>
>>
>>
>> From: KhajaAsmath M
Hi,
I have started using Spark Structured Streaming for reading data from Kafka,
and the job is very slow. The number of output rows keeps increasing in query 0
and the job runs forever. Any suggestions for this, please?
[image: image.png]
Thanks,
Asmath
Hi,
I am looking for some information on how to read from a database that uses
OAuth authentication with Spark JDBC. Any links that point to this approach
would be really helpful.
Thanks,
Asmath
Hi,
>
> We have a producer application that has written data to Kafka topic.
>
> We are reading the data from Kafka topic using spark streaming but the time
> stamp on Kafka is 1969-12-31 format for all the data.
>
> Is there a way to fix this while reading ?
>
> Thanks,
> Asmath
>
Hi,
I am looking for some information on how to gracefully kill a Spark
Structured Streaming Kafka job and redeploy it.
How do I kill a Spark Structured Streaming job in YARN?
Any suggestions on how to kill it gracefully?
I was able to monitor the job from the SQL tab, but how can I kill this job when
deployed i
Hi,
When doing an application upgrade for Spark Structured Streaming, do we need to
delete the checkpoint, or does it start consuming offsets from the point where we
left off? For the Kafka source we need to use the option "startingOffsets" with a
JSON string like
""" {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
f Kafka's view, especially the gap between highest
> offset and committed offset.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Jul 6, 2020 at 2:53 AM Gabor Somogyi
> wrote:
>
>> In 3.0 the community just added it.
>>
>>
Hi,
We are trying to move our existing code from Spark DStreams to Structured
Streaming for one of the old applications we built a few years ago.
The Structured Streaming job doesn't have a Streaming tab in the Spark UI. Is
there a way to monitor a job submitted by us in Structured Streaming? Since
Hi,
What is the best approach in Scala to loop through three DataFrames based on
some keys, instead of using collect?
Thanks,
Asmath
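A minimal sketch of the usual alternative to collecting, assuming the three DataFrames (df1, df2, df3 are placeholders) share a key column named "id": join them on the key and let Spark distribute the work.

// Inner-join the three DataFrames on the shared key column.
val joined = df1
  .join(df2, Seq("id"))
  .join(df3, Seq("id"))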
artingOffsets contains specific offsets, you must specify all
> TopicPartitions.
>
> BR,
> G
>
>
> On Tue, May 21, 2019 at 9:16 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I am getting below errror when running sampl
Hi,
I am getting the error below when running a sample streaming app. Does anyone
have a resolution for this?
JSON OFFSET {"test":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0}}
- Herreee
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
|-- topic: string (nullable = true)
|--
Hi,
I have followed this link
https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/77824
to connect to Teradata from Spark.
I was able to print the schema if I give a table name instead of a SQL query.
I am getting the error below if I give a query (code sn
Hi,
I have a Teradata table with more than 2.5 billion records, and the data size
is around 600 GB. I am not able to pull it efficiently using Spark SQL and the
job has been running for more than 11 hours. Here is my code.
val df2 = sparkSession.read.format("jdbc")
.option("url", "jdbc:teradata://PROD/D
Hi,
I am getting the exception below when using Spark KMeans. Any solutions
from the experts would be really helpful.
val kMeans = new KMeans().setK(reductionCount).setMaxIter(30)
val kMeansModel = kMeans.fit(df)
The error occurs when calling kMeans.fit.
Exception in thread "main" java.la
Hi,
I am new to Structured Streaming but have used DStreams a lot. The difference I
see in the Spark UI is the Streaming tab for DStreams.
Is there a way to know how many records and batches were executed in
Structured Streaming, and is there any option to see a Streaming tab?
Thanks,
Asmath
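In the Spark 2.x UI there is no Streaming tab for Structured Streaming (it was added in a later release), so the usual way to see per-batch record counts is query progress. A minimal sketch, assuming spark is the session and query is a running StreamingQuery:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Print the number of input rows processed in each micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"batch=${event.progress.batchId} rows=${event.progress.numInputRows}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

// Or poll the most recent progress of one query:
println(query.lastProgress)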
Hi Cody,
I am following this to implement exactly-once semantics and also to store the
offsets in a database. The question I have is how to use Hive instead
of traditional datastores: a write to Hive will succeed even if
there is an issue with saving the offsets to the DB. Could you please corr
Hi,
I have enabled dynamic allocation for a Spark Streaming application, but the
number of containers always shows as 2. Is there a way to get more while the job
is running?
Thanks,
Asmath
Hi,
Is there any best approach to reduce shuffling in Spark? I have two tables
and both of them are large. Any suggestions? I read about broadcast joins, but
that will not work in my case.
Thanks,
Asmath
I am also looking for the same answer. Will this work in streaming application
too ??
Sent from my iPhone
> On Feb 12, 2018, at 8:12 AM, Debabrata Ghosh wrote:
>
> Georg - Thanks ! Will you be able to help me with a few examples please.
>
> Thanks in advance again !
>
> Cheers,
> D
>
>> On
Hi,
Could anyone please share an example of how to use Spark Structured Streaming
with Kafka and write the data into Hive? The versions that I have are
Spark 2.1 on CDH 5.10
Kafka 0.9
Thanks,
Asmath
Sent from my iPhone
Hi,
I have written a Spark Streaming job to use checkpointing. I stopped
the streaming job for 5 days and then restarted it today.
I have encountered a weird issue where it shows zero records for all
cycles to date. Is this causing data loss?
[image: Inline image 1]
Thanks,
Asmath
Hi,
I have been using Spark Streaming with Kafka. I have to restart the
application daily due to a KMS issue, and after the restart the offsets do not
match the point where I left off. I am creating the checkpoint directory with
val streamingContext = StreamingContext.getOrCreate(checkPointDir, () =>
cre
Hi,
Could anyone please provide your thoughts on how to kill a Spark Streaming
application gracefully?
I followed link of
http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html
https://github.com/lanjiang/streamingstopgraceful
I played around with having either p
Any solutions for this problem, please?
Sent from my iPhone
> On Jan 17, 2018, at 10:39 PM, KhajaAsmath Mohammed
> wrote:
>
> Hi,
>
> I have created a streaming object from checkpoint but it always through up
> error as stream corrupted when I restart spark stream
Hi,
I have created a streaming context from a checkpoint, but it always throws a
stream-corrupted error when I restart the Spark Streaming job. Any solution
for this?
private def createStreamingContext(
sparkCheckpointDir: String, sparkSession: SparkSession,
batchDuration: Int, config: com.t
It could be a missing persist before the checkpoint
>
> > On 16. Jan 2018, at 22:04, KhajaAsmath Mohammed
> wrote:
> >
> > Hi,
> >
> > Spark streaming job from kafka is not picking the messages and is always
> taking the latest offsets when streaming job is sto
Hi,
The Spark Streaming job from Kafka is not picking up the messages and always
takes the latest offsets when the streaming job has been stopped for 2 hours. It
is not picking up the offsets that need to be processed from the
checkpoint directory. Any suggestions on how to process the old messages
too wh
Hi,
I keep getting a NullPointerException in the Spark Streaming job with
checkpointing. Any suggestions to resolve this?
Exception in thread "pool-22-thread-9" java.lang.NullPointerException
at
org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
at
java
I am looking for the same answer too; I will wait for responses from other people.
Sent from my iPhone
> On Dec 18, 2017, at 10:56 PM, Gaurav1809 wrote:
>
> Hi All,
>
> Will Bigdata tools & technology work with Blockchain in future? Any possible
> use cases that anyone is likely to face, please sha
Hi,
I am running a Spark SQL job and it completes without any issues, but I am
getting errors like
ERROR: SparkListenerBus has already stopped! Dropping event
SparkListenerExecutorMetricsUpdate after completion of the job. Could anyone
share your suggestions on how to avoid it?
Thanks,
Asmath
This didn't work. I tried it but no luck.
On Wed, Nov 29, 2017 at 7:49 PM, Vadim Semenov
wrote:
> You can pass `JAVA_HOME` environment variable
>
> `spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0`
>
> On Wed, Nov 29, 2017 at 10:54 AM, KhajaAsmath Mohammed <
> mdkhajaas
Hi,
I am running the Cloudera version of Spark 2.1 and our cluster is on JDK 1.7.
For some of the libraries I need JDK 1.8. Is there a way to run the Spark
workers on JDK 1.8 without upgrading?
I was able to run the driver on JDK 1.8 by setting the path, but not the workers.
17/11/28 20:22:27 WARN scheduler
[image: Inline image 1]
This is what we are on.
On Wed, Nov 22, 2017 at 12:33 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> We use oracle JDK. we are on unix.
>
> On Wed, Nov 22, 2017 at 12:31 PM, Georg Heiler
> wrote:
>
>> Do you use oracle or open jd
these installed?
>
> KhajaAsmath Mohammed schrieb am Mi. 22. Nov.
> 2017 um 19:29:
>
>> I passed keytab, renewal is enabled by running the script every eight
>> hours. User gets renewed by the script every eight hours.
>>
>> On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler
I passed keytab, renewal is enabled by running the script every eight
hours. User gets renewed by the script every eight hours.
On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler
wrote:
> Did you pass a keytab? Is renewal enabled in your kdc?
> KhajaAsmath Mohammed schrieb am Mi. 22. Nov.
Hi,
I have written a Spark Streaming job and it ran successfully for more
than 36 hours. After around 36 hours the job failed with a Kerberos issue.
Any solution on how to resolve it?
org.apache.spark.SparkException: Task failed while wri\
ting rows.
at
org.apache.spark.sql.hi
Hi,
I am able to write data into Hive tables from Spark Streaming. The job ran
successfully for 37 hours and then I started getting task failure errors as
below. The Hive table has data up until the tasks failed.
Job aborted due to stage failure: Task 0 in stage 691.0 failed 4 times,
most recent failure:
Hi,
I am running a Spark Streaming job and it is not picking up the next batches,
but the job still shows as running on YARN.
Is this expected behavior if there is no data, or is it waiting for data to pick
up?
I am almost 4 hours behind on batches (30-minute interval).
[image: Inline image 1]
[image: I
Hi,
I have the following schema in a DataFrame and I want to extract the entry whose
key matches MaxSpeed from the array, along with the corresponding value of that key.
|-- tags: array (nullable = true)
||-- element: struct (containsNull = true)
|||-- key: string (nullable = true)
|||-- value:
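A minimal sketch for pulling out the entry whose key is MaxSpeed, assuming the DataFrame above is named df: explode the tags array and filter on the struct's key field.

import org.apache.spark.sql.functions.{col, explode}

val maxSpeed = df
  .select(explode(col("tags")).as("tag"))   // one row per array element
  .filter(col("tag.key") === "MaxSpeed")
  .select(col("tag.value").as("max_speed"))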
Here is a screenshot. The status shows Finished, but it should still be running
so the next batch can pick up the data.
[image: Inline image 1]
On Thu, Nov 16, 2017 at 10:01 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> Hi,
>
> I have scheduled spark streaming job to run every 30
Hi,
I have scheduled a Spark Streaming job to run every 30 minutes and it was
running fine for 32 hours, then suddenly I see a status of Finished instead
of Running (since it always runs in the background and shows up in the Resource
Manager).
Am I doing anything wrong here? How come the job finished without p
Hi,
I am new to using Spark Streaming. I have developed one Spark
Streaming job which runs every 30 minutes with a checkpointing directory.
I have to implement a minor change; should I kill the Spark Streaming job once
the batch is completed, using the yarn application -kill command, and update the
j
Hi,
I have not been successful using Spark 2.1 with Kafka 0.9. Can anyone
please share a code snippet for how to use it?
val sparkSession: SparkSession = runMode match {
case "local" => SparkSession.builder.config(sparkConfig).getOrCreate
case "yarn" =>
SparkSession.builder.config(sparkConfig)
Hi,
I am using Spark Streaming to write data back into Hive with the code snippet
below.
eventHubsWindowedStream.map(x => EventContent(new String(x)))
.foreachRDD(rdd => {
val sparkSession = SparkSession
.builder.enableHiveSupport.getOrCreate
import sparkSession.implicits
y("creationTime").start()* to
> *format("console").start().*
>
> On Fri, Oct 27, 2017 at 8:41 AM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi TathagataDas,
>>
>> I was trying to use eventhub with spark streaming. Looks like I
/spark/sql/examples/EventHubsStructuredStreamingExample.scala
is the code snippet I have used to connect to eventhub
Thanks,
Asmath
On Thu, Oct 26, 2017 at 9:39 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> Thanks TD.
>
> On Wed, Oct 25, 2017 at 6:42 PM
hub support structured streaming at all yet?
>
> On Fri, 27 Oct 2017 at 1:43 pm, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> Could anyone share if there is any code snippet on how to use spark
>> structured streaming with event hu
Hi,
Could anyone share a code snippet on how to use Spark Structured
Streaming with Event Hubs?
Thanks,
Asmath
Sent from my iPhone
/spark-summit.org/
> 2017/speakers/tathagata-das/
>
> On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Thanks Subhash.
>>
>> Have you ever used zero data loss concept with streaming. I am bit
>> w
, Oct 25, 2017 at 3:10 PM, Subhash Sriram
wrote:
> No problem! Take a look at this:
>
> http://spark.apache.org/docs/latest/structured-streaming-
> programming-guide.html#recovering-from-failures-with-checkpointing
>
> Thanks,
> Subhash
>
> On Wed, Oct 25, 2017 at 4:
lp:
>
> https://databricks.com/blog/2017/02/23/working-complex-
> data-formats-structured-streaming-apache-spark-2-1.html
>
> I hope this helps.
>
> Thanks,
> Subhash
>
> On Wed, Oct 25, 2017 at 2:59 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
Hi,
Could anyone provide suggestions on how to parse JSON data from Kafka and
load it back into Hive?
I have read about Structured Streaming but didn't find any examples. Is
there any best practice on how to read it and parse it with Structured
Streaming for this use case?
Thanks,
Asmath
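A minimal sketch of the reading and parsing side, with a placeholder broker and topic and a hypothetical two-field schema; the Kafka source and from_json are available from Spark 2.1, and in that version the write side is usually a file sink onto a path backing a Hive external table.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical schema for the JSON payload.
val schema = new StructType()
  .add("id", StringType)
  .add("event_ts", TimestampType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")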
ssion.sql(deltaDSQry)
Here is the code and also properties used in my project.
On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu
wrote:
> Can you share some code?
>
> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed,
> wrote:
>
>> In my case I am just writing the data fra
the
> shuffle as that one will take the value you've set
>
> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Yes still I see more number of part files and exactly the number I have
>> defined did spark.sql.shuffle.partitio
at 3:45 AM, Tushar Adeshara
>>> wrote:
>>> You can also try coalesce as it will avoid full shuffle.
>>>
>>>
>>> Regards,
>>> Tushar Adeshara
>>>
>>> Technical Specialist – Analytics Practice
>>>
>>> Cel
w.persistentsys.com
> <http://www.persistentsys.com/>*
>
>
> --
> *From:* KhajaAsmath Mohammed
> *Sent:* 13 October 2017 09:35
> *To:* user @spark
> *Subject:* Spark - Partitions
>
> Hi,
>
> I am reading hive query and wi
Hi,
I am reading a Hive query and writing the data back into Hive after doing
some transformations.
I changed the setting spark.sql.shuffle.partitions to 2000, and since then the
job completes fast, but the main problem is that I am getting 2000 files for
each partition, each around 10 MB in size.
Is there a w
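A minimal sketch of the usual fix, assuming result is the transformed DataFrame and part_col is the Hive partition column (both placeholders): repartition by the partition column before the write, so the file count per Hive partition is no longer tied to spark.sql.shuffle.partitions.

import org.apache.spark.sql.functions.col

// All rows for a given partition value land in one task, so each Hive partition
// gets one file; repartition(50, col("part_col")) would cap the task count at 50.
val compacted = result.repartition(col("part_col"))
compacted.write.mode("append").insertInto("db.target_table")   // placeholder table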
Hi,
I am getting the error below when trying to cast a column value from a Spark
DataFrame to double. Any ideas? I tried many solutions but none of them
worked.
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Double
1. row.getAs[Double](Constants.Datapoint.Latitude)
2. row.
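The exception usually means the column is physically a string, so getAs[Double] fails at runtime. Two hedged options, reusing the Constants.Datapoint.Latitude name from the snippet above (df is a placeholder for the source DataFrame):

import org.apache.spark.sql.functions.col

// Option 1: read the value as the type it actually is, then convert.
val latitude = row.getAs[String](Constants.Datapoint.Latitude).toDouble

// Option 2: cast the column once in the DataFrame, before touching rows.
val casted = df.withColumn(Constants.Datapoint.Latitude,
  col(Constants.Datapoint.Latitude).cast("double"))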
Hi,
I am getting a java.lang.OutOfMemoryError: Java heap space error whenever I
run the Spark SQL job.
I came to the conclusion that the issue is the number of files Spark reads:
I am reading 37 partitions and each partition has around 2000 files with
a file size of more than 128 MB, i.e. 37*2000 files from
Hi,
I am initializing an ArrayList before iterating through the map method, and I
always get the list size as zero after the map operation.
How do I add values to a list inside the map method of a DataFrame? Any
suggestions?
val points = new
java.util.ArrayList[com.vividsolutions.jts.geom.Coordin
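What is probably happening: the map closure runs on the executors, so adding to a driver-side ArrayList never changes the list the driver sees. A minimal sketch of the usual alternative, with placeholder longitude/latitude column names and the JTS Coordinate class from the snippet above: select the needed columns, collect them back to the driver, and build the collection there instead of mutating shared state inside map.

import com.vividsolutions.jts.geom.Coordinate

// Build Coordinates from the collected rows on the driver.
val points: Array[Coordinate] = df
  .select("longitude", "latitude")   // placeholder column names
  .collect()
  .map(r => new Coordinate(r.getDouble(0), r.getDouble(1)))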
n Sun, Aug 20, 2017 at 1:35 PM, ayan guha wrote:
> Just curious - is your dataset partitioned on your partition columns?
>
> On Mon, 21 Aug 2017 at 3:54 am, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> We are in cloudera CDH5.10 and we are using spark 2 tha
e much more performant with much more
> features.
>
> Alternatively you can try to register the df as temptable and do a insert
> into the Hive table from the temptable using Spark sql ("insert into table
> hivetable select * from temptable")
>
> On 20. Aug 2017, at 1
apache.org/jira/browse/SPARK-20049
I saw something in the above link; not sure if it is the same thing in my case.
Thanks,
Asmath
On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke wrote:
> Have you made sure that the saveastable stores them as parquet?
>
> On 20. Aug 2017, at 18:07, KhajaAsmat
> writes to Hive. One issue for the slow storing into a hive table could be
> that it writes by default to csv/gzip or csv/bzip2
>
> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed
> wrote:
> >
> > Yes we tried hive and want to migrate to spark for better performanc
> In which Format do you expect Hive to write? Have you made sure it is in this
> format? It could be that you use an inefficient format (e.g. CSV + bzip2).
>
>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed
>> wrote:
>>
>> Hi,
>>
>> I have writ
Hi,
I have written a Spark SQL job on Spark 2.0 using Scala. It just pulls the data
from a Hive table, adds extra columns, removes duplicates, and then writes it
back to Hive again.
In the Spark UI, it takes almost 40 minutes to write 400 GB of data. Is there
anything that I need to improv
gt;
> What’s more, you can use tools like the Spark UI or Ganglia to determine
> which step is failing and why. What is the overall cluster size? How many
> executors do you have? Is it an appropriate count for this cluster’s cores?
> I’m assuming you are using YARN?
>
> -Pa
Hi,
I am using persist before inserting DataFrame data back into Hive. This step
adds 8 minutes to my total execution time. Is there a way to reduce the total
time without running into out-of-memory issues? Here is my code.
val datapoint_df: Dataset[Row] = sparkSession.sql(transposeHiveQry);
> On Aug 17, 2017, at 11:43 PM, Pralabh Kumar wrote:
>
> what's is your exector memory , please share the code also
>
>> On Fri, Aug 18, 2017 at 10:06 AM, KhajaAsmath Mohammed
>> wrote:
>>
>> HI,
>>
>> I am getting below error when running
HI,
I am getting the error below when running Spark SQL jobs. This error is thrown
after running 80% of the tasks. Any solution?
spark.storage.memoryFraction=0.4
spark.sql.shuffle.partitions=2000
spark.default.parallelism=100
#spark.eventLog.enabled=false
#spark.scheduler.revive.interval=1s
spark.driver.
are all the
> files you see?
>
> On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> tempTable = union_df.registerTempTable("tempRaw")
>>
>> create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dp