Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
In the following link you will find all the required details: https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/ Let me know if you require further information. Regards, Vaquar khan On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh wrote: > Couple of points >

Re: Online classes for spark topics

2023-03-12 Thread vaquar khan
I saw you are looking for Holden's video. Please find the following link: https://www.oreilly.com/library/view/debugging-apache-spark/9781492039174/ Regards, Vaquar khan On Sun, Mar 12, 2023, 6:56 PM Mich Talebzadeh wrote: > Hi Denny, > > Thanks for the offer. How do you envisage that

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
@Gourav Sengupta, why are you sending unnecessary emails? If you think Snowflake is good, please use it. The question here was different, and you are talking about a totally different topic. Please respect the group guidelines. Regards, Vaquar khan On Wed, Dec 28, 2022, 10:29 AM vaquar khan wrote: > Here you can f

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
Here you can find all the details. You just need to pass a Spark DataFrame; Deequ also generates recommendations for rules, and you can write custom complex rules as well. https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ Regards, Vaquar khan On Wed, Dec 28, 2022, 9:40 AM

Re: Profiling data quality with Spark

2022-12-27 Thread vaquar khan
I would suggest Deequ; I have implemented it many times, and it is easy and effective. Regards, Vaquar khan On Tue, Dec 27, 2022, 10:30 PM ayan guha wrote: > The way I would approach is to evaluate GE, Deequ (there is a python > binding called pydeequ) and others like Delta Live tables with expect

Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-03 Thread vaquar khan
eaucoup mes amis :) > > [1] https://stackoverflow.com/q/66933229/1305344 > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski >

Re: Coalesce vs reduce operation parameter

2021-03-20 Thread vaquar khan
Hi Pedro, What is your use case, and why did you use coalesce? coalesce() and repartition() can be very expensive operations when they shuffle data across many partitions, so try to minimize repartitioning as much as possible. Regards, Vaquar khan On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero wrote: > I was review
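
To make the trade-off in this reply concrete: in Spark, coalesce(n) merges existing partitions and by default avoids a full shuffle, while repartition(n) redistributes every record. The sketch below is a plain-Python model of that difference, not Spark code; the function names and the round-robin placement are illustrative assumptions.

```python
def coalesce(partitions, n):
    """Model of coalesce: merge neighbouring partitions into n groups.

    Records stay together with their original partition, so an uneven
    input produces an uneven output.
    """
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Model of repartition: every record is moved (a full shuffle)."""
    result = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            result[position % n].append(record)
            position += 1
    return result


# One large partition and three tiny ones (hypothetical data).
partitions = [list(range(10)), [10], [11], [12]]
coalesced_sizes = [len(p) for p in coalesce(partitions, 2)]
repartitioned_sizes = [len(p) for p in repartition(partitions, 2)]
print(coalesced_sizes, repartitioned_sizes)
```

In this model the coalesced output keeps the skew of its input, while the repartitioned output is balanced at the cost of moving every record.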

Re: How to submit a job via REST API?

2020-11-24 Thread vaquar khan
Hi Yang, Please find following link https://stackoverflow.com/questions/63677736/spark-application-as-a-rest-service/63678337#63678337 Regards, Vaquar khan On Wed, Nov 25, 2020 at 12:40 AM Sonal Goyal wrote: > You should be able to supply the --conf and its values as part of appA

Re: Read text file row by row and apply conditions

2019-09-30 Thread vaquar khan
Hi Swetha, It would be great if you asked the same question on Stack Overflow; we have a very active community there that monitors every Spark question. If you ask the question on Stack Overflow, other people with similar problems also benefit. Regards, Vaquar khan On Sun, Sep 29, 2019, 10:26 PM swetha

Re: Read hdfs files in spark streaming

2019-06-09 Thread vaquar khan
Hi Deepak, You can use textFileStream. https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html Please start using Stack Overflow to ask questions, so that other people also get the benefit of the answers. Regards, Vaquar khan On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote: > I am using sp

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread vaquar khan
Sure let me check Jira Regards, Vaquar khan On Thu, Jun 21, 2018, 4:42 PM Takeshi Yamamuro wrote: > In this ticket SPARK-24201, the ambiguous statement in the doc had been > pointed out. > can you make pr for that? > > On Fri, Jun 22, 2018 at 6:17 AM, vaquar khan >

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread vaquar khan
sion (2.11.x). Regards, Vaquar khan On Thu, Jun 21, 2018 at 11:56 AM, chriswakare < chris.newski...@intellibridge.co> wrote: > Hi Rahul, > This will work only in Java 8. > Installation does not work with both version 9 and 10 > > Thanks, > Christopher > > > >

Re: G1GC vs ParallelGC

2018-06-20 Thread vaquar khan
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html Regards, Vaquar khan On Wed, Jun 20, 2018, 1:18 AM Aakash Basu wrote: > Hi guys, > > I just wanted to know, why my ParallelGC (*--conf > "spark.executor.extraJavaOptions=-

Re: load hbase data using spark

2018-06-20 Thread vaquar khan
Why do you need a tool? You can connect to HBase directly using Spark. Regards, Vaquar khan On Jun 18, 2018 4:37 PM, "Lian Jiang" wrote: Hi, I am considering tools to load hbase data using spark. One choice is https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to be o

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread vaquar khan
persist or any other logical separation in pipeline. Regards, Vaquar khan On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny wrote: > Hi Akash, > such errors might appear in large spark pipelines, the root cause is a > 64kb jvm limitation. > the reason that your job isn't failing at th

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread vaquar khan
Hi Akash, Please check stackoverflow. https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe Regards, Vaquar khan On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu wrote: > Hi guys, > > I'm getting an error when I'

Re: Not able to sort out environment settings to start spark from windows

2018-06-16 Thread vaquar khan
Please check your JAVA_HOME path. There may be a special character or a space in the path. Regards, Vaquar khan On Sat, Jun 16, 2018, 1:36 PM Raymond Xie wrote: > I am trying to run spark-shell in Windows but receive error of: > > \Java\jre1.8.0_151\bin\java was unexpected at this time. > &

Re: spark optimized pagination

2018-06-11 Thread vaquar khan
of records will be big delay in response. Regards, Vaquar khan On Mon, Jun 11, 2018, 2:59 AM Teemu Heikkilä wrote: > So you are now providing the data on-demand through spark? > > I suggest you change your API to query from cassandra and store the > results from Spark back there,

Re: Process large JSON file without causing OOM

2017-11-13 Thread vaquar khan
https://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory Regards, Vaquar khan On Mon, Nov 13, 2017 at 6:22 PM, Alec Swan <alecs...@gmail.com> wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format. Ef

Re: Use of Accumulators

2017-11-13 Thread vaquar khan
Confirmed, you can use Accumulators :) Regards, Vaquar khan On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit < kedarnath_di...@persistent.com> wrote: > Hi, > > > We need some way to toggle the flag of a variable in transformation. > > > We are thinking to make

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread vaquar khan
as an argument of textFile the path of the file in the worker filesystem. Regards, Vaquar khan On Fri, Sep 29, 2017 at 2:00 PM, JG Perrin <jper...@lumeris.com> wrote: > On a test system, you can also use something like > Owncloud/Nextcloud/Dropbox to insure that the files are synchro

Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-23 Thread vaquar khan
http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide Regards, Vaquar khan On Fri, Sep 22, 2017 at 4:41 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Thanks for the reply. Forgot to mention that, our Batch ETL Jobs are in > Core-Spark. > > >

Re: Apache Spark - MLLib challenges

2017-09-23 Thread vaquar khan
entered into maintenance mode. Regards, Vaquar khan On Sat, Sep 23, 2017 at 4:04 PM, Koert Kuipers <ko...@tresata.com> wrote: > our main challenge has been the lack of support for missing values > generally > > On Sat, Sep 23, 2017 at 3:41 AM, Irfan Kabli <irfan.kabli.

Re: Do we always need to go through spark-submit?

2017-08-30 Thread vaquar khan
RIVER_MEMORY, "2g") .launch(); spark.waitFor(); } } *Note :* a user application is launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy mo

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread vaquar khan
://ampcamp.berkeley.edu/6/exercises/time-series-tutorial-taxis.html Regards, Vaquar khan On Wed, Aug 30, 2017 at 1:21 PM, Irving Duran <irving.du...@gmail.com> wrote: > I think it will work. Might want to explore spark streams. > > > Thank You, > > Irving Duran > > On Wed, Au

Re: Spark 2.1.1 Error:java.lang.NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;

2017-07-17 Thread vaquar khan
You are getting this error because of a dependency mismatch. Regards, vaquar khan On Jul 17, 2017 3:50 AM, "zzcclp" <441586...@qq.com> wrote: Hi guys: I am using spark 2.1.1 to test on CDH 5.7.1, when i run on yarn with following command, error 'N

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vaquar khan
dashboards. In fact, you can apply Spark’s machine learning <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms on data streams. Regards, Vaquar khan On Sun, Jun 11, 2017 at 3:12 AM,

Re: Read Data From NFS

2017-06-11 Thread vaquar khan
for memory growth). A simple check that the file can be read would be: sc.textFile(file, numPartitions).count() You can find a good explanation here: https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs Regards, Vaquar khan On Jun 11, 2017 5:28 AM

Re: Is there a way to do conditional group by in spark 2.1.1?

2017-06-10 Thread vaquar khan
Avoid groupBy and use reduceByKey. Regards, Vaquar khan On Jun 4, 2017 8:32 AM, "Guy Cohen" <g...@gettaxi.com> wrote: > Try this one: > > df.groupBy( > when(expr("field1='foo'"),"field1").when(expr("field2='bar'"),"field2&quo
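
The advice above is usually given for the RDD API, where reduceByKey combines values inside each partition before the shuffle and groupByKey does not. The following is a plain-Python model of that difference, not Spark code; the function names, keys, and data are hypothetical.

```python
from collections import defaultdict


def group_then_sum(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition is sent onward.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("foo", 1), ("foo", 2), ("bar", 3)],
    [("foo", 4), ("bar", 5), ("bar", 6)],
]
grouped_totals, grouped_records = group_then_sum(partitions)
reduced_totals, reduced_records = reduce_by_key(partitions)
print(grouped_totals, grouped_records, reduced_records)
```

Both paths give the same totals; the model only differs in how many records are moved. For DataFrames, groupBy followed by a built-in aggregate already performs partial aggregation, so the advice matters mainly for RDD code.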

Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-10 Thread vaquar khan
://spark.apache.org/docs/1.1.0/submitting-applications.html Also try to avoid functions that need a lot of memory, such as collect. Regards, Vaquar khan On Jun 4, 2017 5:46 AM, "Abdulfattah Safa" <fattah.s...@gmail.com> wrote: I'm working on Spark with Standalone Cluster mode. I need to increase t

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-10 Thread vaquar khan
Hi, Please check your firewall security settings. Sharing one good link: http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1 Regards, Vaquar khan On Jun 8, 2017 1:53 AM, "Patrik Medvedev" <patrik.medve...@gmail.com> wrote: > Hello guy

Re: Scala, Python or Java for Spark programming

2017-06-10 Thread vaquar khan
It depends on programming style. I would suggest setting up a few rules to avoid complex code in Scala and, if needed, asking programmers to add proper comments. Regards, Vaquar khan On Jun 8, 2017 4:17 AM, "JB Data" <jbdat...@gmail.com> wrote: > Java is Object langage borned to D

Re: Read Data From NFS

2017-06-10 Thread vaquar khan
Hi Ayan, If you have multiple files (for example, 12 files) and you use the following code, you will get at least 12 partitions. r = sc.textFile("file:///my/file/*") I am not sure what you want to know about the file system; please check the API docs. Regards, Vaquar khan On Jun 8, 2017 10:44 AM,

Re: Spark 2.1 - Infering schema of dataframe after reading json files not during

2017-06-02 Thread vaquar khan
You can add a filter, or replace nulls with a value such as 0 or a string. df.na.fill(0, Seq("y")) Regards, Vaquar khan On Jun 2, 2017 11:25 AM, "Alonso Isidoro Roman" <alons...@gmail.com> wrote: not sure if this can help you, but you can infer programmatically the schema pr

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread vaquar khan
Hi, I found the following two links helpful and am sharing them with you. http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join http://spark.apache.org/docs/latest/configuration.html Regards, Vaquar khan On Wed, Mar 29, 2017 at 2:45 PM, Vidya Sujeet

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread vaquar khan
Please read the Spark documentation at least once before asking a question. http://spark.apache.org/docs/latest/streaming-programming-guide.html http://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/11/spark-streaming-datanami.png Regards, Vaquar khan On Fri, Mar 10, 2017

Re: Serialization error - sql UDF related

2017-02-17 Thread vaquar khan
/content/troubleshooting/javaionotserializableexception.html Regards, Vaquar khan On Fri, Feb 17, 2017 at 9:36 PM, Darshan Pandya <darshanpan...@gmail.com> wrote: > Hello, > > I am getting the famous serialization exception on running some code as > below, > > val

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread vaquar khan
Did you try MSCK REPAIR TABLE ? Regards, Vaquar Khan On Feb 6, 2017 11:21 AM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote: > I dont think so, i was able to insert overwrite other created tables in > hive using spark sql. The only problem I am facing

Re: Cannot read Hive Views in Spark SQL

2017-02-05 Thread vaquar khan
Hi Ashmath, Try refreshing the table: // spark is an existing SparkSession spark.catalog.refreshTable("my_table") http://spark.apache.org/docs/latest/sql-programming-guide.html#metadata-refreshing Regards, Vaquar khan On Sun, Feb 5, 2017 at 7:19 PM, KhajaAsmath Mohammed &l

Re: Time-Series Analysis with Spark

2017-01-11 Thread vaquar khan
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html Regards, Vaquar khan On Wed, Jan 11, 2017 at 10:07 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hello Rishabh, > We have done some forecasting, for time-series,

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread vaquar khan
Hi Deepak, Could you share Index information in your database. select * from indexInfo; Regards, Vaquar khan On Sat, Dec 17, 2016 at 2:45 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > How many workers are in the cluster? > > On Sat, Dec 17, 2016 at 12:23 PM Deepak S

Re: Do we really need mesos or yarn? or is standalone sufficent?

2016-12-16 Thread vaquar khan
Hi Kant, I hope the following information helps. 1) Cluster https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-standalone.html http://spark.apache.org/docs/latest/hardware-provisioning.html 2) Yarn vs Mesos https://www.linkedin.com/pulse/mesos-compare-yarn-vaquar-khan

Re: Issue: Skew on Dataframes while Joining the dataset

2016-12-16 Thread vaquar khan
For that kind of issue, the Spark UI and DAG visualization are always helpful. https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html Regards, Vaquar khan On Fri, Dec 16, 2016 at 11:10 AM, Vikas K. <vikas.re...@gmail.com> wrote: > Unsubscribe. &

Re: coalesce ending up very unbalanced - but why?

2016-12-16 Thread vaquar khan
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html Regards, vaquar khan On Wed, Dec 14, 2016 at 12:15 PM, Vaibhav Sinha <mail.vsi...@gmail.com> wrote: > Hi, > I see a similar behaviour in an exactly similar scenario at my deployment > as w

Re: How to get recent value in spark dataframe

2016-12-16 Thread vaquar khan
I am not sure about your logic for 0 and 1, but you can use orderBy to sort the data by time and take the first value. Regards, Vaquar khan On Wed, Dec 14, 2016 at 10:49 PM, Milin korath <milin.kor...@impelsys.com> wrote: > Hi > > I have a spark data frame with following structure >
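
The "order by time, then take the first value" idea can be sketched in plain Python as follows. This is a model of the logic only, not Spark code; the rows and the column names (id, time, value) are hypothetical, since the original data frame structure is cut off in the quoted message.

```python
def latest_row(rows, time_column):
    """Sort newest-first by the time column and return the first row."""
    ordered = sorted(rows, key=lambda row: row[time_column], reverse=True)
    return ordered[0]


# Hypothetical rows; timestamps in this format sort correctly as strings.
rows = [
    {"id": 1, "time": "2016-12-14 09:00", "value": 0},
    {"id": 1, "time": "2016-12-14 11:30", "value": 1},
    {"id": 1, "time": "2016-12-14 10:15", "value": 0},
]
latest = latest_row(rows, "time")
print(latest["time"], latest["value"])
```

On a real DataFrame the same idea would be a descending orderBy on the time column followed by taking the first row; a full sort is more work than strictly needed when only one row is wanted, so treat this as the simplest form rather than the most efficient one.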

Re: Query in SparkSQL

2016-12-12 Thread vaquar khan
Hi Neeraj, As far as I understand, Spark SQL doesn't support UPDATE statements. Why do you need an update command in Spark SQL? You can run the command in Hive. Regards, Vaquar khan On Mon, Dec 12, 2016 at 10:21 PM, Niraj Kumar <nku...@incedoinc.com> wrote: > Hi > > > > I am work

Re: Best practises around spark-scala

2016-08-08 Thread vaquar khan
I have found the following links good, as I am using them myself. http://spark.apache.org/docs/latest/tuning.html https://spark-summit.org/2014/testing-spark-best-practices/ Regards, Vaquar khan On 8 Aug 2016 10:11, "Deepak Sharma" <deepakmc...@gmail.com> wrote: > Hi All, > Can

Re: Spark Getting data from MongoDB in JAVA

2016-06-12 Thread vaquar khan
Hi Asfandyar, *NoSuchMethodError* in Java means you compiled against one version of the code and executed against a different version. Please make sure the dependency versions you add are compatible with the Java version you run on. regards, vaquar khan On Fri, Jun 10, 2016 at 4:50 AM, Asfandyar

Re: Questions about Spark Worker

2016-06-12 Thread vaquar khan
n “start-all.sh”, the Worker IP >> address become 127.0.0.1, and then I tried “ifconfig l0 down” and the >> Worker IP address become 127.0.1.1. >> >> What should I do to make IP use the IP address of the Ethernet instead of >> the address of the wireless? >> >> Thanks >> >> Jay >> >> >> >> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for >> Windows 10 >> >> >> > > -- Regards, Vaquar Khan +91 830-851-1500

Re: OutOfMemoryError - When saving Word2Vec

2016-06-12 Thread vaquar khan
Hi Sharad. The array size you (or the serializer) are trying to allocate is just too big for the JVM. You can also split your input further by increasing parallelism. The following is a good explanation: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit regards, Vaquar khan

Re: oozie and spark on yarn

2016-06-08 Thread vaquar khan
/client/src/main/resources/spark-action-0.1.xsd regards, Vaquar Khan On Wed, Jun 8, 2016 at 5:26 AM, karthi keyan <karthi93.san...@gmail.com> wrote: > Hi , > > Make sure you have oozie 4.2.0 and configured with either yarn / mesos > mode. > > Well, you just parse your scala

Re: Spark_Usecase

2016-06-07 Thread vaquar khan
and Spark Streaming or do an incremental select to make sure your Spark SQL tables stay up to date with your production databases Regards, Vaquar khan On 7 Jun 2016 10:29, "Deepak Sharma" <deepakmc...@gmail.com> wrote: I am not sure if Spark provides any support for incremental ext

Re: Spark Interview Questions

2015-07-29 Thread vaquar khan
Hi Abhishek, Please learn Spark; there are no shortcuts to success. Regards, Vaquar khan On 29 Jul 2015 11:32, Mishra, Abhishek abhishek.mis...@xerox.com wrote: Hello, Please help me with links or some document for Apache Spark interview questions and answers. Also for the tools related

Re: Java 8 vs Scala

2015-07-15 Thread vaquar khan
My choice is java 8 On 15 Jul 2015 18:03, Alan Burlison alan.burli...@oracle.com wrote: On 15/07/2015 08:31, Ignacio Blasco wrote: The main advantage of using scala vs java 8 is being able to use a console https://bugs.openjdk.java.net/browse/JDK-8043364 -- Alan Burlison --

Re: Research ideas using spark

2015-07-15 Thread vaquar khan
I would suggest studying Spark, Flink, and Storm, and preparing your research paper based on your understanding and findings. Maybe you will invent a new Spark ☺ Regards, Vaquar khan On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com wrote: Silly question… When thinking about a PhD thesis

Re: Spark Intro

2015-07-15 Thread vaquar khan
Totally agree with Hafsa: you need to identify your requirements and needs before choosing Spark. If you want to handle data with fast access, go to NoSQL (Mongo, Aerospike, etc.); if you need data analytics, then Spark is best. Regards, Vaquar khan On 14 Jul 2015 20:39, Hafsa Asif hafsa.a

Re: Eclipse on spark

2015-01-26 Thread vaquar khan
I am using SBT On 26 Jan 2015 15:54, Luke Wilson-Mawer lukewilsonma...@gmail.com wrote: I use this: http://scala-ide.org/ I also use Maven with this archetype: https://github.com/davidB/scala-archetype-simple. To be frank though, you should be fine using SBT. On Sat, Jan 24, 2015 at 6:33