Spark 3.0.0-preview and s3a

2019-12-12 Thread vincent gromakowski
Hi Spark users, I am testing the Spark 3 preview with s3a and Hadoop 3.2 but I get a NoClassDefFoundError and cannot find the cause. I suspect a library conflict. Can someone provide a working configuration? *Exception in thread "main" java.lang.NoSuchMethodError:

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
t seem to be CPU-bound on the RDS db (we're using Hive metastore, > backed by AWS RDS). > > So my original question remains; does spark need to know about all > existing partitions for dynamic overwrite? I don't see why it would. > > On Thu, Apr 25, 2019 at 10:12 AM vincent gromakowsk

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
Which metastore are you using? On Thu, Apr 25, 2019 at 09:02, Juho Autio wrote: > Would anyone be able to answer this question about the non-optimal > implementation of insertInto? > > On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote: > >> Hi, >> >> My job is writing ~10 partitions with

Re: How to update structured streaming apps gracefully

2018-12-18 Thread vincent gromakowski
ogramming-guide.html#recovery-semantics-after-changes-in-a-streaming-query > . > > On Tue, Dec 18, 2018 at 9:45 AM vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> Checkpointing is only used for failure recovery not for app upgrades. You >> need

Re: How to update structured streaming apps gracefully

2018-12-18 Thread vincent gromakowski
Checkpointing is only used for failure recovery, not for app upgrades. You need to manually code the unload/load and save it to a persistent store. On Tue, Dec 18, 2018 at 17:29, Priya Matpadi wrote: > Using checkpointing for graceful updates is my understanding as well, > based on the writeup

Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread vincent gromakowski
On the other hand, increasing parallelism with Kafka partitions avoids the shuffle Spark needs to repartition. On Wed, Nov 7, 2018 at 09:51, Michael Shtelma wrote: > If you configure too many Kafka partitions, you can run into memory issues. > This will increase memory requirements for spark job a

Re: External shuffle service on K8S

2018-10-26 Thread vincent gromakowski
No, it's on the roadmap for >2.4. On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote: > Hi all: > > Does Spark 2.3.2 support external shuffle service on Kubernetes? > > I have looked up the documentation( > https://spark.apache.org/docs/latest/running-on-kubernetes.html), but > couldn't find related

Re: Use SparkContext in Web Application

2018-10-04 Thread vincent gromakowski
Decoupling the web app from the Spark backend is recommended. Training the model can be launched in the background via a scheduling tool. Serving the model with Spark in interactive mode is not a good option, as it would run on unitary data while Spark is better at processing large datasets. The original

Re: Spark (Scala) Streaming [Convert rdd [org.bson.document] - > dataframe]

2018-07-18 Thread vincent gromakowski
Mongo libs provide a way to convert to a case class. On Wed 18 Jul 2018 at 20:23, Chethan wrote: > Hi Dev, > > I am streaming from mongoDB using Kafka with spark streaming, It returns > me document [org.bson.document]. I want to convert this rdd to a dataframe > to process with other data. > > Any
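The manual route the reply hints at is mapping each document into a case class and then calling `toDF` on the resulting RDD (the mongo-spark connector can also infer schemas itself). A minimal sketch of the mapping step, using a plain `Map[String, Any]` as a self-contained stand-in for `org.bson.Document`; the `User` case class and field names are illustrative.

```scala
// Stand-in sketch: converting document-like records into a typed case
// class, the step before rdd.map(toUser).toDF() in a real Spark job.
// A plain Map[String, Any] replaces org.bson.Document here so the
// sketch runs without the Mongo libs; field names are illustrative.
final case class User(name: String, age: Int)

def toUser(doc: Map[String, Any]): User =
  User(doc("name").toString, doc("age").toString.toInt)

val docs = Seq(
  Map[String, Any]("name" -> "ada", "age" -> 36),
  Map[String, Any]("name" -> "bob", "age" -> 41)
)
val users = docs.map(toUser)
// In Spark: import spark.implicits._ then rdd.map(toUser).toDF()
```

With a real `org.bson.Document` the accessors would be `doc.getString("name")` and `doc.getInteger("age")` instead of the map lookups.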

Re: Append In-Place to S3

2018-06-02 Thread vincent gromakowski
Structured Streaming can provide idempotent and exactly-once writes to Parquet, but I don't know how it does it under the hood. Without this you need to load your entire dataset, dedup, then write the entire dataset back. The overhead can be minimized by partitioning output files. On Fri, 1

Re: Advice on multiple streaming job

2018-05-06 Thread vincent gromakowski
Use a scheduler that abstracts the network away, with a CNI for instance or other mechanisms (Mesos, Kubernetes, YARN). A CNI allows binding to the same ports every time because each container has its own IP. Other solutions like Mesos with Marathon can work without a CNI, with host IP

Re: Performance of Spark when the compute and storage are separated

2018-04-15 Thread vincent gromakowski
imed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 14 April 2018 at 21:06, vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> Not with hadoop but with Cassandra, i have

Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread vincent gromakowski
Not with Hadoop but with Cassandra: I have seen 20x data-locality improvements on partitioned, optimized Spark jobs. On Sat, Apr 14, 2018 at 21:17, Mich Talebzadeh wrote: > Hi, > > This is a sort of your mileage varies type question. > > In a classic Hadoop cluster,

Re: Can spark shuffle leverage Alluxio to obtain higher stability?

2017-12-21 Thread vincent gromakowski
If it's not resilient at the Spark level, can't you just relaunch your job with your orchestration tool? On Dec 21, 2017 09:34, "Georg Heiler" wrote: > Did you try to use the YARN shuffle service? > chopinxb wrote on Thu, Dec 21, 2017 at 04:43: > >>

Re: Can spark shuffle leverage Alluxio to obtain higher stability?

2017-12-20 Thread vincent gromakowski
The probability of a complete node failure is low. I would rely on data lineage and accept the reprocessing overhead. Another option would be to write to a distributed FS, but it will drastically reduce the speed of all your jobs. On Dec 20, 2017 11:23, "chopinxb" wrote: > Yes,shuffle

Re: Can spark shuffle leverage Alluxio to obtain higher stability?

2017-12-20 Thread vincent gromakowski
In your case you need to externalize the shuffle files to a component outside of your Spark cluster so they persist after a Spark worker dies. https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service 2017-12-20 10:46 GMT+01:00 chopinxb

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread vincent gromakowski
What about chaining with Akka or Akka Streams and the fair scheduler? On Sep 13, 2017 01:51, "Sunita Arvind" wrote: Hi Michael, I am wondering what I am doing wrong. I get an error like: Exception in thread "main" java.lang.IllegalArgumentException: Schema must be

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vincent gromakowski
I think Kafka Streams is good when the processing of each row is independent of the others (row parsing, data cleaning...). Spark is better when processing groups of rows (group by, ML, window functions...). On Jun 11, 2017 8:15 PM, "yohann jardin" wrote: Hey, Kafka can

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 1 June 2017 at 08:55, vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> I don't recommend this kind of design because you loose phys

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
I don't recommend this kind of design because you lose physical data locality and you will be affected by "bad neighbors" that are also using the network storage... We have one similar design, but restricted to small clusters (more for experiments than production). 2017-06-01 9:47 GMT+02:00 Mich

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread vincent gromakowski
Akka was replaced by Netty in 1.6. On May 22, 2017 15:25, "Chin Wei Low" wrote: > I think akka has been removed since 2.0. > > On 22 May 2017 10:19 pm, "Gene Pang" wrote: > >> Hi, >> >> Tachyon has been renamed to Alluxio. Here is the

Re: Restful API Spark Application

2017-05-13 Thread vincent gromakowski
It's in Scala but it should be portable to Java: https://github.com/vgkowski/akka-spark-experiments On May 12, 2017 10:54 PM, "Василец Дмитрий" wrote: and livy https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/ On Fri, May 12, 2017 at 10:51 PM,

Re: Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread vincent gromakowski
Use cache or persist. The dataframe will be materialized when the first action is called and then reused from memory for each subsequent use. On May 1, 2017 4:51 PM, "Saulo Ricci" wrote: > Hi, > > > I have the following code that is reading a table into an apache spark >
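The cache/persist semantics described above (lazy marking, materialization on the first action, reuse afterward) can be sketched without Spark at all; a `lazy val` plays the role of a cached DataFrame here, and `loadTable` is a hypothetical stand-in for an expensive JDBC read.

```scala
// Spark-free sketch of cache() semantics: marking is lazy, the expensive
// compute runs once on the first "action", and later uses read the
// materialized result. loadTable stands in for spark.read.jdbc(...).
var loads = 0
def loadTable(): Seq[Int] = { loads += 1; Seq(1, 2, 3) } // expensive read

lazy val cached = loadTable()  // like df.cache(): nothing runs yet
val beforeAction = loads       // still 0: no action has been triggered
val first  = cached.sum        // first action materializes the data
val second = cached.sum        // reused from memory, no second load
```

In real Spark code the same pattern is `val df = spark.read.jdbc(...).cache()` followed by several actions on `df`, with the table read happening only once.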

Re: help/suggestions to setup spark cluster

2017-04-27 Thread vincent gromakowski
Spark standalone is not multi-tenant; you need one cluster per job. Maybe you can try fair scheduling and use one cluster, but I doubt it will be production-ready... On Apr 27, 2017 5:28 AM, "anna stax" wrote: > Thanks Cody, > > As I already mentioned I am running spark

spark streaming resiliency

2017-04-25 Thread vincent gromakowski
Hi, I have a question regarding Spark Streaming resiliency, and the documentation is ambiguous: it says the default configuration uses a replication factor of 2 for received data, but the recommendation is to use write-ahead logs to guarantee data resiliency with receivers.
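The two mechanisms the question contrasts can be summarized as a config sketch; the comments state what each option protects against, and the checkpoint path is a placeholder.

```
# Receiver resiliency options (sketch, not a complete config).
#
# Default: received blocks are replicated to 2 executors
#   (StorageLevel.MEMORY_AND_DISK_SER_2). This survives one executor
#   loss, but not the loss of both replicas before processing.
#
# Write-ahead log: every received block is also written to fault-tolerant
#   checkpoint storage before being acknowledged:
spark.streaming.receiver.writeAheadLog.enable  true

# The WAL requires a checkpoint directory on a fault-tolerant filesystem,
# set in code (path is a placeholder):
#   ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
```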

Re: Authorizations in thriftserver

2017-04-25 Thread vincent gromakowski
Does someone have the answer ? 2017-04-24 9:32 GMT+02:00 vincent gromakowski <vincent.gromakow...@gmail.com >: > Hi, > Can someone confirm authorizations aren't implemented in Spark > thriftserver for SQL standard based hive authorizations? > https://cwiki.apache.org/confluenc

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-24 Thread vincent gromakowski
; > *From: *Gene Pang <gene.p...@gmail.com> > *Date: *Monday, 24 April 2017 at 16.41 > *To: *vincent gromakowski <vincent.gromakow...@gmail.com> > *Cc: *Hemanth Gudela <hemanth.gud...@qvantel.com>, "user@spark.apache.org" > <user@spark.apache.org>, Felix Che

Authorizations in thriftserver

2017-04-24 Thread vincent gromakowski
Hi, Can someone confirm authorizations aren't implemented in Spark thriftserver for SQL standard based hive authorizations? https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization If confirmed, any plan to implement it ? Thanks

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-22 Thread vincent gromakowski
Look at Alluxio or Spark Jobserver for sharing across drivers. On Apr 22, 2017 10:24 AM, "Hemanth Gudela" wrote: > Thanks for your reply. > > > > Creating a table is an option, but such approach slows down reads & writes > for a real-time analytics streaming use

Re: Reading ASN.1 files in Spark

2017-04-06 Thread vincent gromakowski
I would also be interested... 2017-04-06 11:09 GMT+02:00 Hamza HACHANI : > Does any body have a spark code example where he is reading ASN.1 files ? > Thx > > Best regards > Hamza >

data cleaning and error routing

2017-03-21 Thread vincent gromakowski
Hi, In a context of ugly data, I am trying to find an efficient way to parse a Kafka stream of CSV lines into a clean data model and route the lines in error to a specific topic. Generally I do this: 1. First a map to split my lines on the separator character (";") 2. Then a filter where I put all
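The split-then-filter steps above can be collapsed into a single parse that returns `Either`, so valid records and error lines are separated in one pass and can then be routed to different topics. A minimal sketch, assuming a three-column model; the `Record` fields and error messages are illustrative.

```scala
// Parse-and-route sketch for ugly CSV data, assuming a simple
// three-column model (id;name;amount). All names are illustrative.
final case class Record(id: Int, name: String, amount: Double)

// Parse one line into Right(record) or Left(original line + reason),
// so good rows and errors can be sent to different Kafka topics.
def parseLine(line: String): Either[String, Record] = {
  val fields = line.split(";", -1)
  if (fields.length != 3) Left(s"$line => wrong field count")
  else
    try Right(Record(fields(0).trim.toInt, fields(1).trim, fields(2).trim.toDouble))
    catch { case _: NumberFormatException => Left(s"$line => number parse error") }
}

val lines = Seq("1;alice;10.5", "bad;row", "2;bob;oops")
// partition on isLeft: first element holds errors, second holds records
val (errors, records) = lines.map(parseLine).partition(_.isLeft)
```

In the streaming job the same `parseLine` would run inside the stream transformation, with the `Left` side written to the error topic.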

Re: Fast write datastore...

2017-03-15 Thread vincent gromakowski
Hi, if queries are static and filters are on the same columns, Cassandra is a good option. On Mar 15, 2017 7:04 AM, "muthu" wrote: Hello there, I have one or more parquet files to read and perform some aggregate queries using Spark Dataframe. I would like to find a

Re: Scaling Kafka Direct Streaming application

2017-03-15 Thread vincent gromakowski
You would probably need dynamic allocation, which is only available on YARN and Mesos. Or wait for the ongoing Spark-on-Kubernetes integration. On Mar 15, 2017 1:54 AM, "Pranav Shukla" wrote: > How to scale or possibly auto-scale a spark streaming application > consuming from

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
I forgot to mention it also depends on the Spark Kafka connector you use. If it's receiver-based, I recommend a dedicated ZooKeeper cluster because it is used to store offsets. If it's receiverless, ZooKeeper can be shared. 2017-03-03 9:29 GMT+01:00 Jörn Franke : > I think

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
Hi, Depending on the Kafka version (< 0.8.2 I think), offsets are managed in Zookeeper, and if you have lots of consumers it's recommended to use a dedicated zookeeper cluster (always with dedicated disks; even SSD is better). On newer versions offsets are managed in special Kafka topics and

Re: is dataframe thread safe?

2017-02-15 Thread vincent gromakowski
com>: > updating dataframe returns NEW dataframe like RDD please? > > ---Original--- > *From:* "vincent gromakowski"<vincent.gromakow...@gmail.com> > *Date:* 2017/2/14 01:15:35 > *To:* "Reynold Xin"<r...@databricks.com>; > *Cc:* "

Re: is dataframe thread safe?

2017-02-13 Thread vincent gromakowski
What about having a thread that updates and caches a dataframe in memory while other threads request this dataframe: is it thread safe? 2017-02-13 9:02 GMT+01:00 Reynold Xin : > Yes your use case should be fine. Multiple threads can transform the same > data frame in

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread vincent gromakowski
Spark Jobserver or Livy server are the best options for a pure technical API. If you want to publish a business API you will probably have to build your own app, like the one I wrote a year ago: https://github.com/elppc/akka-spark-experiments It combines Akka actors and a shared Spark context to serve

RE: filters Pushdown

2017-02-02 Thread vincent gromakowski
écrit : > Hi Vincent, > > Thank you for answer. (I don’t see your answer in mailing list, so I’m > answering directly) > > > > What connectors can I work with from Spark? > > Can you provide any link to read about it because I see nothing in Spark >

Re: filters Pushdown

2017-02-02 Thread vincent gromakowski
Pushdowns depend on the source connector: join pushdown with Cassandra only; filter pushdown with nearly all sources, with some source-specific constraints. On Feb 2, 2017 10:42 AM, "Peter Sg" wrote: > Can community help me to figure out some details about Spark: > - Does

Re: Having multiple spark context

2017-01-29 Thread vincent gromakowski
A clustering lib is necessary to manage multiple JVMs, Akka Cluster for instance. On Jan 30, 2017 8:01 AM, "Rohit Verma" wrote: > Hi, > > If I am right, you need to launch other context from another jvm. If you > are trying to launch from same jvm another context it

Re: spark locality

2017-01-14 Thread vincent gromakowski
1-12 21:13 GMT+01:00 Michael Gummelt <mgumm...@mesosphere.io>: >> >>> The code in there w/ docs that reference CNI doesn't actually run when >>> CNI is in effect, and doesn't have anything to do with locality. It's just >>> making Spark work in a no-DNS environmen

Re: spark locality

2017-01-12 Thread vincent gromakowski
ually run when CNI > is in effect, and doesn't have anything to do with locality. It's just > making Spark work in a no-DNS environment > > On Thu, Jan 12, 2017 at 12:04 PM, vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> I have found this but I

Re: spark locality

2017-01-12 Thread vincent gromakowski
sn't great right now anyway. Executors are > placed w/o regard to locality. Locality is only taken into account when > tasks are assigned to executors. So if you get a locality-poor executor > placement, you'll also have locality poor task placement. It could be > better. >

spark locality

2017-01-12 Thread vincent gromakowski
Hi all, Does anyone have experience running Spark on Mesos with CNI (IP per container)? How would Spark use the IP or hostname for data locality with backend frameworks like HDFS or Cassandra? V

Re: Question about Spark and filesystems

2016-12-18 Thread vincent gromakowski
I am using Gluster and I get decent performance with basic maintenance effort. An advantage of Gluster: you can plug Alluxio on top to improve performance, but I still need to validate that... On Dec 18, 2016 8:50 PM, wrote: > Hello, > > We are trying out Spark for some file processing

Re: Is there synchronous way to predict against model for real time data

2016-12-15 Thread vincent gromakowski
Something like that? https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/ On Dec 16, 2016 1:08 AM, "suyogchoudh...@gmail.com" < suyogchoudh...@gmail.com> wrote: > Hi, > > I have a question about how I can make real-time decisions using a model I > have

Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
What about ephemeral storage on SSD? If performance is required it's generally for production, so the cluster would never be stopped. A Spark job to backup/restore to S3 then allows shutting the cluster down completely. On Dec 3, 2016 1:28 PM, "David Mitchell" wrote:

Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
You get more latency on reads, so overall execution time is longer. On Dec 3, 2016 7:39 AM, "kant kodali" wrote: > > I wonder what benefits I really get if I colocate my spark worker > process and Cassandra server process on each node? > > I understand the concept of

Re: Do I have to wrap akka around spark streaming app?

2016-11-29 Thread vincent gromakowski
rote: > >> In this case, persisting to Cassandra is for future analytics and >> Visualization. >> >> I want to notify that the app of the event, so it makes the app >> interactive. >> >> Thanks >> >> On Mon, Nov 28, 2016 at 2:24 PM, vincent gromakowski &

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread vincent gromakowski
So not sure what you mean by kafka offsets will do the job, how will the > akka consumer know the kafka offset? > > On Mon, Nov 28, 2016 at 12:52 PM, vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > >> You don't need actors to do kafka=>spark processing=>

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread vincent gromakowski
You don't need actors to do kafka=>spark processing=>kafka. Why do you need to notify the akka producer? If you need to get back the processed message in your producer, then implement an akka consumer in your akka app and kafka offsets will do the job. 2016-11-28 21:46 GMT+01:00 shyla deshpande

Re: Possible DR solution

2016-11-12 Thread vincent gromakowski
An HDFS tiering policy with good tags should be similar. On Nov 11, 2016 11:19 PM, "Mich Talebzadeh" wrote: > I really don't see why one wants to set up streaming replication unless > for situations where similar functionality to transactional databases is > required

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread vincent gromakowski
I have already integrated common actors. I am also interested, especially in seeing how we can achieve end-to-end back pressure. 2016-11-10 8:46 GMT+01:00 shyla deshpande : > I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, > Spark Streaming and

Re: A Spark long running program as web server

2016-11-06 Thread vincent gromakowski
Hi, Spark Jobserver seems more mature than Livy, but both would work I think. You just get more functionality with the jobserver, except impersonation, which is only in Livy. If you need to publish a business API I would recommend using Akka HTTP with Spark actors sharing a preloaded

Re: How to avoid unnecessary spark starkups on every request?

2016-11-02 Thread vincent gromakowski
Hi, I am currently using Akka HTTP sending requests to multiple Spark actors that use a preloaded Spark context and the fair scheduler. It's only a prototype and I haven't tested the concurrency, but it seems one of the right ways to do it. Complete processing time is around 600 ms. The other way would be

Re: Sharing RDDS across applications and users

2016-10-28 Thread vincent gromakowski
Bad idea: no caching, cluster over-consumption... Have a look at instantiating a custom thriftserver on temp tables with the fair scheduler to allow concurrent SQL requests. It's not a public API but you can find some examples. On Oct 28, 2016 11:12 AM, "Mich Talebzadeh"

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Hi, just point all users at the same app with a common Spark context. For instance, Akka HTTP receives queries from users and launches concurrent Spark SQL queries in different actor threads. The only prerequisite is to launch the different jobs in different threads (as with actors). Be careful, it's
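The "one thread per concurrent query" pattern above can be sketched with plain `Future`s. In a real app each future body would call `spark.sql(...)` on a shared `SparkSession` started with `spark.scheduler.mode=FAIR`; here a plain function stands in for the query so the sketch is self-contained.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch of launching concurrent jobs from separate threads, which is
// what lets Spark's FAIR scheduler interleave them. runQuery is a
// stand-in for spark.sql(...) against a shared SparkSession.
def runQuery(name: String): String = s"result-of-$name"

val queries = Seq("q1", "q2", "q3")
// Each Future runs on its own thread from the execution context.
val futures = queries.map(q => Future(runQuery(q)))
val results = Await.result(Future.sequence(futures), 10.seconds)
```

`Future.sequence` preserves the order of the input futures, so results line up with the submitted queries even though they ran concurrently.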

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
t; loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 27 Octob

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
ead safe RDD sharing between spark jobs ==> these are best for sharing between users 2016-10-27 12:59 GMT+02:00 vincent gromakowski < vincent.gromakow...@gmail.com>: > I would prefer sharing the spark context and using FAIR scheduler for > user concurrency > > Le 27 oct. 2016 12

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
om > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetar

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Ignite works only with Spark 1.5. Ignite leverages indexes. Alluxio provides tiering. Alluxio easily integrates with the underlying FS. On Oct 27, 2016 12:39 PM, "Mich Talebzadeh" wrote: > Thanks Chanh, > > Can it share RDDs. > > Personally I have not used either Alluxio or

Re: Microbatches length

2016-10-20 Thread vincent gromakowski
You can still implement your own logic, with Akka actors for instance. Based on some threshold, the actor can launch a Spark batch job using the same Spark context... It's only an idea, no real experience. On Oct 20, 2016 1:31 PM, "Paulo Candido" wrote: > In this case I

Re: mllib model in production web API

2016-10-18 Thread vincent gromakowski
Hi, did you try applying the model with Akka instead of Spark? https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/ On Oct 18, 2016 5:58 AM, "Aseem Bansal" wrote: > @Nicolas > > No, ours is different. We required predictions within

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
> Ben > > > > On Oct 17, 2016, at 10:14 AM, vincent gromakowski < > vincent.gromakow...@gmail.com> wrote: > > I would suggest to code your own Spark thriftserver which seems to be very > easy. > http://stackoverflow.com/questions/27108863/accessing- > spark-sql

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
I would suggest coding your own Spark thriftserver, which seems to be very easy: http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then

spark on mesos memory sizing with offheap

2016-10-13 Thread vincent gromakowski
Hi, I am trying to understand how Mesos allocates memory when offheap is enabled, but it seems the framework only takes the heap + 400 MB overhead into consideration for resource allocation. Example: spark.executor.memory=3g spark.memory.offheap.size=1g ==> mesos report 3.4g allocated for
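The arithmetic behind the observation can be sketched as follows. Spark's default executor overhead is max(10% of heap, 384 MB), so the "~400 MB" in the message is that default; with a 3 GB heap the reported 3.4 GB matches heap + overhead with offheap ignored. (Note the current property spelling is `spark.memory.offHeap.size`; the helper names below are illustrative.)

```scala
// Sizing arithmetic sketch. Spark's default executor memory overhead
// is max(10% of heap, 384 MB).
def defaultOverheadMb(heapMb: Int): Int = math.max(heapMb / 10, 384)

// What the thread reports Mesos allocating: heap + overhead only.
def observedMesosMb(heapMb: Int): Int = heapMb + defaultOverheadMb(heapMb)

// What the executor can actually consume once offheap is enabled.
def realFootprintMb(heapMb: Int, offheapMb: Int): Int =
  heapMb + defaultOverheadMb(heapMb) + offheapMb

val heap = 3072  // spark.executor.memory=3g
val off  = 1024  // spark.memory.offHeap.size=1g
val reported = observedMesosMb(heap)        // 3456 MB, i.e. the ~3.4g seen
val actual   = realFootprintMb(heap, off)   // 1 GB more than Mesos reserved
```

The gap between `reported` and `actual` is exactly the offheap size, which is what makes the allocation look wrong from the Mesos side.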