Hi Spark users,
I am testing the Spark 3 preview with s3a and Hadoop 3.2, but I get a
NoClassDefFoundError and cannot find the cause. I suspect there is
some library conflict. Can someone provide a working configuration?
Exception in thread "main" java.lang.NoSuchMethodError:
t seem to be CPU-bound on the RDS db (we're using Hive metastore,
> backed by AWS RDS).
>
> So my original question remains; does spark need to know about all
> existing partitions for dynamic overwrite? I don't see why it would.
>
> On Thu, Apr 25, 2019 at 10:12 AM vincent gromakowsk
Which metastore are you using?
On Thu, Apr 25, 2019 at 09:02, Juho Autio wrote:
> Would anyone be able to answer this question about the non-optimal
> implementation of insertInto?
>
> On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote:
>
>> Hi,
>>
>> My job is writing ~10 partitions with
ogramming-guide.html#recovery-semantics-after-changes-in-a-streaming-query
> .
>
> On Tue, Dec 18, 2018 at 9:45 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
Checkpointing is only used for failure recovery, not for app upgrades. You
need to manually code the unload/load and save it to a persistent store.
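A minimal sketch of that pattern, assuming the job's state can be expressed
as a table; the paths and the table name are hypothetical:

  import org.apache.spark.sql.SparkSession

  object StateHandover {
    // on shutdown of the old app version: dump derived state to durable storage
    def save(spark: SparkSession): Unit =
      spark.table("session_aggregates")
        .write.mode("overwrite")
        .parquet("hdfs:///handover/session_aggregates")

    // on startup of the new app version: reload it instead of the old checkpoint
    def load(spark: SparkSession): Unit =
      spark.read.parquet("hdfs:///handover/session_aggregates")
        .createOrReplaceTempView("session_aggregates")
  }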
On Tue, Dec 18, 2018 at 17:29, Priya Matpadi wrote:
> Using checkpointing for graceful updates is my understanding as well,
> based on the writeup
On the other hand, increasing parallelism with Kafka partitions avoids the
shuffle in Spark to repartition.
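For instance, with the receiverless direct connector (a sketch; the broker,
group id, and topic are hypothetical, and ssc is an existing StreamingContext),
each Kafka partition becomes one Spark partition, so sizing the topic sets the
job's parallelism without a repartition():

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "my-group")

  // one Spark partition per Kafka partition, no shuffle needed
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))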
On Wed, Nov 7, 2018 at 09:51, Michael Shtelma wrote:
> If you configure too many Kafka partitions, you can run into memory issues.
> This will increase memory requirements for the Spark job a
No, it's on the roadmap for >2.4.
On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:
> Hi all:
>
> Does Spark 2.3.2 support the external shuffle service on Kubernetes?
>
> I have looked up the documentation(
> https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
> couldn't find related
Decoupling the web app from the Spark backend is recommended. Training the
model can be launched in the background via a scheduling tool. Inferring
with the model in Spark in interactive mode is not a good option, as it
will run on unitary data and Spark is better at using large datasets. The
original
Mongo libs provide a way to convert to case class
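With the official MongoDB Spark connector, something along these lines should
work (a sketch; the case class fields are hypothetical and must match your
documents, and spark.mongodb.input.uri is assumed to be configured):

  import com.mongodb.spark.MongoSpark
  import org.apache.spark.sql.SparkSession

  case class Event(_id: String, value: Double)

  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  // load infers a schema from the collection; .as[Event] maps rows to the case class
  val events = MongoSpark.load(spark).as[Event]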
On Wed 18 Jul 2018 at 20:23, Chethan wrote:
> Hi Dev,
>
> I am streaming from MongoDB using Kafka with Spark Streaming; it returns
> me documents [org.bson.Document]. I want to convert this RDD to a DataFrame
> to process with other data.
>
> Any
Structured streaming can provide idempotent and exactly-once writes in
Parquet, but I don't know how it does it under the hood.
Without this you need to load your whole dataset, then dedup, then write
back the entire dataset. This overhead can be minimized by partitioning the
output files.
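A minimal sketch of that load/dedup/write-back pattern, restricted to the
partitions a new batch touches (assumes an existing SparkSession named spark
and an incoming DataFrame newBatch; the paths, the day partition column, and
the dedup key are hypothetical):

  import spark.implicits._

  val affectedDays = newBatch.select($"day").distinct().as[String].collect()

  spark.read.parquet("hdfs:///data/events")        // existing data
    .where($"day".isin(affectedDays: _*))          // only the touched partitions
    .union(newBatch)
    .dropDuplicates("event_id")                    // the dedup key
    .write.mode("overwrite")
    .partitionBy("day")
    .parquet("hdfs:///data/events_deduped")        // write to a separate location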
On Fri, 1
Use a scheduler that abstracts the network away, with a CNI for instance or
other mechanisms (Mesos, Kubernetes, YARN). The CNI will allow binding to
the same ports every time because each container will have its own IP. Some
other solutions like Mesos and Marathon can work without CNI, with host IP
Not with Hadoop, but with Cassandra I have seen 20x data locality
improvement on partitioned, optimized Spark jobs.
On Sat, Apr 14, 2018 at 21:17, Mich Talebzadeh wrote:
> Hi,
>
> This is a sort of "your mileage may vary" type question.
>
> In a classic Hadoop cluster,
If it's not resilient at the Spark level, can't you just relaunch your job
with your orchestration tool?
On Dec 21, 2017 09:34, "Georg Heiler" wrote:
> Did you try to use the YARN shuffle service?
> chopinxb wrote on Thu, Dec 21, 2017 at 04:43:
>
>>
The probability of a complete node failure is low. I would rely on data
lineage and accept the reprocessing overhead. Another option would be to
write to a distributed FS, but it will drastically reduce the speed of all
your jobs.
On Dec 20, 2017 11:23, "chopinxb" wrote:
> Yes,shuffle
In your case you need to externalize the shuffle files to a component
outside of your Spark cluster to make them persist after Spark worker
death.
https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
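The gist of the configuration from that page (a sketch; the NodeManagers need
a restart after adding the aux service):

  # yarn-site.xml, on each NodeManager:
  #   yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
  #   yarn.nodemanager.aux-services.spark_shuffle.class =
  #     org.apache.spark.network.yarn.YarnShuffleService
  # spark-defaults.conf:
  spark.shuffle.service.enabled   true
  spark.dynamicAllocation.enabled true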
2017-12-20 10:46 GMT+01:00 chopinxb
What about chaining with Akka or Akka Streams and the fair scheduler?
On Sep 13, 2017 01:51, "Sunita Arvind" wrote:
Hi Michael,
I am wondering what I am doing wrong. I get an error like:
Exception in thread "main" java.lang.IllegalArgumentException: Schema must
be
I think Kafka Streams is good when the processing of each row is
independent of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window
functions...).
On Jun 11, 2017 8:15 PM, "yohann jardin" wrote:
Hey,
Kafka can
I don't recommend this kind of design because you lose physical data
locality and you will be affected by "bad neighbors" that are also using
the network storage... We have one similar design but restricted to small
clusters (more for experiments than production).
2017-06-01 9:47 GMT+02:00 Mich
Akka was replaced by Netty in 1.6.
On May 22, 2017 15:25, "Chin Wei Low" wrote:
> I think Akka has been removed since 2.0.
>
> On 22 May 2017 10:19 pm, "Gene Pang" wrote:
>
>> Hi,
>>
>> Tachyon has been renamed to Alluxio. Here is the
It's in Scala but it should be portable to Java:
https://github.com/vgkowski/akka-spark-experiments
On May 12, 2017 10:54 PM, "Василец Дмитрий" wrote:
and Livy https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
On Fri, May 12, 2017 at 10:51 PM,
Use cache or persist. The DataFrame will be materialized when the first
action is called and then reused from memory for each following usage.
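A minimal illustration (assumes an existing SparkSession named spark; the
table and column names are hypothetical):

  import spark.implicits._

  val df = spark.table("events").cache()        // lazy: nothing is computed yet

  df.count()                                    // 1st action materializes the cache
  df.filter($"status" === "error").count()      // served from memory, no re-read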
On May 1, 2017 4:51 PM, "Saulo Ricci" wrote:
> Hi,
>
>
> I have the following code that is reading a table to an Apache Spark
>
Spark standalone is not multi-tenant; you need one cluster per job. Maybe
you can try fair scheduling and use one cluster, but I doubt it will be
production ready...
On Apr 27, 2017 5:28 AM, "anna stax" wrote:
> Thanks Cody,
>
> As I already mentioned I am running spark
Hi,
I have a question regarding Spark Streaming resiliency, and the
documentation is ambiguous:
the documentation says that the default configuration uses a replication
factor of 2 for received data, but the recommendation is to use write ahead
logs to guarantee data resiliency with receivers.
Does someone have the answer?
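For reference, the receiver WAL is a single setting plus a reliable
checkpoint directory (a sketch; the HDFS path is hypothetical):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/my-app")  // the WAL lives under this directory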
2017-04-24 9:32 GMT+02:00 vincent gromakowski <vincent.gromakow...@gmail.com>:
> From: Gene Pang <gene.p...@gmail.com>
> Date: Monday, 24 April 2017 at 16.41
> To: vincent gromakowski <vincent.gromakow...@gmail.com>
> Cc: Hemanth Gudela <hemanth.gud...@qvantel.com>, "user@spark.apache.org"
> <user@spark.apache.org>, Felix Che
Hi,
Can someone confirm that authorizations aren't implemented in the Spark
thriftserver for SQL standard based Hive authorization?
https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
If confirmed, is there any plan to implement it?
Thanks
Look at Alluxio for sharing across drivers, or spark-jobserver.
On Apr 22, 2017 10:24 AM, "Hemanth Gudela" wrote:
> Thanks for your reply.
>
>
>
> Creating a table is an option, but such approach slows down reads & writes
> for a real-time analytics streaming use
I would also be interested...
2017-04-06 11:09 GMT+02:00 Hamza HACHANI :
> Does anybody have a Spark code example where he is reading ASN.1 files?
> Thx
>
> Best regards
> Hamza
>
Hi,
In a context of ugly data, I am trying to find an efficient way to parse a
Kafka stream of CSV lines into a clean data model and route lines in error
to a specific topic.
Generally I do this:
1. First a map to split my lines on the separator character (";")
2. Then a filter where I put all
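A minimal sketch of steps 1 and 2 above (the Record shape and the input
DStream are hypothetical):

  case class Record(id: String, value: String)

  // parse one CSV line; keep the raw line on the error side for the error topic
  def parse(line: String): Either[String, Record] =
    line.split(";", -1) match {
      case Array(id, value) if id.nonEmpty => Right(Record(id, value))
      case _                               => Left(line)
    }

  val parsed = stream.map(parse)                // stream: DStream[String]
  val good = parsed.flatMap(_.right.toOption)   // clean data model
  val bad  = parsed.flatMap(_.left.toOption)    // route to the error topic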
Hi,
If queries are static and filters are on the same columns, Cassandra is a
good option.
On Mar 15, 2017 7:04 AM, "muthu" wrote:
Hello there,
I have one or more Parquet files to read and perform some aggregate queries
using Spark DataFrames. I would like to find a
You would probably need dynamic allocation, which is only available on YARN
and Mesos. Or wait for the ongoing Spark-on-Kubernetes integration.
On Mar 15, 2017 1:54 AM, "Pranav Shukla" wrote:
> How to scale or possibly auto-scale a spark streaming application
> consuming from
I forgot to mention that it also depends on the Spark-Kafka connector you use.
If it's receiver-based, I recommend a dedicated ZooKeeper cluster because
it is used to store offsets. If it's receiverless, ZooKeeper can be shared.
2017-03-03 9:29 GMT+01:00 Jörn Franke :
> I think
Hi,
Depending on the Kafka version (< 0.8.2 I think), offsets are managed in
ZooKeeper, and if you have lots of consumers it's recommended to use a
dedicated ZooKeeper cluster (always with dedicated disks; SSDs are even
better). On newer versions offsets are managed in special Kafka topics and
com>:
> updating a dataframe returns a NEW dataframe, like an RDD, please?
>
> ---Original---
> From: "vincent gromakowski" <vincent.gromakow...@gmail.com>
> Date: 2017/2/14 01:15:35
> To: "Reynold Xin" <r...@databricks.com>;
> Cc: "
How about having a thread that updates and caches a DataFrame in memory
next to other threads requesting this DataFrame: is it thread-safe?
2017-02-13 9:02 GMT+01:00 Reynold Xin :
> Yes your use case should be fine. Multiple threads can transform the same
> data frame in
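A minimal sketch of several threads submitting independent actions against
the same cached DataFrame (assumes an existing SparkSession named spark;
table and column names are hypothetical):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._
  import spark.implicits._

  val df = spark.table("events").cache()

  // each Future submits its own Spark job against the shared DataFrame
  val jobs = Seq("a", "b", "c").map { tag =>
    Future { df.filter($"tag" === tag).count() }
  }
  jobs.foreach(Await.result(_, 10.minutes))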
Spark jobserver or the Livy server are the best options for a pure technical
API. If you want to publish a business API you will probably have to build
your own app, like the one I wrote a year ago:
https://github.com/elppc/akka-spark-experiments
It combines Akka actors and a shared Spark context to serve
wrote:
> Hi Vincent,
>
> Thank you for your answer. (I don't see your answer in the mailing list, so
> I'm answering directly)
>
>
>
> What connectors can I work with from Spark?
>
> Can you provide any link to read about it because I see nothing in Spark
>
Pushdowns depend on the source connector.
Join pushdown: with Cassandra only.
Filter pushdown: with mainly all sources, with some source-specific
constraints.
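One way to see what a given connector pushed down is the physical plan (a
sketch with a hypothetical Parquet path; assumes an existing SparkSession
named spark; pushed predicates show up as PushedFilters):

  import spark.implicits._

  val df = spark.read.parquet("hdfs:///data/events")
  df.filter($"day" === "2017-02-01").explain()
  // look for "PushedFilters: [IsNotNull(day), EqualTo(day,2017-02-01)]" in the output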
On Feb 2, 2017 10:42 AM, "Peter Sg" wrote:
> Can community help me to figure out some details about Spark:
> - Does
A clustering lib is necessary to manage multiple JVMs: Akka Cluster, for
instance.
On Jan 30, 2017 8:01 AM, "Rohit Verma" wrote:
> Hi,
>
> If I am right, you need to launch the other context from another JVM. If you
> are trying to launch another context from the same JVM it
1-12 21:13 GMT+01:00 Michael Gummelt <mgumm...@mesosphere.io>:
>>
>>> The code in there w/ docs that reference CNI doesn't actually run when
>>> CNI is in effect, and doesn't have anything to do with locality. It's just
>>> making Spark work in a no-DNS environmen
>
> On Thu, Jan 12, 2017 at 12:04 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> I have found this but I
sn't great right now anyway. Executors are
> placed w/o regard to locality. Locality is only taken into account when
> tasks are assigned to executors. So if you get a locality-poor executor
> placement, you'll also have locality poor task placement. It could be
> better.
>
Hi all,
Does anyone have experience running Spark on Mesos with CNI (IP per
container)?
How would Spark use the IP or hostname for data locality with a backend
framework like HDFS or Cassandra?
V
I am using Gluster and I have decent performance with basic maintenance
effort. Advantage of Gluster: you can plug Alluxio on top to improve
performance, but I still need to validate that...
On Dec 18, 2016 8:50 PM, wrote:
> Hello,
>
> We are trying out Spark for some file processing
Something like that ?
https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/
On Dec 16, 2016 1:08 AM, "suyogchoudh...@gmail.com" <suyogchoudh...@gmail.com> wrote:
> Hi,
>
> I have a question about how I can make real-time decisions using a model I
> have
What about ephemeral storage on SSD? If performance is required it's
generally for production, so the cluster would never be stopped. A Spark
job to backup/restore on S3 then allows shutting down the cluster
completely.
On Dec 3, 2016 1:28 PM, "David Mitchell" wrote:
You get more latency on reads, so overall execution time is longer.
On Dec 3, 2016 7:39 AM, "kant kodali" wrote:
>
> I wonder what benefits I really get if I colocate my Spark worker
> process and Cassandra server process on each node?
>
> I understand the concept of
wrote:
>
>> In this case, persisting to Cassandra is for future analytics and
>> Visualization.
>>
>> I want to notify that the app of the event, so it makes the app
>> interactive.
>>
>> Thanks
>>
>> On Mon, Nov 28, 2016 at 2:24 PM, vincent gromakowski &
> So I'm not sure what you mean by "Kafka offsets will do the job"; how will
> the akka consumer know the kafka offset?
>
> On Mon, Nov 28, 2016 at 12:52 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
You don't need actors to do kafka => spark processing => kafka.
Why do you need to notify the Akka producer? If you need to get back the
processed message in your producer, then implement an Akka consumer in
your Akka app and Kafka offsets will do the job.
2016-11-28 21:46 GMT+01:00 shyla deshpande
An HDFS tiering policy with good tags should be similar.
On Nov 11, 2016 11:19 PM, "Mich Talebzadeh" wrote:
> I really don't see why one wants to set up streaming replication unless
> for situations where similar functionality to transactional databases is
> required
I have already integrated common actors. I am also interested, especially to
see how we can achieve end-to-end backpressure.
2016-11-10 8:46 GMT+01:00 shyla deshpande :
> I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka,
> Spark Streaming and
Hi,
Spark jobserver seems to be more mature than Livy, but both would work I
think. You will just get more functionality with the jobserver, except for
impersonation, which is only in Livy.
If you need to publish a business API I would recommend using Akka HTTP with
Spark actors sharing a preloaded
Hi,
I am currently using Akka HTTP to send requests to multiple Spark actors
that use a preloaded Spark context and the fair scheduler. It's only a
prototype and I haven't tested the concurrency, but it seems one of the
right ways to do it. Complete processing time is around 600 ms. The other
way would be
Bad idea: no caching, cluster overconsumption...
Have a look at instantiating a custom thriftserver on temp tables with the
fair scheduler to allow concurrent SQL requests. It's not a public API but
you can find some examples.
On Oct 28, 2016 11:12 AM, "Mich Talebzadeh"
Hi,
Just point all users at the same app with a common Spark context.
For instance, Akka HTTP receives queries from users and launches concurrent
Spark SQL queries in different actor threads. The only prerequisite is to
launch the different jobs in different threads (like with actors).
Be careful, it's
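A minimal sketch of that pattern (assumes spark.scheduler.mode=FAIR is set
and an existing SparkSession named spark; pool names and queries are
hypothetical):

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  // setLocalProperty is per-thread, so each concurrent query can get its own pool
  def runAs(pool: String, query: String) = Future {
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    spark.sql(query).collect()
  }

  runAs("alice", "SELECT count(*) FROM events")
  runAs("bob", "SELECT max(ts) FROM events")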
> On 27 Octob
ead safe RDD sharing between spark
jobs
==> these are best for sharing between users
2016-10-27 12:59 GMT+02:00 vincent gromakowski <vincent.gromakow...@gmail.com>:
> I would prefer sharing the spark context and using FAIR scheduler for
> user concurrency
>
> On Oct 27, 2016 12
Ignite works only with Spark 1.5.
Ignite leverages indexes.
Alluxio provides tiering.
Alluxio easily integrates with the underlying FS.
On Oct 27, 2016 12:39 PM, "Mich Talebzadeh" wrote:
> Thanks Chanh,
>
> Can it share RDDs?
>
> Personally I have not used either Alluxio or
You can still implement your own logic, with Akka actors for instance. Based
on some threshold the actor can launch a Spark batch job using the same
Spark context... It's only an idea, no real experience.
On Oct 20, 2016 1:31 PM, "Paulo Candido" wrote:
> In this case I
Hi,
Did you try applying the model with Akka instead of Spark?
https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/
On Oct 18, 2016 5:58 AM, "Aseem Bansal" wrote:
> @Nicolas
>
> No, ours is different. We required predictions within
> Ben
>
>
>
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
I would suggest coding your own Spark thriftserver, which seems to be very
easy:
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
I am starting to test it. The big advantage is that you can implement any
logic, because it's a Spark job, and then
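The gist of the approach from that answer (a sketch; assumes an existing
SparkSession named spark, and the source path is hypothetical):

  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val df = spark.read.parquet("hdfs:///data/events")    // any custom logic/caching here
  df.createOrReplaceTempView("events")
  HiveThriftServer2.startWithContext(spark.sqlContext)  // temp view now queryable over JDBC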
Hi,
I am trying to understand how Mesos allocates memory when off-heap is
enabled, but it seems that the framework is only taking the heap + 400 MB
overhead into consideration for resource allocation.
Example: spark.executor.memory=3g, spark.memory.offHeap.size=1g ==> Mesos
reports 3.4g allocated for