Re: Structured Streaming & Query Planning

2019-03-18 Thread Paolo Platter
March 14, 2019 6:59:50 PM To: Paolo Platter Cc: user@spark.apache.org Subject: Re: Structured Streaming & Query Planning. Hello Paolo, generally speaking, query planning is mostly based on statistics and distributions of data values for the involved columns, which might significantly change

Structured Streaming & Query Planning

2019-03-14 Thread Paolo Platter
to disable or cache it. Thanks. Paolo Platter, CTO. E-mail: paolo.plat...@agilelab.it Web site: www.agilelab.it

R: How to reissue a delegated token after max lifetime passes for a spark streaming application on a Kerberized cluster

2019-01-03 Thread Paolo Platter
with hdfs_delegation_token but is NOT working with “kms-dt”. Does anyone know why this is happening? Any suggestion to make it work with KMS? Thanks. Paolo Platter, CTO. E-mail: paolo.plat...@agilelab.it We

R: Tungsten and Spark Streaming

2015-09-10 Thread Paolo Platter
Do you plan to modify the DStream interface in order to work with DataFrames? It would be nice to handle DStreams without generics. Paolo. Sent from my Windows Phone. From: Tathagata Das Sent: 10/09/2015 07:42 To: N

R: Spark + Druid

2015-09-02 Thread Paolo Platter
Fantastic! I will look into that and I hope to contribute. Paolo. Sent from my Windows Phone. From: Harish Butani Sent: 02/09/2015 06:04 To: user Subject: Spark + Druid Hi, I am working on the

R: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-29 Thread Paolo Platter
Take a look at Zoomdata. They are Spark-based and they offer BI features with good performance. Paolo. Sent from my Windows Phone. From: Ruslan Dautkhanov <dautkha...@gmail.com> Sent: 29/07/2015 06:18 To: renga.kannan <renga.kan...@gmail.com>

R: Spark is much slower than direct access MySQL

2015-07-26 Thread Paolo Platter
If you want a performance boost, you need to load the full table in memory using caching and then execute your query directly on the cached DataFrame. Otherwise you use Spark only as a bridge and you don't leverage Spark's distributed in-memory engine. Paolo. Sent from my Windows Phone
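The cache-first pattern above can be sketched as a plain-Python analogy (not actual Spark code; all names and figures are illustrative): read the source once into an in-memory structure, then serve repeated queries from it rather than going back to the source per query, which is roughly what caching a DataFrame buys you over using Spark as a pass-through bridge.

```python
# Toy analogy in plain Python (not Spark; names are illustrative):
# load the source once into an in-memory "cache", then serve repeated
# queries from it instead of re-reading the source per query.
calls = {"loads": 0}

def load_table():
    calls["loads"] += 1                      # counts trips to the source
    return [("Milan", 10), ("Rome", 20), ("Turin", 5)]

cached = load_table()                        # roughly: df.cache() plus a materializing action

def query(min_hits):                         # every query reads the cached copy
    return [city for city, hits in cached if hits >= min_hits]

print(query(10), query(20), calls["loads"])  # ['Milan', 'Rome'] ['Rome'] 1
```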

R: Is spark suitable for real time query

2015-07-22 Thread Paolo Platter
Are you using the JDBC server? Paolo. Sent from my Windows Phone. From: Louis Hust <louis.h...@gmail.com> Sent: 22/07/2015 13:47 To: Robin East <robin.e...@xense.co.uk> Cc: user@spark.apache.org Subject: Re: Is spark suitable

Scripting with groovy

2015-06-02 Thread Paolo Platter
Hi all, has anyone tried to add scripting capabilities to Spark Streaming using Groovy? I would like to stop the streaming context, update a transformation function written in Groovy (for example to manipulate JSON), restart the streaming context and obtain a new behavior without re-submitting the

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Paolo Platter
Nice job! We are developing something very similar... I will contact you to see whether we can contribute some pieces! Best, Paolo. From: Evo Eftimov <evo.efti...@isecc.com> Date: Thursday, 14 May 2015 17:21 To: 'David Morales' <dmora...@stratio.com>, Matei

SparkSQL + Parquet performance

2015-04-06 Thread Paolo Platter
Hi all, has anyone using SparkSQL + Parquet benchmarked storing Parquet files on HDFS versus CFS (Cassandra File System)? Which storage can improve the performance of SparkSQL + Parquet? Thanks, Paolo

Spark Druid integration

2015-04-06 Thread Paolo Platter
Hi, do you think it is possible to build an integration between Druid and Spark using the Datasource API? Is anyone investigating this kind of solution? I think that Spark SQL could make up for Druid's lack of a complete SQL layer. It could be a great OLAP solution. WDYT? Paolo Platter

R: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread Paolo Platter
Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast. Paolo. Sent from my Windows Phone. From: Irfan Ahmad <ir...@cloudphysics.com> Sent: 19/03/2015 04:05 To: davidh <dav...@annaisystems.com> Cc:

Re: Spark SQL Where IN support

2015-02-23 Thread Paolo Platter
I was speaking about version 1.2 of Spark. Paolo. From: Paolo Platter <paolo.plat...@agilelab.it> Date: Monday, 23 February 2015 10:41 To: user@spark.apache.org Hi guys, is the IN operator supported in Spark SQL over the Hive Metastore? Thanks, Paolo

Spark SQL Where IN support

2015-02-23 Thread Paolo Platter
Hi guys, Is the “IN” operator supported in Spark SQL over Hive Metastore ? Thanks Paolo
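For context, IN follows the usual SQL semantics. A minimal illustration using sqlite3 rather than Spark SQL, purely to show the operator's behavior; the table, columns, and data are invented:

```python
import sqlite3

# Standard SQL "IN" semantics, shown with sqlite3 (not Spark SQL);
# the events table and its contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT, hits INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("Milan", 10), ("Rome", 20), ("Turin", 5)])

# Keep only the rows whose city appears in the IN list.
rows = conn.execute(
    "SELECT city FROM events WHERE city IN ('Milan', 'Rome') ORDER BY city"
).fetchall()
print(rows)  # [('Milan',), ('Rome',)]
```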

SparkSQL and star schema

2015-02-13 Thread Paolo Platter
Hi, is SparkSQL + Parquet suitable to replicate a star schema ? Paolo Platter AgileLab CTO

R: Datastore HDFS vs Cassandra

2015-02-10 Thread Paolo Platter
Hi Mike, I developed a solution with Cassandra and Spark using DSE. The main difficulty is with Cassandra: you need to understand its data model and its query patterns very well. Cassandra has better performance than HDFS and it has DR and stronger availability. HDFS is a filesystem; Cassandra

R: spark 1.2 writing on parquet after a join never ends - GC problems

2015-02-08 Thread Paolo Platter
Could anyone figure out what is going on in my Spark cluster? Thanks in advance, Paolo. Sent from my Windows Phone. From: Paolo Platter <paolo.plat...@agilelab.it> Sent: 06/02/2015 10:48 To: user@spark.apache.org Subject: spark

spark 1.2 writing on parquet after a join never ends - GC problems

2015-02-06 Thread Paolo Platter
Hi all, I’m experiencing a strange behaviour with Spark 1.2. I have a 3-node cluster plus the master. Each node has: 1 HDD 7200 rpm 1 TB, 16 GB RAM, 8 cores. I configured executors with 6 cores and 10 GB each (spark.storage.memoryFraction = 0.6). My job is pretty simple: val file1 =

R: Broadcast variables: when should I use them?

2015-01-26 Thread Paolo Platter
Hi, yes, if they are not big, it's good practice to broadcast them to avoid serializing them each time you use a closure. Paolo. Sent from my Windows Phone. From: frodo777 <roberto.vaquer...@bitmonlab.com> Sent: 26/01/2015 14:34 To:

R: RDD Moving Average

2015-01-06 Thread Paolo Platter
In my opinion you should use the fold pattern, obviously after a sortBy transformation. Paolo. Sent from my Windows Phone. From: Asim Jalis <asimja...@gmail.com> Sent: 06/01/2015 23:11 To: Sean Owen <so...@cloudera.com> Cc:
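The sort-then-fold idea can be sketched in plain Python on a local list (a sketch of the approach, not RDD code; the window size and data are illustrative):

```python
# Sketch of "sort, then fold over a sliding window" for a moving average,
# in plain Python rather than on an RDD. Window size and data are illustrative.
def moving_average(points, window=3):
    """points: iterable of (timestamp, value); returns one (timestamp, avg)
    per full window, keyed by the window's last timestamp."""
    ordered = sorted(points)                      # the sortBy step
    out = []
    for i in range(len(ordered) - window + 1):    # the fold/scan step
        chunk = ordered[i:i + window]
        ts = chunk[-1][0]
        out.append((ts, sum(v for _, v in chunk) / window))
    return out

print(moving_average([(3, 30.0), (1, 10.0), (2, 20.0), (4, 40.0)]))
# [(3, 20.0), (4, 30.0)]
```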

R: Clarifications on Spark

2014-12-05 Thread Paolo Platter
Hi, 1) yes you can. Spark supports a lot of file formats on HDFS/S3, and it also supports Cassandra and JDBC sources in general. 2) yes. Spark has a JDBC Thrift server where you can attach BI tools. I suggest you pay attention to your query response time requirements. 3) no, you can go with

R: Optimized spark configuration

2014-12-05 Thread Paolo Platter
What kind of query are you performing? You should set something like two partitions per core, which would be about 400 MB per partition. As you have a lot of RAM, I suggest caching the whole table; performance will increase a lot. Paolo. Sent from my Windows Phone. From:
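The sizing rule can be written as a small back-of-the-envelope helper (plain Python; the 6400 MB dataset and 8-core figures are assumptions chosen only to reproduce the roughly 400 MB-per-partition number in the reply):

```python
# Hypothetical helper for the rule of thumb in the reply: about two
# partitions per core, so partition size is roughly
# dataset_size / (2 * total_cores). All figures are illustrative.
def suggest_partitions(dataset_mb, total_cores, partitions_per_core=2):
    n = total_cores * partitions_per_core
    return n, dataset_mb / n

# e.g. an assumed 6400 MB table on 8 cores -> 16 partitions of ~400 MB
# each, matching the "400 MB per partition" figure in the reply.
n, size_mb = suggest_partitions(6400, 8)
print(n, size_mb)  # 16 400.0
```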

R: map function

2014-12-04 Thread Paolo Platter
Hi, rdd.flatMap(e => e._2.map(i => (i, e._1))) should work, but I didn't test it, so maybe I'm missing something. Paolo. Sent from my Windows Phone. From: Yifan Li <iamyifa...@gmail.com> Sent: 04/12/2014 09:27 To:
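The suggested transformation inverts (key, items) pairs into one (item, key) pair per item. A plain-Python equivalent of that one-liner, on local lists rather than an RDD:

```python
# Plain-Python equivalent of the suggested Scala
#   rdd.flatMap(e => e._2.map(i => (i, e._1)))
# : turn each (key, [items]) pair into one (item, key) pair per item.
def invert_pairs(pairs):
    return [(item, key) for key, items in pairs for item in items]

print(invert_pairs([("a", [1, 2]), ("b", [3])]))
# [(1, 'a'), (2, 'a'), (3, 'b')]
```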

Re: How to enforce RDD to be cached?

2014-12-03 Thread Paolo Platter
Yes; otherwise you can try rdd.cache().count() and then run your benchmark. Paolo. From: Daniel Darabos <daniel.dara...@lynxanalytics.com> Date: Wednesday, 3 December 2014 12:28 To: shahab <shahab.mok...@gmail.com> Cc: user@spark.apache.org On

Spark Shell strange worker Exception

2014-10-27 Thread Paolo Platter
Hi all, I’m submitting a simple task using the Spark shell against a CassandraRDD (Datastax environment). I’m getting the following exception from one of the workers: INFO 2014-10-27 14:08:03 akka.event.slf4j.Slf4jLogger: Slf4jLogger started INFO 2014-10-27 14:08:03 Remoting: Starting

R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi, Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, so calling it from a web app is suitable. It is open source; you can find it on GitHub. Best, Paolo Platter. From: Ruebenacker, Oliver A <oliver.ruebenac...@altisource.com> Sent

Spark NLP

2014-09-10 Thread Paolo Platter
a dictionary for the Italian language. Any suggestions? Thanks, Paolo Platter

RE: Spark and Shark

2014-09-01 Thread Paolo Platter
We tried to connect the old Simba Shark ODBC driver to the Thrift JDBC server with Spark 1.1 RC2 and it works fine. Best, Paolo. Paolo Platter, Agile Lab CTO. From: Michael Armbrust <mich...@databricks.com> Sent: Monday, 1 September 2014 19:43 To: arthur.hk.c