You can try query push-down by building the query when you create the RDD.

On 2 Dec 2015 12:32, "Fengdong Yu" <fengdo...@everstring.com> wrote:
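To make the push-down suggestion concrete, here is a minimal illustration using Python's stdlib `sqlite3` as a stand-in for the RDBMS (the table, columns, and data are invented for the example). The point is that the `GROUP BY` runs inside the database, so only the small aggregated result crosses the wire; with Spark's JDBC source you get the same effect by passing a subquery such as `"(SELECT ... GROUP BY ...) t"` as the `dbtable` option instead of a bare table name.

```python
# Query push-down illustration: aggregate inside the database instead of
# pulling raw rows into the application and aggregating there.
# Uses stdlib sqlite3 as a stand-in for the RDBMS; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (family TEXT, name TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [("food", "apple", 1.0), ("food", "bread", 2.5), ("tools", "saw", 9.0)],
)

# Pushed-down query: the database does the grouping, and the client only
# receives one row per family rather than every raw row.
rows = conn.execute(
    "SELECT family, SUM(cost) FROM product GROUP BY family ORDER BY family"
).fetchall()
print(rows)  # [('food', 3.5), ('tools', 9.0)]
```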
> It depends on many things:
>
> 1) What's your data format? CSV (text) or ORC/Parquet?
> 2) Do you have a data warehouse to summarize/cluster your data?
>
> If your data is text, or you query the raw data, it will be slow; Spark
> cannot do much to optimize your job.
>
> On Dec 2, 2015, at 9:21 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>
> Mark, we have an application that uses data from different kinds of
> sources, and we built an engine able to handle that, but it can't scale
> to big data (we could make it scale, but that is too time-expensive),
> and it doesn't have a machine learning module, etc. We came across Spark
> and it looks like it has everything we need — actually it does — but our
> latency requirement is very low, and in our tests it took too long to
> get the same kind of results, always against an RDBMS, which is our
> primary source.
>
> So we want to expand our sources to CSV, web services, big data, etc. We
> can either extend our engine or use something like Spark, which gives us
> the power of clustering, access to different kinds of sources,
> streaming, machine learning, easy extensibility, and so on.
>
> On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> I'd ask another question first: if your SQL query can be executed in a
>> performant fashion against a conventional (RDBMS?) database, why are
>> you trying to use Spark? How you answer that question will be the key
>> to deciding among the engineering design tradeoffs and choosing
>> effectively between Spark and some other solution.
>>
>> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi <iaiva...@gmail.com>
>> wrote:
>>
>>> Ok, so the latency problem comes from using SQL as the source? What
>>> about CSV, Hive, or another source?
>>>
>>> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra
>>> <m...@clearstorydata.com> wrote:
>>>
>>>> It is not designed for interactive queries.
>>>>
>>>> You might want to ask the designers of Spark, Spark SQL, and
>>>> particularly some things built on top of Spark (such as BlinkDB)
>>>> about their intent with regard to interactive queries. Interactive
>>>> queries are not the only designed use of Spark, but it is going too
>>>> far to claim that Spark is not designed at all to handle interactive
>>>> queries.
>>>>
>>>> That being said, I think you are correct to question the wisdom of
>>>> expecting lowest-latency query response from Spark using SQL (sic,
>>>> presumably an RDBMS is intended) as the datastore.
>>>>
>>>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke <jornfra...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hmm, it will never be faster than SQL if you use SQL as the
>>>>> underlying storage. Spark is (currently) an in-memory batch engine
>>>>> for iterative machine learning workloads. It is not designed for
>>>>> interactive queries. Currently Hive is moving in the direction of
>>>>> interactive queries. Alternatives are HBase on Phoenix, or Impala.
>>>>>
>>>>> On 01 Dec 2015, at 21:58, Andrés Ivaldi <iaiva...@gmail.com> wrote:
>>>>>
>>>>> Yes, the use case would be: run Spark in a service (I didn't
>>>>> investigate this yet); through API calls to this service we perform
>>>>> some aggregations over data in SQL. We are already doing this with
>>>>> an internal development.
>>>>>
>>>>> Nothing complicated. For instance, a table with Product, Product
>>>>> Family, cost, price, etc. — columns acting as dimensions and
>>>>> measures.
>>>>>
>>>>> I want Spark to query that table and perform a kind of rollup, with
>>>>> cost as the measure and Product, Product Family as the dimensions.
>>>>>
>>>>> Only 3 columns, yet it takes about 20s to perform that query and
>>>>> the aggregation, while the query directly against the database,
>>>>> grouping on those columns, takes about 1s.
>>>>>
>>>>> Regards
>>>>>
>>>>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <jornfra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Can you elaborate more on the use case?
>>>>>>
>>>>>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <iaiva...@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > I'd like to use Spark to perform some transformations over data
>>>>>> > stored in SQL, but I need low latency. In my tests, Spark
>>>>>> > context creation and data queries over SQL take too long.
>>>>>> >
>>>>>> > Any ideas for speeding up the process?
>>>>>> >
>>>>>> > Regards.
>>>>>> >
>>>>>> > --
>>>>>> > Ing. Ivaldi Andres
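The rollup described in the thread (Product Family and Product as dimensions, cost as the measure) can be pinned down in plain Python, independent of Spark, just to show the expected shape of the result; the data here is invented for illustration. Spark's DataFrame API exposes the same operation directly as `df.rollup("family", "product").sum("cost")`.

```python
# Plain-Python sketch of a two-level rollup, equivalent in spirit to
# SQL's GROUP BY ROLLUP(family, product): totals at the finest level,
# subtotals per family, and a grand total. Data is hypothetical.
from collections import defaultdict

rows = [
    ("Food", "Apple", 1.0),
    ("Food", "Bread", 2.5),
    ("Tools", "Saw", 9.0),
]

def rollup(rows):
    """Sum the cost measure at (family, product), (family,), and ()
    grouping levels; None marks a rolled-up dimension."""
    totals = defaultdict(float)
    for family, product, cost in rows:
        totals[(family, product)] += cost   # finest level
        totals[(family, None)] += cost      # subtotal per family
        totals[(None, None)] += cost        # grand total
    return dict(totals)

result = rollup(rows)
print(result[("Food", None)])  # 3.5
print(result[(None, None)])    # 12.5
```

For a three-column table this is a single pass over the data, which is why the database's own grouping finishes in about a second; the 20s figure in the thread is dominated by Spark context startup and raw-row transfer, not by the aggregation itself.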