Re: Subsecond queries possible?

2015-07-01 Thread Eric Pederson
I removed all of the indices from the table in IQ and the time went up to
700ms for the query on the full dataset.  The best time I've got so far
with Spark for the full dataset is 4s with a cached table and 30 cores.

However, every column in IQ is automatically indexed by default, and those
default indexes can't be removed.  They aren't even listed in the metadata.
So even though I removed all of the explicit indexes, the default indexes
are still there.

It's a baseline, but I'm really comparing apples and oranges right now.
It's an interesting experiment nonetheless.
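
For concreteness, "cached table" here means the full dataset registered and
cached through the SQLContext, roughly along these lines (the load path and
table name are placeholders, not the actual job):

val spp = sqlContext.read.parquet("/data/spp")       // assumed source path
spp.registerTempTable("spp")
sqlContext.cacheTable("spp")                         // materialized in memory on first scan
sqlContext.sql("select count(*) from spp").show()    // force the cache to populate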



-- Eric

On Wed, Jul 1, 2015 at 12:47 PM, Debasish Das 
wrote:

> If you take bitmap indices out of sybase then I am guessing spark sql will
> be at par with sybase ?
>
> On that note are there plans of integrating indexed rdd ideas to spark sql
> to build indices ? Is there a JIRA tracking it ?
> On Jun 30, 2015 7:29 PM, "Eric Pederson"  wrote:
>
>> Hi Debasish:
>>
>> We have the same dataset running on SybaseIQ and after the caches are
>> warm the queries come back in about 300ms.  We're looking at options to
>> relieve overutilization and to bring down licensing costs.  I realize that
>> Spark may not be the best fit for this use case but I'm interested to see
>> how far it can be pushed.
>>
>> Thanks for your help!
>>
>>
>> -- Eric
>>
>> On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das 
>> wrote:
>>
>>> I got good runtime improvement from hive partitioning, caching the
>>> dataset and increasing the cores through repartition...I think for your
>>> case generating mysql style indexing will help further...it is not supported
>>> in spark sql yet...
>>>
>>> I know the dataset might be too big for 1 node mysql but do you have a
>>> runtime estimate from running the same query on mysql with appropriate
>>> column indexing ? That should give us a good baseline number...
>>>
>>> For my case at least I could not put the data on 1 node mysql as it was
>>> big...
>>>
>>> If you can write the problem in a document view you can use a document
>>> store like solr/elasticsearch to boost runtime...the reverse indices can get
>>> you subsecond latencies...again the schema design matters for that and you
>>> might have to let go some of sql expressiveness (like balance in a
>>> predefined bucket might be fine but looking for the exact number might be
>>> slow)
>>>
>>
>>


Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
If you take bitmap indices out of sybase then I am guessing spark sql will
be at par with sybase ?

On that note are there plans of integrating indexed rdd ideas to spark sql
to build indices ? Is there a JIRA tracking it ?
On Jun 30, 2015 7:29 PM, "Eric Pederson"  wrote:

> Hi Debasish:
>
> We have the same dataset running on SybaseIQ and after the caches are warm
> the queries come back in about 300ms.  We're looking at options to relieve
> overutilization and to bring down licensing costs.  I realize that Spark
> may not be the best fit for this use case but I'm interested to see how far
> it can be pushed.
>
> Thanks for your help!
>
>
> -- Eric
>
> On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das 
> wrote:
>
>> I got good runtime improvement from hive partitioning, caching the
>> dataset and increasing the cores through repartition...I think for your
>> case generating mysql style indexing will help further...it is not supported
>> in spark sql yet...
>>
>> I know the dataset might be too big for 1 node mysql but do you have a
>> runtime estimate from running the same query on mysql with appropriate
>> column indexing ? That should give us a good baseline number...
>>
>> For my case at least I could not put the data on 1 node mysql as it was
>> big...
>>
>> If you can write the problem in a document view you can use a document
>> store like solr/elasticsearch to boost runtime...the reverse indices can get
>> you subsecond latencies...again the schema design matters for that and you
>> might have to let go some of sql expressiveness (like balance in a
>> predefined bucket might be fine but looking for the exact number might be
>> slow)
>>
>
>


Re: Subsecond queries possible?

2015-06-30 Thread Eric Pederson
Hi Debasish:

We have the same dataset running on SybaseIQ and after the caches are warm
the queries come back in about 300ms.  We're looking at options to relieve
overutilization and to bring down licensing costs.  I realize that Spark
may not be the best fit for this use case but I'm interested to see how far
it can be pushed.

Thanks for your help!


-- Eric

On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das 
wrote:

> I got good runtime improvement from hive partitioning, caching the dataset
> and increasing the cores through repartition...I think for your case
> generating mysql style indexing will help further...it is not supported in
> spark sql yet...
>
> I know the dataset might be too big for 1 node mysql but do you have a
> runtime estimate from running the same query on mysql with appropriate
> column indexing ? That should give us a good baseline number...
>
> For my case at least I could not put the data on 1 node mysql as it was
> big...
>
> If you can write the problem in a document view you can use a document
> store like solr/elasticsearch to boost runtime...the reverse indices can get
> you subsecond latencies...again the schema design matters for that and you
> might have to let go some of sql expressiveness (like balance in a
> predefined bucket might be fine but looking for the exact number might be
> slow)
>


Re: Subsecond queries possible?

2015-06-30 Thread Debasish Das
I got good runtime improvement from hive partitioning, caching the dataset
and increasing the cores through repartition...I think for your case
generating mysql style indexing will help further...it is not supported in
spark sql yet...

I know the dataset might be too big for 1 node mysql but do you have a
runtime estimate from running the same query on mysql with appropriate
column indexing ? That should give us a good baseline number...

For my case at least I could not put the data on 1 node mysql as it was
big...

If you can write the problem in a document view you can use a document
store like solr/elasticsearch to boost runtime...the reverse indices can get
you subsecond latencies...again the schema design matters for that and you
might have to let go some of sql expressiveness (like balance in a
predefined bucket might be fine but looking for the exact number might be
slow)
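
A rough sketch of the partition + cache + repartition combination in Spark
1.4-era code (the paths, partition columns and partition count are
placeholders, and DataFrameWriter partitionBy is used here as a stand-in for
the Hive partitioning mentioned above):

val raw = sqlContext.read.parquet("/data/spp")            // assumed source
raw.write.partitionBy("product", "coupon")                // one directory per key
   .parquet("/data/spp_by_product_coupon")
val spp = sqlContext.read.parquet("/data/spp_by_product_coupon")
   .repartition(60)                                       // spread the scan over more cores
spp.registerTempTable("spp")
sqlContext.cacheTable("spp")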


Subsecond queries possible?

2015-06-30 Thread Eric Pederson
Hi Debasish -

No, there are actually 14 columns, any of which can be specified at runtime
by the user.  There is a UI which allows the user to specify predicates on
any of the 14 columns.  They press submit on this form and we generate
a filter like the one below:

val fnm303 = spp.filter("product = 'FNM30' and coupon = '3.0' and cs < 700
and state = 'NY' and occ_status = 'Y' and prop_type = 'MF' and balance >
25")


Then we group/aggregate/collect based on that filter (and hopefully get an
answer back in under a second).  We can't preaggregate the data because of
the combinatorial explosion.
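
The group/aggregate/collect step has roughly this shape (the grouping keys
and the aggregates here are placeholders, not the real report definition):

import org.apache.spark.sql.functions._

val result = fnm303
  .groupBy("state", "prop_type")     // assumed grouping columns
  .agg(sum("balance"), avg("cs"))    // assumed aggregates
  .collect()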

When I get back to the office I'll get you the numbers for the uncached,
partitioned queries.  I remember that for a partitioned Parquet table of
the full dataset it is 3.5 sec using 20 cores.

Thanks,

On Tuesday, June 30, 2015, Debasish Das wrote:

> You have 4 columns right...You should be able to extract 3 KV structures:
> (product, coupon), (product, cs) and (product, balance)...Now map your SQL
> to API calls and let the API combine the results from 3 KV scans to
> construct the result...I agree that as this becomes more complicated you
> will move from a KV store to Spark SQL...
>
> You should be able to partition the dataset through HiveContext, and I
> think even Spark 1.3/1.4 supports writing/reading partitioned
> tables...but I have not tried it yet to see the performance
> impact...the documentation says Spark SQL should read partitioned tables...
>
> Could you please share your results with partitioned tables ?
>
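
One possible shape for the quoted KV idea is sketched below (the rowId
column, the column types and the example predicate are all illustrative
assumptions, not anything from the thread):

import org.apache.spark.sql.functions._

val withId  = spp.withColumn("rowId", monotonicallyIncreasingId())
// One keyed structure per filterable column, keyed by product.
val coupon  = withId.select("product", "rowId", "coupon").rdd
                    .map(r => (r.getString(0), (r.getLong(1), r.getString(2))))
val balance = withId.select("product", "rowId", "balance").rdd
                    .map(r => (r.getString(0), (r.getLong(1), r.getDouble(2))))

// "product = 'FNM30' and coupon = '3.0' and balance > 25" as two scans,
// combined on rowId by the serving API.
val couponHits  = coupon.filter { case (p, (_, c)) => p == "FNM30" && c == "3.0" }.map(_._2._1)
val balanceHits = balance.filter { case (p, (_, b)) => p == "FNM30" && b > 25 }.map(_._2._1)
val matchingIds = couponHits.intersection(balanceHits)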
> On Tue, Jun 30, 2015 at 5:24 AM, Eric Pederson  wrote:
>
>> Hi Deb -
>>
>> One other consideration is that the filter will always contain two key
>> columns (product and coupon) but will also contain other columns.  For
>> example, instead of
>>
>> val fnm303 = spp.filter("product = 'FNM30' and coupon = '3.0'")
>>
>>
>> the user could ask for:
>>
>> val fnm303 = spp.filter("product = 'FNM30' and coupon = '3.0' and cs <
>> 700 and balance > 25")
>>
>>
>> The filter conditions are supplied by the user in a UI.  So the number of
>> possible combinations is way too large to, as you suggest, pre-aggregate
>> and query against the pre-aggregated data.   Because of this we'll always
>> have to scan through all of the data.
>>
>> I think we might be able to throw more machines at the problem now.
>> Previously the time flattened out around 10 cores (ie, 20, 30 cores made no
>> improvement).  But that was before I reduced spark.sql.shuffle.partitions.
>>
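
As a side note on that knob, it is just a SQLContext setting (the value 40
below is an illustrative guess; the default in this era of Spark SQL is 200):

sqlContext.setConf("spark.sql.shuffle.partitions", "40")
// or equivalently, from SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=40")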
>>
>> This brings up another question/issue - there doesn't seem to be a way to
>> partition cached tables in the same way you can partition, say a Hive
>> table.  For example, we would like to partition the overall dataset (233m
>> rows, 9.2Gb) by (product, coupon) so when we run one of these queries
>> Spark won't have to scan all the data, just the partition from the query,
>> eg, (FNM30, 3.0).
>>
>> Thanks!
>>
>> -- Eric
>>
>> On Mon, Jun 29, 2015 at 11:29 PM, Debasish Das 
>> wrote:
>>
>>> Eric,
>>>
>>> We have discussed use cases like these over here, and the dataset sizes
>>> are similar to what you mentioned.
>>>
>>> I could get about 3 seconds out of Spark SQL, but with Spark IndexedRDD we
>>> can get to the ~100 ms domain. I feel that Memcached/HBase will take us to
>>> the 10 ms domain if you can extract the key-value structure out of your SQL
>>> rows. Basically, your groupBy flow will generate the data that you will put
>>> in a kv store.
>>>
>>> https://github.com/amplab/spark-indexedrdd/issues/5
>>>
>>> I will update the issue further as I finish the comparisons.
>>>
>>> I feel that to serve the models in a meaningful way, we need Spark job
>>> access (perhaps through RDD and IndexedRDD) and a key-value based solution.
>>> I am not yet sure whether, for serving models, we need a write-optimized
>>> solution like HBase/Cassandra or whether a read-optimized Memcached/Redis
>>> is sufficient.
>>>
>>> Thanks.
>>> Deb
>>>
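
A minimal sketch of the groupBy-flow-to-kv-store idea above (the key choice,
the aggregate and the result types are assumptions; the real store would be
Memcached/HBase/Redis rather than a local map):

import org.apache.spark.sql.functions._

val byKey = spp.groupBy("product", "coupon").agg(sum("balance").as("total_balance"))
// Ship the (key -> aggregate) pairs to whatever store serves the UI;
// here they just land in a local map for illustration.
val kv: Map[(String, String), Double] =
  byKey.collect().map(r => ((r.getString(0), r.getString(1)), r.getDouble(2))).toMap
kv.get(("FNM30", "3.0"))   // a constant-time lookup instead of a full scan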
>>> On Mon, Jun 29, 2015 at 5:57 PM, Eric Pederson 
>>> wrote:
>>>
>>>> Hi Ilya -
>>>>
>>>> I will try that - thanks.  However, I want to aggregate across the
>>>> dataset, not do lookups.   Does your advice still apply?
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Mo

Re: Subsecond queries possible?

2015-06-30 Thread Michael Armbrust
>
> This brings up another question/issue - there doesn't seem to be a way to
> partition cached tables in the same way you can partition, say a Hive
> table.  For example, we would like to partition the overall dataset (233m
> rows, 9.2Gb) by (product, coupon) so when we run one of these queries
> Spark won't have to scan all the data, just the partition from the query,
> eg, (FNM30, 3.0).
>

If you order the data on the interesting column before caching, we keep
min/max statistics that let us do similar data skipping automatically.
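
A minimal sketch of that tip, assuming the same spp table from earlier in
the thread (the column names are the ones used in the example filters):

val ordered = spp.sort("product", "coupon")   // cluster rows by the filter columns
ordered.registerTempTable("spp_sorted")
sqlContext.cacheTable("spp_sorted")
// In-memory batches whose min/max for product/coupon exclude the predicate
// can then be skipped instead of scanned.
sqlContext.sql(
  "select sum(balance) from spp_sorted where product = 'FNM30' and coupon = '3.0'").show()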