Re: Offline elastic index creation

2022-11-10 Thread Debasish Das
Hi Vibhor,

We worked on a project to create Lucene indexes using Spark, but the project
has not been maintained for some time now. If there is interest, we can
resurrect it:

https://github.com/vsumanth10/trapezium/blob/master/dal/src/test/scala/com/verizon/bda/trapezium/dal/lucene/LuceneIndexerSuite.scala
https://www.databricks.com/session/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building

After the Lucene indexes were created, we uploaded them to Solr for the search
UI. We did not ingest them into Elasticsearch, though.

Our scale was 100M+ rows and 100K+ columns, and Spark + Lucene worked fine.
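
For flavor, here is a minimal sketch of the per-partition indexing idea (not
the trapezium code; the Lucene IndexWriter usage, the local shard paths and the
toy rows are illustrative assumptions):

import java.nio.file.Paths
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, StringField, TextField}
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.FSDirectory
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lucene-shards").getOrCreate()
val rows = spark.sparkContext.parallelize(Seq(("1", "hello world"), ("2", "spark and lucene")))

// One Lucene shard per partition, written to executor-local disk; the shards
// are then shipped to the search tier (Solr in our case).
rows.foreachPartition { it =>
  val dir = FSDirectory.open(Paths.get(s"/tmp/lucene-shard-${java.util.UUID.randomUUID()}"))
  val writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))
  it.foreach { case (id, text) =>
    val doc = new Document()
    doc.add(new StringField("id", id, Field.Store.YES))   // exact-match key
    doc.add(new TextField("body", text, Field.Store.NO))  // analyzed full-text field
    writer.addDocument(doc)
  }
  writer.close()
  dir.close()
}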

Thank you.
Deb


On Wed, Nov 9, 2022, 10:13 AM Vibhor Gupta 
wrote:

> Hi Spark Community,
>
> Is there a way to create elastic indexes offline and then import them to
> an elastic cluster ?
> We are trying to load an elastic index with around 10B documents (~1.5 to
> 2 TB data) using spark daily.
>
> I know elastic provides a snapshot restore functionality through
> GCS/S3/Azure, but is there a way to generate this snapshot offline using
> spark ?
>
> Thanks,
> Vibhor Gupta
>


Re: dremel paper example schema

2018-10-29 Thread Debasish Das
The open-source implementation of Dremel's columnar storage format is Parquet!
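
Spark writes arrays with the standard 3-level LIST encoding by default, which
is why the dump below differs from the paper's repeated fields. As a hedged
pointer only (not a guaranteed fix), these session settings are worth
experimenting with when chasing the older layout or the conversion error
mentioned below (assuming spark is your SparkSession):

// Write lists in the pre-standard 2-level Parquet layout instead of 3-level LIST groups.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// The vectorized reader is stricter about schema conversion; falling back to the
// parquet-mr reader sometimes avoids "Parquet column cannot be converted" errors.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")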

On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta 
wrote:

> Hi,
>
> why not just use dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <
> lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the example from dremel paper
>> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark
>> using
>> pyspark and I wonder if it is possible at all?
>>
>> Trying to follow the paper example as close as possible I created this
>> document type:
>>
>> from pyspark.sql.types import *
>>
>> links_type = StructType([
>> StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>> nullable=False),
>> StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>> nullable=False),
>> ])
>>
>> language_type = StructType([
>> StructField("Code", StringType(), nullable=False),
>> StructField("Country", StringType())
>> ])
>>
>> names_type = StructType([
>> StructField("Language", ArrayType(language_type, containsNull=False)),
>> StructField("Url", StringType()),
>> ])
>>
>> document_type = StructType([
>> StructField("DocId", LongType(), nullable=False),
>> StructField("Links", links_type, nullable=True),
>> StructField("Name", ArrayType(names_type, containsNull=False))
>> ])
>>
>> But when I store data in parquet using this type, the resulting parquet
>> schema is different from the described in the paper:
>>
>> message spark_schema {
>>   required int64 DocId;
>>   optional group Links {
>> required group Backward (LIST) {
>>   repeated group list {
>> required int32 element;
>>   }
>> }
>> required group Forward (LIST) {
>>   repeated group list {
>> required int32 element;
>>   }
>> }
>>   }
>>   optional group Name (LIST) {
>> repeated group list {
>>   required group element {
>> optional group Language (LIST) {
>>   repeated group list {
>> required group element {
>>   required binary Code (UTF8);
>>   optional binary Country (UTF8);
>> }
>>   }
>> }
>> optional binary Url (UTF8);
>>   }
>> }
>>   }
>> }
>>
>> Moreover, if I create a parquet file with schema described in the dremel
>> paper using Apache Parquet Java API and try to read it into Apache Spark,
>> I
>> get an exception:
>>
>> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
>> while reading parquet files. One possible cause: Parquet column cannot be
>> converted in the corresponding files
>>
>> Is it possible to create example schema described in the dremel paper
>> using
>> Apache Spark and what is the correct approach to build this example?
>>
>> Regards,
>> Lubomir Chorbadjiev
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi,

ECOS is a solver for second order conic programs and we showed the Spark
integration at 2014 Spark Summit
https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/.
Right now the examples show how to reformulate matrix factorization as a
SOCP and solve each alternating steps with ECOS:

https://github.com/embotech/ecos-java-scala

For distributed optimization, I expect it will be useful where for each
primary row key (sensor, car, robot :-) we are fitting a constrained
quadratic / cone program. Please try it out and let me know the feedbacks.

Thanks.
Deb


Re: Restful API Spark Application

2017-05-16 Thread Debasish Das
You can run l
On May 15, 2017 3:29 PM, "Nipun Arora"  wrote:

> Thanks all for your response. I will have a look at them.
>
> Nipun
>
> On Sat, May 13, 2017 at 2:38 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> It's in scala but it should be portable in java
>> https://github.com/vgkowski/akka-spark-experiments
>>
>>
>> Le 12 mai 2017 10:54 PM, "Василец Дмитрий"  a
>> écrit :
>>
>> and livy https://hortonworks.com/blog/livy-a-rest-interface-for-
>> apache-spark/
>>
>> On Fri, May 12, 2017 at 10:51 PM, Sam Elamin 
>> wrote:
>> > Hi Nipun
>> >
>> > Have you checked out the job server
>> >
>> > https://github.com/spark-jobserver/spark-jobserver
>> >
>> > Regards
>> > Sam
>> > On Fri, 12 May 2017 at 21:00, Nipun Arora 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> We have written a java spark application (primarily uses spark sql). We
>> >> want to expand this to provide our application "as a service". For
>> this, we
>> >> are trying to write a REST API. While a simple REST API can be easily
>> made,
>> >> and I can get Spark to run through the launcher. I wonder, how the
>> spark
>> >> context can be used by service requests, to process data.
>> >>
>> >> Are there any simple JAVA examples to illustrate this use-case? I am
>> sure
>> >> people have faced this before.
>> >>
>> >>
>> >> Thanks
>> >> Nipun
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>


Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well...check out SPARK-4823...you can compare
quality with the approximate variant...
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:

> Hi everyone,
> Since spark 2.1.0 introduces LSH (http://spark.apache.org/docs/
> latest/ml-features.html#locality-sensitive-hashing), we want to use LSH
> to find approximately nearest neighbors. Basically, We have dataset with
> about 7M rows. we want to use cosine distance to meassure the similarity
> between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I try to tune some configurations such
> as serialization, memory fraction, executor memory (~6G), number of
> executors ( ~20), memory overhead ..., but nothing works. I often get error
> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
> this implementation is done by engineer at Uber but I don't know right
> configurations,.. to run the algorithm at scale. Do they need very big
> memory to run it?
>
> Any help would be appreciated.
> Thanks
>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-05 Thread Debasish Das
Hi Aseem,

Due to a production deployment we did not upgrade to 2.0, but that's a critical
item on our list.

For exposing models out of PipelineModel, let me look into the ML
tasks...we should add it, since a DataFrame should not be a must for model
scoring...many times models are scored on API or streaming paths which don't
have micro-batching involved...data lands directly from HTTP or Kafka/message
queues...for such cases raw access to the ML model is essential, similar to
mllib model access...
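
As a rough illustration of that mllib-style access (a minimal sketch; the model
path, the features and the ambient SparkContext sc are placeholders, not our
production code):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Load once at service start-up (loading needs the SparkContext)...
val model = LogisticRegressionModel.load(sc, "hdfs:///models/lr")

// ...then score a single vector per request, with no DataFrame or micro-batch involved.
val score = model.predict(Vectors.dense(0.1, 2.3, -0.5))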

Thanks.
Deb
On Feb 4, 2017 9:58 PM, "Aseem Bansal" <asmbans...@gmail.com> wrote:

> @Debasish
>
> I see that the spark version being used in the project that you mentioned
> is 1.6.0. I would suggest that you take a look at some blogs related to
> Spark 2.0 Pipelines, Models in new ml package. The new ml package's API as
> of latest Spark 2.1.0 release has no way to call predict on single vector.
> There is no API exposed. It is WIP but not yet released.
>
> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> If we expose an API to access the raw models out of PipelineModel can't
>> we call predict directly on it from an API ? Is there a task open to expose
>> the model out of PipelineModel so that predict can be called on itthere
>> is no dependency of spark context in ml model...
>> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>
>>>
>>>- In Spark 2.0 there is a class called PipelineModel. I know that
>>>the title says pipeline but it is actually talking about PipelineModel
>>>trained via using a Pipeline.
>>>- Why PipelineModel instead of pipeline? Because usually there is a
>>>series of stuff that needs to be done when doing ML which warrants an
>>>ordered sequence of operations. Read the new spark ml docs or one of the
>>>databricks blogs related to spark pipelines. If you have used python's
>>>sklearn library the concept is inspired from there.
>>>- "once model is deserialized as ml model from the store of choice
>>>within ms" - The timing of loading the model was not what I was
>>>referring to when I was talking about timing.
>>>- "it can be used on incoming features to score through
>>>spark.ml.Model predict API". The predict API is in the old mllib package
>>>not the new ml package.
>>>- "why r we using dataframe and not the ML model directly from API"
>>>- Because as of now the new ml package does not have the direct API.
>>>
>>>
>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> I am not sure why I will use pipeline to do scoring...idea is to build
>>>> a model, use model ser/deser feature to put it in the row or column store
>>>> of choice and provide a api access to the model...we support these
>>>> primitives in github.com/Verizon/trapezium...the api has access to
>>>> spark context in local or distributed mode...once model is deserialized as
>>>> ml model from the store of choice within ms, it can be used on incoming
>>>> features to score through spark.ml.Model predict API...I am not clear on
>>>> 2200x speedup...why r we using dataframe and not the ML model directly from
>>>> API ?
>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>>
>>>>> Does this support Java 7?
>>>>> What is your timezone in case someone wanted to talk?
>>>>>
>>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>>>> wrote:
>>>>>
>>>>>> Hey Aseem,
>>>>>>
>>>>>> We have built pipelines that execute several string indexers, one hot
>>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>>> Execution time for the linear regression was on the order of 11
>>>>>> microseconds, a bit longer for random forest. This can be further 
>>>>>> optimized
>>>>>> by using row-based transformations if your pipeline is simple to around 
>>>>>> 2-3
>>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>>> the time all the processing was done, we had somewhere around 1000 
>>>>>> features
>>>>>> or so going into the linear regression after one hot encoding and
>>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
Except of course LDA, ALS and neural net models...for them the model needs to
be either pre-scored and cached on a KV store, or the matrices / graph should
be kept on the KV store and accessed through a REST API to serve the
output...for neural nets it's more fun, since it's a distributed or local graph
over which the TensorFlow compute needs to run...

In trapezium we support writing these models to stores like Cassandra and
Lucene, for example, and then provide a config-driven akka-http based API to
add the business logic that accesses these models from the store and exposes
the model serving as a REST endpoint.

Matrix, graph and kernel models we use a lot, and for them it turned out that
mllib-style model predict was useful if we change the underlying store...
On Feb 4, 2017 9:37 AM, "Debasish Das" <debasish.da...@gmail.com> wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on itthere
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>>
>>- In Spark 2.0 there is a class called PipelineModel. I know that the
>>title says pipeline but it is actually talking about PipelineModel trained
>>via using a Pipeline.
>>- Why PipelineModel instead of pipeline? Because usually there is a
>>series of stuff that needs to be done when doing ML which warrants an
>>ordered sequence of operations. Read the new spark ml docs or one of the
>>databricks blogs related to spark pipelines. If you have used python's
>>sklearn library the concept is inspired from there.
>>- "once model is deserialized as ml model from the store of choice
>>within ms" - The timing of loading the model was not what I was
>>referring to when I was talking about timing.
>>- "it can be used on incoming features to score through
>>spark.ml.Model predict API". The predict API is in the old mllib package
>>not the new ml package.
>>- "why r we using dataframe and not the ML model directly from API" -
>>Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>
>>>> Does this support Java 7?
>>>> What is your timezone in case someone wanted to talk?
>>>>
>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>>> wrote:
>>>>
>>>>> Hey Aseem,
>>>>>
>>>>> We have built pipelines that execute several string indexers, one hot
>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>> Execution time for the linear regression was on the order of 11
>>>>> microseconds, a bit longer for random forest. This can be further 
>>>>> optimized
>>>>> by using row-based transformations if your pipeline is simple to around 
>>>>> 2-3
>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>> the time all the processing was done, we had somewhere around 1000 
>>>>> features
>>>>> or so going into the linear regression after one hot encoding and
>>>>> everything else.
>>>>>
>>>>> Hope this helps,
>>>>> Hollin
>>>>>
>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>>>>>> wrote:
>>>>>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
If we expose an API to access the raw models out of PipelineModel, can't we
call predict directly on it from an API? Is there a task open to expose
the model out of PipelineModel so that predict can be called on it...there
is no dependency on the Spark context in an ml model...
On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:

>
>- In Spark 2.0 there is a class called PipelineModel. I know that the
>title says pipeline but it is actually talking about PipelineModel trained
>via using a Pipeline.
>- Why PipelineModel instead of pipeline? Because usually there is a
>series of stuff that needs to be done when doing ML which warrants an
>ordered sequence of operations. Read the new spark ml docs or one of the
>databricks blogs related to spark pipelines. If you have used python's
>sklearn library the concept is inspired from there.
>- "once model is deserialized as ml model from the store of choice
>within ms" - The timing of loading the model was not what I was
>referring to when I was talking about timing.
>- "it can be used on incoming features to score through spark.ml.Model
>predict API". The predict API is in the old mllib package not the new ml
>package.
>- "why r we using dataframe and not the ML model directly from API" -
>Because as of now the new ml package does not have the direct API.
>
>
> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> I am not sure why I will use pipeline to do scoring...idea is to build a
>> model, use model ser/deser feature to put it in the row or column store of
>> choice and provide a api access to the model...we support these primitives
>> in github.com/Verizon/trapezium...the api has access to spark context in
>> local or distributed mode...once model is deserialized as ml model from the
>> store of choice within ms, it can be used on incoming features to score
>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>> r we using dataframe and not the ML model directly from API ?
>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>
>>> Does this support Java 7?
>>> What is your timezone in case someone wanted to talk?
>>>
>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>> wrote:
>>>
>>>> Hey Aseem,
>>>>
>>>> We have built pipelines that execute several string indexers, one hot
>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>> Execution time for the linear regression was on the order of 11
>>>> microseconds, a bit longer for random forest. This can be further optimized
>>>> by using row-based transformations if your pipeline is simple to around 2-3
>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>> the time all the processing was done, we had somewhere around 1000 features
>>>> or so going into the linear regression after one hot encoding and
>>>> everything else.
>>>>
>>>> Hope this helps,
>>>> Hollin
>>>>
>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com>
>>>> wrote:
>>>>
>>>>> Does this support Java 7?
>>>>>
>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Is computational time for predictions on the order of few
>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>
>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>>
>>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>>>>>> about MLeap and how you can use it to build production services from 
>>>>>>> your
>>>>>>> Spark-trained ML pipelines. MLeap is an open-source technology that 
>>>>>>> allows
>>>>>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>>>>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>>>>>> dependencies on a Spark context and the serialization format is entirely
>>>>>>> based on Protobuf 3 and JSON.

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
I am not sure why I would use a pipeline to do scoring...the idea is to build a
model, use the model ser/deser feature to put it in the row or column store of
choice and provide API access to the model...we support these primitives
in github.com/Verizon/trapezium...the API has access to a Spark context in
local or distributed mode...once the model is deserialized as an ml model from
the store of choice within ms, it can be used on incoming features to score
through the spark.ml.Model predict API...I am not clear on the 2200x
speedup...why are we using a DataFrame and not the ML model directly from the API?
On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:

> Does this support Java 7?
> What is your timezone in case someone wanted to talk?
>
> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins  wrote:
>
>> Hey Aseem,
>>
>> We have built pipelines that execute several string indexers, one hot
>> encoders, scaling, and a random forest or linear regression at the end.
>> Execution time for the linear regression was on the order of 11
>> microseconds, a bit longer for random forest. This can be further optimized
>> by using row-based transformations if your pipeline is simple to around 2-3
>> microseconds. The pipeline operated on roughly 12 input features, and by
>> the time all the processing was done, we had somewhere around 1000 features
>> or so going into the linear regression after one hot encoding and
>> everything else.
>>
>> Hope this helps,
>> Hollin
>>
>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
>> wrote:
>>
>>> Does this support Java 7?
>>>
>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>>> wrote:
>>>
 Is computational time for predictions on the order of few milliseconds
 (< 10 ms) like the old mllib library?

 On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
 wrote:

> Hey everyone,
>
>
> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
> about MLeap and how you can use it to build production services from your
> Spark-trained ML pipelines. MLeap is an open-source technology that allows
> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
> Models to a scoring engine instantly. The MLeap execution engine has no
> dependencies on a Spark context and the serialization format is entirely
> based on Protobuf 3 and JSON.
>
>
> The recent 0.5.0 release provides serialization and inference support
> for close to 100% of Spark transformers (we don’t yet support ALS and 
> LDA).
>
>
> MLeap is open-source, take a look at our Github page:
>
> https://github.com/combust/mleap
>
>
> Or join the conversation on Gitter:
>
> https://gitter.im/combust/mleap
>
>
> We have a set of documentation to help get you started here:
>
> http://mleap-docs.combust.ml/
>
>
> We even have a set of demos, for training ML Pipelines and linear,
> logistic and random forest models:
>
> https://github.com/combust/mleap-demo
>
>
> Check out our latest MLeap-serving Docker image, which allows you to
> expose a REST interface to your Spark ML pipeline models:
>
> http://mleap-docs.combust.ml/mleap-serving/
>
>
> Several companies are using MLeap in production and even more are
> currently evaluating it. Take a look and tell us what you think! We hope 
> to
> talk with you soon and welcome feedback/suggestions!
>
>
> Sincerely,
>
> Hollin and Mikhail
>


>>>
>>
>


Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Debasish Das
You may want to check out the release/1.2 branch or the 1.2.0 tag and build it
yourself in case the packages are not available.
On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim" 
wrote:

> Hi Ayan,
>
> Thanks a million.
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 15 January 2017 at 22:48, ayan guha  wrote:
>
>> archive.apache.org will always have all the releases:
>> http://archive.apache.org/dist/spark/
>>
>> @Spark users: it may be a good idea to have a "To download older
>> versions, click here" link to Spark Download page?
>>
>> On Mon, Jan 16, 2017 at 8:16 AM, Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> Hi,
>>>
>>> I am looking for Spark 1.2.0 version. I tried to download in the Spark
>>> website but it's no longer available.
>>>
>>> Any suggestion?
>>>
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>> _
>>> *Md. Rezaul Karim*, BSc, MSc
>>> PhD Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>> 
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
Hi Manoj,

I have a Spark meetup talk that explains the issues with DIMSUM when you
have to calculate row similarities. You can still use the PR since it has
all the code you need, but I have not had time to refactor it for the merge.
I believe a few kernels are supported as well.
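
For the brute-force route, roughly (a minimal sketch with toy vectors and an
ambient SparkContext sc; the i < j filter is just one way to keep each
unordered pair once):

import org.apache.spark.mllib.linalg.Vectors

// Brute-force pairwise squared distances; workable when ~n^2/2 pairs fit the shuffle budget.
val vecs = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(3.0, 1.0))).zipWithIndex().map(_.swap)

val pairDistances = vecs.cartesian(vecs)
  .filter { case ((i, _), (j, _)) => i < j }
  .map { case ((i, a), (j, b)) => ((i, j), Vectors.sqdist(a, b)) }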

Thanks.
Deb
On Jul 7, 2016 8:13 PM, "Manoj Awasthi" <awasthi.ma...@gmail.com> wrote:

>
> Hi Debasish, All,
>
> I see the status of SPARK-4823 [0] is "in-progress" still. I couldn't
> gather from the relevant pull request [1] if part of it is already in 1.6.0
> (it's closed now). We are facing the same problem of computing pairwise
> distances between vectors where rows are > 5M and columns in tens (20 to be
> specific). DIMSUM doesn't help because of obvious reasons (transposing the
> matrix infeasible) already discussed in JIRA.
>
> Is there an update on the JIRA ticket above and can I use something to
> compute RowSimilarity in spark 1.6.0 on my dataset? I will be thankful for
> any other ideas too on this.
>
> - Manoj
>
> [0] https://issues.apache.org/jira/browse/SPARK-4823
> [1] https://github.com/apache/spark/pull/6213
>
>
>
> On Thu, Apr 30, 2015 at 6:40 PM, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> Thank you guys for the input.
>>
>> Ayan, I am not sure how this can be done using reduceByKey, as far as I
>> can see (but I am not so advanced with Spark), this requires a groupByKey
>> which can be very costly. What would be nice to transform the dataset which
>> contains all the vectors like:
>>
>>
>> val localData = data.zipWithUniqueId().map(_.swap) // Provide some keys
>> val cartesianProduct = localData.cartesian(localData) // Provide the pairs
>> val groupedByKey = cartesianProduct.groupByKey()
>>
>> val neighbourhoods = groupedByKey.map {
>>   case (point: (Long, VectorWithNormAndClass), points: Iterable[(Long,
>> VectorWithNormAndClass)]) => {
>> val distances = points.map {
>>   case (idxB: Long, pointB: VectorWithNormAndClass) =>
>> (idxB, MLUtils.fastSquaredDistance(point._2.vector,
>> point._2.norm, pointB.vector, pointB.norm))
>> }
>>
>> val kthDistance =
>> distances.sortBy(_._2).take(K).max(compareByDistance)
>>
>> (point, distances.filter(_._2 <= kthDistance._2))
>>   }
>> }
>>
>> This is part of my Local Outlier Factor implementation.
>>
>> Of course the distances can be sorted because it is an Iterable, but it
>> gives an idea. Is it possible to make this more efficient? I don't want to
>> use probabilistic functions, and I will cache the matrix because many
>> distances are looked up at the matrix, computing them on demand would
>> require far more computations.​
>>
>> ​​Kind regards,
>> Fokko
>>
>>
>>
>> 2015-04-30 4:39 GMT+02:00 Debasish Das <debasish.da...@gmail.com>:
>>
>>> Cross Join shuffle space might not be needed since most likely through
>>> application specific logic (topK etc) you can cut the shuffle space...Also
>>> most likely the brute force approach will be a benchmark tool to see how
>>> better is your clustering based KNN solution since there are several ways
>>> you can find approximate nearest neighbors for your application
>>> (KMeans/KDTree/LSH etc)...
>>>
>>> There is a variant that I will bring as a PR for this JIRA and we will
>>> of course look into how to improve it further...the idea is to think about
>>> distributed matrix multiply where both matrices A and B are distributed and
>>> master coordinates pulling a partition of A and multiply it with B...
>>>
>>> The idea suffices for kernel matrix generation as well if the number of
>>> rows are modest (~10M or so)...
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4823
>>>
>>>
>>> On Wed, Apr 29, 2015 at 3:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> This is my first thought, please suggest any further improvement:
>>>> 1. Create a rdd of your dataset
>>>> 2. Do an cross join to generate pairs
>>>> 3. Apply reducebykey and compute distance. You will get a rdd with
>>>> keypairs and distance
>>>>
>>>> Best
>>>> Ayan
>>>> On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:
>>>>
>>>>> Dear Sparkers,
>>>>>
>>>>> I am working on an algorithm which requires the pair distance between
>>>>> all points (eg. DBScan

Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent...on
local I never paid attention, but the code path should be similar...
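
A minimal sketch of what I mean by independent simultaneous actions (toy RDD,
global execution context and an ambient SparkContext sc; adapt to your own
thread pool):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd = sc.parallelize(1 to 1000000).cache()

// Each Future submits an independent job on the same SparkContext;
// the scheduler (FIFO or FAIR) interleaves their stages.
val doubled = Future { rdd.map(_ * 2).count() }
val evens   = Future { rdd.filter(_ % 2 == 0).count() }

val results = Await.result(Future.sequence(Seq(doubled, evens)), Duration.Inf)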
On Jan 18, 2016 8:00 AM, "Koert Kuipers"  wrote:

> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom 
> wrote:
>
>> Hi,
>>
>> I am running my app in a single machine first before moving it in the
>> cluster; actually simultaneous actions are not working for me now; is this
>> comming from the fact that I am using a single machine ? yet I am using
>> FAIR scheduler.
>>
>> 2016-01-17 21:23 GMT+01:00 Mark Hamstra :
>>
>>> It can be far more than that (e.g.
>>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>>> either unrecognized or a greatly under-appreciated and underused feature of
>>> Spark.
>>>
>>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers 
>>> wrote:
>>>
 the re-use of shuffle files is always a nice surprise to me

 On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
 wrote:

> Same SparkContext means same pool of Workers.  It's up to the
> Scheduler, not the SparkContext, whether the exact same Workers or
> Executors will be used to calculate simultaneous actions against the same
> RDD.  It is likely that many of the same Workers and Executors will be 
> used
> as the Scheduler tries to preserve data locality, but that is not
> guaranteed.  In fact, what is most likely to happen is that the shared
> Stages and Tasks being calculated for the simultaneous actions will not
> actually be run at exactly the same time, which means that shuffle files
> produced for one action will be reused by the other(s), and repeated
> calculations will be avoided even without explicitly caching/persisting 
> the
> RDD.
>
> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
> wrote:
>
>> Same rdd means same sparkcontext means same workers
>>
>> Cache/persist the rdd to avoid repeated jobs
>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" 
>> wrote:
>>
>>> Hi,
>>>
>>> Thank you all for your answers,
>>>
>>> If I correctly understand, actions (in my case foreach) can be run
>>> concurrently and simultaneously on the SAME rdd, (which is logical 
>>> because
>>> they are read only object). however, I want to know if the same workers 
>>> are
>>> used for the concurrent analysis ?
>>>
>>> Thank you
>>>
>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>>
 I stand corrected. How considerable are the benefits though? Will
 the scheduler be able to dispatch jobs from both actions 
 simultaneously (or
 on a when-workers-become-available basis)?

 On 15 January 2016 at 11:44, Koert Kuipers 
 wrote:

> we run multiple actions on the same (cached) rdd all the time, i
> guess in different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use
>> them this way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
>> wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions
>> in parallel? The idea behind RDDs is to provide you with an 
>> abstraction for
>> computing parallel operations on distributed data. Even if you were 
>> to call
>> actions from several threads at once, the individual executors of 
>> your
>> spark environment would still have to perform operations 
>> sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single 
>> operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney > > wrote:
>>
>>> Threads
>>>
>>>
>>> El viernes, 15 de enero de 2016, Kira 
>>> escribió:
>>>
 Hi,

 Can we run *simultaneous* actions on the *same RDD* ?; if yes
 how can this
 be done ?

 Thank you,
 Regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com 

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Debasish Das
Yeah, for this you can use the breeze QuadraticMinimizer...that's integrated
with Spark in one of my Spark PRs.

You have a quadratic objective with equality constraints, which is the primal,
and your proximal operator is positivity, which we already support. I have not
provided an API for a linear objective, but that should be simple to add. You
can open an issue in breeze for the enhancement.

Alternatively, you can use the breeze LP solver, which uses the simplex
implementation from Apache Commons Math.
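
If you go the Apache Commons Math simplex route directly, it looks roughly like
this (a sketch with made-up coefficients; your a, b, c vectors and the third
equality constraint would plug in the same way):

import org.apache.commons.math3.optim.linear._
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType
import scala.collection.JavaConverters._

// maximize a.x  subject to  sum(x) = 1,  b.x = 0.4,  x >= 0
val a = Array(3.0, 1.0, 2.0)                      // toy objective coefficients
val objective = new LinearObjectiveFunction(a, 0.0)
val constraints = List(
  new LinearConstraint(Array(1.0, 1.0, 1.0), Relationship.EQ, 1.0),
  new LinearConstraint(Array(0.5, 0.2, 0.3), Relationship.EQ, 0.4)
).asJava

val solution = new SimplexSolver().optimize(
  objective,
  new LinearConstraintSet(constraints),
  GoalType.MAXIMIZE,
  new NonNegativeConstraint(true))

println(solution.getPoint.mkString(", "))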
On Nov 4, 2015 1:05 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com> wrote:

> Hi Debasish Das,
>
> Firstly I must show my deep appreciation towards you kind help.
>
> Yes, my issue is some typical LP related, it is as:
> Objective function:
> f(x1, x2, ..., xn) = a1 * x1 + a2 * x2 + ... + an * xn,   (n would be some
> number bigger than 100)
>
> There are only 4 constraint functions,
> x1 + x2 + ... + xn = 1, 1)
> b1 * x1 + b2 * x2 + ... + bn * xn = b, 2)
> c1 * x1 + c2 * x2 + ... + cn * xn = c, 3)
> x1, x2, ..., xn >= 0 .
>
> To find the solution of x which lets objective function the biggest.
>
> Since simplex method may not be supported by spark. Then I may switch to
> the way as, since the likely solution x must be on the boundary of 1), 2)
> and 3) geometry,
> that is to say, only three xi may be >= 0, all the others must be 0.
> Just look for all that kinds of solutions of 1), 2) and 3), the number
> would be C(n, 3) + C(n, 2) + C(n, 1), at last to select the most optimized
> one.
>
> Since the constraint number is not that large, I think this might be some
> way.
>
> Thank you,
> Zhiliang
>
>
> On Wednesday, November 4, 2015 2:25 AM, Debasish Das <
> debasish.da...@gmail.com> wrote:
>
>
> Spark has nnls in mllib optimization. I have refactored nnls to breeze as
> well but we could not move out nnls from mllib due to some runtime issues
> from breeze.
> Issue in spark or breeze nnls is that it takes dense gram matrix which
> does not scale if rank is high but it has been working fine for nnmf till
> 400 rank.
> I agree with Sean that you need to see if really simplex is needed. Many
> constraints can be formulated as proximal operator and then you can use
> breeze nonlinearminimizer or spark-tfocs package if it is stable.
> On Nov 2, 2015 10:13 AM, "Sean Owen" <so...@cloudera.com> wrote:
>
> I might be steering this a bit off topic: does this need the simplex
> method? this is just an instance of nonnegative least squares. I don't
> think it relates to LDA either.
>
> Spark doesn't have any particular support for NNLS (right?) or simplex
> though.
>
> On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Use breeze simplex which inturn uses apache maths simplex...if you want
> to
> > use interior point method you can use ecos
> > https://github.com/embotech/ecos-java-scala ...spark summit 2014 talk on
> > quadratic solver in matrix factorization will show you example
> integration
> > with spark. ecos runs as jni process in every executor.
> >
> > On Nov 1, 2015 9:52 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com.invalid>
> wrote:
> >>
> >> Hi Ted Yu,
> >>
> >> Thanks very much for your kind reply.
> >> Do you just mean that in spark there is no specific package for simplex
> >> method?
> >>
> >> Then I may try to fix it by myself, do not decide whether it is
> convenient
> >> to finish by spark, before finally fix it.
> >>
> >> Thank you,
> >> Zhiliang
> >>
> >>
> >>
> >>
> >> On Monday, November 2, 2015 1:43 AM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> >>
> >>
> >> A brief search in code base shows the following:
> >>
> >> TODO: Add simplex constraints to allow alpha in (0,1).
> >> ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
> >>
> >> I guess the answer to your question is no.
> >>
> >> FYI
> >>
> >> On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu
> <zchl.j...@yahoo.com.invalid>
> >> wrote:
> >>
> >> Dear All,
> >>
> >> As I am facing some typical linear programming issue, and I know simplex
> >> method is specific in solving LP question,
> >> I am very sorry that whether there is already some mature package in
> spark
> >> about simplex method...
> >>
> >> Thank you very much~
> >> Best Wishes!
> >> Zhiliang
> >>
> >>
> >>
> >>
> >>
> >
>
>
>
>


Re: apply simplex method to fix linear programming in spark

2015-11-03 Thread Debasish Das
Spark has NNLS in mllib optimization. I have refactored NNLS into breeze as
well, but we could not move NNLS out of mllib due to some runtime issues
with breeze.

The issue with the Spark or breeze NNLS is that it takes a dense Gram matrix,
which does not scale if the rank is high, but it has been working fine for NNMF
up to rank 400.

I agree with Sean that you need to see if simplex is really needed. Many
constraints can be formulated as proximal operators, and then you can use the
breeze NonlinearMinimizer or the spark-tfocs package if it is stable.
On Nov 2, 2015 10:13 AM, "Sean Owen" <so...@cloudera.com> wrote:

> I might be steering this a bit off topic: does this need the simplex
> method? this is just an instance of nonnegative least squares. I don't
> think it relates to LDA either.
>
> Spark doesn't have any particular support for NNLS (right?) or simplex
> though.
>
> On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Use breeze simplex which inturn uses apache maths simplex...if you want
> to
> > use interior point method you can use ecos
> > https://github.com/embotech/ecos-java-scala ...spark summit 2014 talk on
> > quadratic solver in matrix factorization will show you example
> integration
> > with spark. ecos runs as jni process in every executor.
> >
> > On Nov 1, 2015 9:52 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com.invalid>
> wrote:
> >>
> >> Hi Ted Yu,
> >>
> >> Thanks very much for your kind reply.
> >> Do you just mean that in spark there is no specific package for simplex
> >> method?
> >>
> >> Then I may try to fix it by myself, do not decide whether it is
> convenient
> >> to finish by spark, before finally fix it.
> >>
> >> Thank you,
> >> Zhiliang
> >>
> >>
> >>
> >>
> >> On Monday, November 2, 2015 1:43 AM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> >>
> >>
> >> A brief search in code base shows the following:
> >>
> >> TODO: Add simplex constraints to allow alpha in (0,1).
> >> ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
> >>
> >> I guess the answer to your question is no.
> >>
> >> FYI
> >>
> >> On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu
> <zchl.j...@yahoo.com.invalid>
> >> wrote:
> >>
> >> Dear All,
> >>
> >> As I am facing some typical linear programming issue, and I know simplex
> >> method is specific in solving LP question,
> >> I am very sorry that whether there is already some mature package in
> spark
> >> about simplex method...
> >>
> >> Thank you very much~
> >> Best Wishes!
> >> Zhiliang
> >>
> >>
> >>
> >>
> >>
> >
>


Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Debasish Das
Use breeze simplex which inturn uses apache maths simplex...if you want to
use interior point method you can use ecos
https://github.com/embotech/ecos-java-scala ...spark summit 2014 talk on
quadratic solver in matrix factorization will show you example integration
with spark. ecos runs as jni process in every executor.
On Nov 1, 2015 9:52 AM, "Zhiliang Zhu"  wrote:

> Hi Ted Yu,
>
> Thanks very much for your kind reply.
> Do you just mean that in spark there is no specific package for simplex
> method?
>
> Then I may try to fix it by myself, do not decide whether it is convenient
> to finish by spark, before finally fix it.
>
> Thank you,
> Zhiliang
>
>
>
>
> On Monday, November 2, 2015 1:43 AM, Ted Yu  wrote:
>
>
> A brief search in code base shows the following:
>
> TODO: Add simplex constraints to allow alpha in (0,1).
> ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
>
> I guess the answer to your question is no.
>
> FYI
>
> On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu 
> wrote:
>
> Dear All,
>
> As I am facing some typical linear programming issue, and I know simplex
> method is specific in solving LP question,
> I am very sorry that whether there is already some mature package in spark
> about simplex method...
>
> Thank you very much~
> Best Wishes!
> Zhiliang
>
>
>
>
>
>


Re: Running 2 spark application in parallel

2015-10-23 Thread Debasish Das
You can run 2 threads in the driver and Spark will FIFO-schedule the 2 jobs on
the same SparkContext you created (executors and cores)...the same idea is
used for the Spark SQL Thrift server flow...

For streaming, I think it lets you run only one stream at a time even if you
run them on multiple threads in the driver...have to double check...
On Oct 22, 2015 11:41 AM, "Simon Elliston Ball" 
wrote:

> If yarn has capacity to run both simultaneously it will. You should ensure
> you are not allocating too many executors for the first app and leave some
> space for the second)
>
> You may want to run the application on different yarn queues to control
> resource allocation. If you run as a different user within the same queue
> you should also get an even split between the applications, however you may
> need to enable preemption to ensure the first doesn't just hog the queue.
>
> Simon
>
> On 22 Oct 2015, at 19:20, Suman Somasundar 
> wrote:
>
> Hi all,
>
>
>
> Is there a way to run 2 spark applications in parallel under Yarn in the
> same cluster?
>
>
>
> Currently, if I submit 2 applications, one of them waits till the other
> one is completed.
>
>
>
> I want both of them to start and run at the same time.
>
>
>
> Thanks,
> Suman.
>
>


Re: Spark ANN

2015-09-07 Thread Debasish Das
Not sure about dropout, but if you change the solver from breeze BFGS to breeze
OWLQN or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1
regularization, which will yield elastic-net-style sparse solutions...using
that you can clean up edges which have 0.0 weight...
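
A tiny illustration of the OWLQN swap on a toy differentiable loss (the
quadratic loss, the dimensions and the constant L1 weight are made up; the ANN
loss and gradient would plug in where the toy function is):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

// Toy smooth loss ||w - t||^2 with its gradient; OWLQN adds the L1 penalty itself.
val t = DenseVector(3.0, 0.1, -0.2, 2.0, 0.05)
val loss = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val d = w - t
    (d dot d, d * 2.0)
  }
}

// maxIter = 100, L-BFGS memory m = 10, constant L1 weight 0.5 per coordinate, tolerance 1e-6
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, (_: Int) => 0.5, 1e-6)
val wSparse = owlqn.minimize(loss, DenseVector.zeros[Double](5))  // small coordinates are driven to exactly 0.0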
On Sep 7, 2015 7:35 PM, "Feynman Liang"  wrote:

> BTW thanks for pointing out the typos, I've included them in my MLP
> cleanup PR 
>
> On Mon, Sep 7, 2015 at 7:34 PM, Feynman Liang 
> wrote:
>
>> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is
>> on the roadmap for 1.6
>>  though, and there is
>> a spark package
>>  for
>> dropout regularized logistic regression.
>>
>>
>> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov 
>> wrote:
>>
>>> Thanks!
>>>
>>> It does not look Spark ANN yet supports dropout/dropconnect or any other
>>> techniques that help avoiding overfitting?
>>> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>>> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>>>
>>> ps. There is a small copy-paste typo in
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>>> should read B :)
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
>>> wrote:
>>>
 Backprop is used to compute the gradient here
 ,
 which is then optimized by SGD or LBFGS here
 

 On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> Haven't checked the actual code but that doc says "MLPC employes
> backpropagation for learning the model. .."?
>
>
>
> —
> Sent from Mailbox 
>
>
> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov <
> dautkha...@gmail.com> wrote:
>
>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>
>> Implementation seems missing backpropagation?
>> Was there is a good reason to omit BP?
>> What are the drawbacks of a pure feedforward-only ANN?
>>
>> Thanks!
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>
>

>>>
>>
>


Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-28 Thread Debasish Das
That's awesome, Yan. I was considering Phoenix for SQL calls to HBase, since
Cassandra supports CQL but SQL support for HBase was lacking. I will get back
to you as I start using it on our loads.

I am assuming the latencies won't be much different from accessing HBase
through asynchbase (as OpenTSDB does), as that's one more option I am looking into.
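
For the JDBC layer itself, the client side is just the Hive JDBC driver pointed
at the Spark SQL Thrift server (the host, port, credentials and table name
below are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thriftserver-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT rowkey, col1 FROM hbase_backed_table LIMIT 10")  // hypothetical table
while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getString(2)}")
conn.close()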

On Mon, Jul 27, 2015 at 10:12 PM, Yan Zhou.sc yan.zhou...@huawei.com
wrote:

  HBase in this case is no different from any other Spark SQL data
 sources, so yes you should be able to access HBase data through Astro from
 Spark SQL’s JDBC interface.



 Graphically, the access path is as follows:



  Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer -> Astro
  Optimizer -> HBase Scans/Gets -> … -> HBase Region server





 Regards,



 Yan



 *From:* Debasish Das [mailto:debasish.da...@gmail.com]
 *Sent:* Monday, July 27, 2015 10:02 PM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev; user
 *Subject:* RE: Package Release Annoucement: Spark SQL on HBase Astro



 Hi Yan,

 Is it possible to access the hbase table through spark sql jdbc layer ?

 Thanks.
 Deb

 On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Yes, but not all SQL-standard insert variants .



 *From:* Debasish Das [mailto:debasish.da...@gmail.com]
 *Sent:* Wednesday, July 22, 2015 7:36 PM
 *To:* Bing Xiao (Bing)
 *Cc:* user; dev; Yan Zhou.sc
 *Subject:* Re: Package Release Annoucement: Spark SQL on HBase Astro



 Does it also support insert operations ?

 On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcomed to try out or deploy the package, and help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union test and extensive performance tests.
 Please use the “Issues” “Pull Requests” links at this package homepage, if
 you want to report bugs, improvement or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package will be useful for ultra low
 latency* query and analytics of large scale data sets in vertical
 enterprises**.* We will continue to work with the community to develop
 new features and improve code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-27 Thread Debasish Das
Hi Yan,

Is it possible to access the HBase table through the Spark SQL JDBC layer?

Thanks.
Deb
On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

  Yes, but not all SQL-standard insert variants .



 *From:* Debasish Das [mailto:debasish.da...@gmail.com]
 *Sent:* Wednesday, July 22, 2015 7:36 PM
 *To:* Bing Xiao (Bing)
 *Cc:* user; dev; Yan Zhou.sc
 *Subject:* Re: Package Release Annoucement: Spark SQL on HBase Astro



 Does it also support insert operations ?

 On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcomed to try out or deploy the package, and help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union test and extensive performance tests.
 Please use the “Issues” “Pull Requests” links at this package homepage, if
 you want to report bugs, improvement or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package will be useful for ultra low
 latency* query and analytics of large scale data sets in vertical
 enterprises**.* We will continue to work with the community to develop
 new features and improve code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-22 Thread Debasish Das
Does it also support insert operations ?
On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:

  We are happy to announce the availability of the Spark SQL on HBase
 1.0.0 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcomed to try out or deploy the package, and help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union test and extensive performance tests.
 Please use the “Issues” “Pull Requests” links at this package homepage, if
 you want to report bugs, improvement or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package will be useful for ultra low
 latency* query and analytics of large scale data sets in vertical
 enterprises**.* We will continue to work with the community to develop
 new features and improve code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in SparkR that mllib / ml don't have...most of the basic
analysis that you need on a stream can be done through mllib components...
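
For example (a minimal sketch; the input directory, dimensionality, k and the
ambient SparkContext sc are placeholders), mllib's streaming components can
update a model directly on a DStream:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val training = ssc.textFileStream("/data/incoming").map(Vectors.parse)  // one vector per line

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)  // dimension 2, initial center weight 0.0

model.trainOn(training)  // cluster centers are refreshed every batch
ssc.start()
ssc.awaitTermination()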
On Jul 13, 2015 2:35 PM, Feynman Liang fli...@databricks.com wrote:

 Sorry; I think I may have used poor wording. SparkR will let you use R to
 analyze the data, but it has to be loaded into memory using SparkR (see SparkR
 DataSources
 http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html).
 You will still have to write a Java receiver to store the data into some
 tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
 performing the analysis.

 R specific questions such as windowing in R should go to R-help@; you
 won't be able to use window since that is a Spark Streaming method.

 On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon o...@scene53.com wrote:

 You are helping me understanding stuff here a lot.

 I believe I have 3 last questions..

 If is use java receiver to get the data, how should I save it in memory?
 Using store command or other command?

 Once stored, how R can read that data?

 Can I use window command in R? I guess not because it is a streaming
 command, right? Any other way to window the data?

 Sent from IPhone




 On Mon, Jul 13, 2015 at 2:07 PM -0700, Feynman Liang 
 fli...@databricks.com wrote:

  If you use SparkR then you can analyze the data that's currently in
 memory with R; otherwise you will have to write to disk (eg HDFS).

 On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon o...@scene53.com wrote:

 Thanks again.
 What I'm missing is where can I store the data? Can I store it in spark
 memory and then use R to analyze it? Or should I use hdfs? Any other places
 that I can save the data?

 What would you suggest?

 Thanks...

 Sent from IPhone




 On Mon, Jul 13, 2015 at 1:41 PM -0700, Feynman Liang 
 fli...@databricks.com wrote:

  If you don't require true streaming processing and need to use R for
 analysis, SparkR on a custom data source seems to fit your use case.

 On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon o...@scene53.com wrote:

 Hi, thanks for replying!
 I want to do the entire process in stages. Get the data using Java or
 scala because they are the only Langs that supports custom receivers, 
 keep
 the data somewhere, use R to analyze it, keep the results somewhere,
 output the data to different systems.

 I thought that somewhere can be spark memory using rdd or
 dstreams.. But could it be that I need to keep it in hdfs to make the
 entire process in stages?

 Sent from IPhone




 On Mon, Jul 13, 2015 at 12:07 PM -0700, Feynman Liang 
 fli...@databricks.com wrote:

  Hi Oded,

 I'm not sure I completely understand your question, but it sounds
 like you could have the READER receiver produce a DStream which is
 windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT.
 However, streaming in SparkR is not currently supported (SPARK-6803
 https://issues.apache.org/jira/browse/SPARK-6803) so I'm not too
 sure how ANALYZER would fit in.

 Feynman

 On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon o...@scene53.com
 wrote:

 any help / idea will be appreciated :)
 thanks


 Regards,
 Oded Maimon
 Scene53.

 On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon o...@scene53.com
 wrote:

 Hi All,
 we are evaluating spark for real-time analytics. What we are trying
 to do is the following:

    - READER APP - use a custom receiver to get data from rabbitmq
    (written in scala)
    - ANALYZER APP - use a spark R application to read the data
    (windowed), analyze it every minute and save the results inside spark
    - OUTPUT APP - use a spark application (scala/java/python) to
    read the results from R every X minutes and send the data to a few
    external systems

 basically at the end I would like to have the READER COMPONENT as
 an app that always consumes the data and keeps it in spark,
 have as many ANALYZER COMPONENTS as my data scientists want, and
 have one OUTPUT APP that will read the ANALYZER results and send them
 to any relevant system.

 what is the right way to do it?

 Thanks,
 Oded.





 *This email and any files transmitted with it are confidential and
 intended solely for the use of the individual or entity to whom they 
 are
 addressed. Please note that any disclosure, copying or distribution of 
 the
 content of this information is strictly forbidden. If you have received
 this email message in error, please destroy it immediately and notify 
 its
 sender.*







Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from
1000 users to 1 users ?

On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:

 I have almost the same case. I will tell you what I am actually doing; if
 it matches your requirement, then I will be glad to help you.

 1. My database is Aerospike. I get data from it.
 2. I have written a standalone Spark app (it does not run in standalone mode,
 but with a simple java or maven command); I even make its JAR file with a
 simple java command and it runs Spark without the interactive shell.
 3. I write different methods based on different queries in the standalone
 Spark app.
 4. My RESTful API is based on NodeJS: the user sends a request through NodeJS,
 the request passes to my simple Java Maven project, which executes the
 particular Spark method based on the query and returns the result to NodeJS
 in JSON format.
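
For illustration, a minimal Scala sketch of the Spark side of the flow described above; the
object name, the JSON path and the SQL are hypothetical, and the REST layer (NodeJS or
otherwise) simply calls a method like this and forwards the returned JSON strings:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object QueryService {
  // One long-lived context shared by every request coming in from the REST layer.
  private val sc = new SparkContext(new SparkConf().setAppName("query-service"))
  private val sqlContext = new SQLContext(sc)

  // Hypothetical method for one query type; swap the JSON path for whatever
  // DataFrame source backs your database.
  def topCustomers(limit: Int): Array[String] = {
    val df = sqlContext.read.json("hdfs:///data/customers.json")
    df.registerTempTable("customers")
    sqlContext.sql(s"SELECT id, spend FROM customers ORDER BY spend DESC LIMIT $limit")
      .toJSON.collect()   // JSON strings handed back to the caller
  }
}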



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-application-with-a-RESTful-API-tp23654p23831.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
If you take bitmap indices out of sybase then I am guessing spark sql will
be at par with sybase ?

On that note are there plans of integrating indexed rdd ideas to spark sql
to build indices ? Is there a JIRA tracking it ?
On Jun 30, 2015 7:29 PM, Eric Pederson eric...@gmail.com wrote:

 Hi Debasish:

 We have the same dataset running on SybaseIQ and after the caches are warm
 the queries come back in about 300ms.  We're looking at options to relieve
 overutilization and to bring down licensing costs.  I realize that Spark
 may not be the best fit for this use case but I'm interested to see how far
 it can be pushed.

 Thanks for your help!


 -- Eric

 On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 I got good runtime improvement from hive partitioning, caching the
 dataset and increasing the cores through repartition...I think for your
 case generating mysql style indexing will help further..it is not supported
 in spark sql yet...

 I know the dataset might be too big for 1 node mysql but do you have a
 runtime estimate from running the same query on mysql with appropriate
 column indexing ? That should give us a good baseline number...

 For my case at least I could not put the data on 1 node mysql as it was
 big...

 If you can write the problem in a document view you can use a document
 store like solr/elasticsearch to boost runtime...the reverse indices can get
 you subsecond latencies...again the schema design matters for that and you
 might have to let go some of sql expressiveness (like balance in a
 predefined bucket might be fine but looking for the exact number might be
 slow)





Re: Subsecond queries possible?

2015-06-30 Thread Debasish Das
I got good runtime improvement from hive partitioning, caching the dataset
and increasing the cores through repartition...I think for your case
generating mysql style indexing will help further..it is not supported in
spark sql yet...

I know the dataset might be too big for 1 node mysql but do you have a
runtime estimate from running the same query on mysql with appropriate
column indexing ? That should give us a good baseline number...

For my case at least I could not put the data on 1 node mysql as it was
big...

If you can write the problem in a document view you can use a document
store like solr/elasticsearch to boost runtime...the reverse indices can get
you subsecond latencies...again the schema design matters for that and you
might have to let go some of sql expressiveness (like balance in a
predefined bucket might be fine but looking for the exact number might be
slow)
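
A rough sketch of those three knobs in Spark SQL (table, column and partition names are
made up for illustration, and sc is an existing SparkContext):

val hive = new org.apache.spark.sql.hive.HiveContext(sc)

// Hive partition pruning: filtering on the partition column limits the files scanned.
val df = hive.sql("SELECT * FROM accounts WHERE dt = '2015-06-30'")

// Repartition to spread work over more cores, then cache so repeated queries skip the scan.
val hot = df.repartition(200).cache()
hot.registerTempTable("accounts_hot")

hive.sql("SELECT balance_bucket, count(*) FROM accounts_hot GROUP BY balance_bucket").show()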


Re: Velox Model Server

2015-06-24 Thread Debasish Das
Model sizes are 10m x rank, 100k x rank range.

For recommendation/topic modeling I can run batch recommendAll and then
keep serving the model using a distributed cache but then I can't
incorporate per user model re-predict if user feedback is making the
current topk stale. I have to wait for next batch refresh which might be 1
hr away.

spark job server + spark sql can get me fresh updates but each time running
a predict might be slow.

I am guessing the better idea might be to start with batch recommendAll and
then update the per-user model if it gets stale, but that needs access to the
key value store and the model over an API like spark job server. I am
running experiments with job server. In general it will be nice if my key
value store and model are both managed by the same akka based API.

Yes sparksql is to filter/boost recommendation results using business logic
like user demography for example..
On Jun 23, 2015 2:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes, and typically needs are 100ms. Now imagine even 10 concurrent
 requests. My experience has been that this approach won't nearly
 scale. The best you could probably do is async mini-batch
 near-real-time scoring, pushing results to some store for retrieval,
 which could be entirely suitable for your use case.

 On Tue, Jun 23, 2015 at 8:52 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  If your recommendation needs are real-time (1s) I am not sure job server
  and computing the refs with spark will do the trick (though those new
  BLAS-based methods may have given sufficient speed up).



Re: Velox Model Server

2015-06-24 Thread Debasish Das
Thanks Nick, Sean for the great suggestions...

Since you guys have already hit these issues before I think it will be
great if we can add the learning to Spark Job Server and enhance it for
community.

Nick, do you see any major issues in using Spray over Scalatra ?

Looks like the Model Server API layer needs access to a performant KV store
(Redis/Memcached), Elasticsearch (we used Solr before for item-item serving,
but I liked the Spark-Elasticsearch integration: the REST layer is Netty based
unlike Solr's Jetty, and the YARN client looks more stable, so it is worthwhile
to see if it improves over Solr based serving) and ML Models (which are moving
towards the Spark SQL style in 1.3/1.4 with the introduction of the Pipeline API).

An initial version of KV store might be simple LRU cache.

For KV store are there any comparisons available with IndexedRDD and
Redis/Memcached ?

Velox is using the CoreOS EtcdClient (which is Go based) but I am not sure if
it is used as a full-fledged distributed cache or not. Maybe it is being
used as a ZooKeeper alternative.


On Wed, Jun 24, 2015 at 2:02 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Ok

 My view is with only 100k items, you are better off serving in-memory
 for item vectors, i.e. store all item vectors in memory, and compute user
 * item score on-demand. In most applications only a small proportion of
 users are active, so really you don't need all 10m user vectors in memory.
 They could be looked up from a K-V store and have an LRU cache in memory
 for say 1m of those. Optionally also update them as feedback comes in.
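
Not Spark code, just a sketch of the serving-side idea above: hold the ~100k item factors
as plain in-memory arrays and score one (possibly LRU-cached) user vector on demand; brute
force is cheap at that item count, and all names here are illustrative:

case class Scored(itemId: Int, score: Double)

class InMemoryScorer(itemIds: Array[Int], itemFactors: Array[Array[Float]]) {
  def recommend(userFactor: Array[Float], k: Int): Seq[Scored] = {
    val scored = new Array[Scored](itemIds.length)
    var i = 0
    while (i < itemIds.length) {
      var dot = 0.0; var j = 0
      while (j < userFactor.length) { dot += userFactor(j) * itemFactors(i)(j); j += 1 }
      scored(i) = Scored(itemIds(i), dot)
      i += 1
    }
    scored.sortBy(-_.score).take(k)   // top-k items for this user
  }
}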

 As far as I can see, this is pretty much what velox does except it
 partitions all user vectors across nodes to scale.

 Oryx does almost the same but Oryx1 kept all user and item vectors in
 memory (though I am not sure about whether Oryx2 still stores all user and
 item vectors in memory or partitions in some way).

 Deb, we are using a custom Akka-based model server (with Scalatra
 frontend). It is more focused on many small models in-memory (largest of
 these is around 5m user vectors, 100k item vectors, with factor size
 20-50). We use Akka cluster sharding to allow scale-out across nodes if
 required. We have a few hundred models comfortably powered by m3.xlarge AWS
 instances. Using floats you could probably have all of your factors in
 memory on one 64GB machine (depending on how many models you have).

 Our solution is not that generic and a little hacked-together - but I'd be
 happy to chat offline about sharing what we've done. I think it still has a
 basic client to the Spark JobServer which would allow triggering
 re-computation jobs periodically. We currently just run batch
 re-computation and reload factors from S3 periodically.

 We then use Elasticsearch to post-filter results and blend content-based
 stuff - which I think might be more efficient than SparkSQL for this
 particular purpose.

 On Wed, Jun 24, 2015 at 8:59 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Model sizes are 10m x rank, 100k x rank range.

 For recommendation/topic modeling I can run batch recommendAll and then
 keep serving the model using a distributed cache but then I can't
 incorporate per user model re-predict if user feedback is making the
 current topk stale. I have to wait for next batch refresh which might be 1
 hr away.

 spark job server + spark sql can get me fresh updates but each time
 running a predict might be slow.

 I am guessing the better idea might be to start with batch recommendAll
 and then update the per-user model if it gets stale, but that needs access to
 the key value store and the model over an API like spark job server. I am
 running experiments with job server. In general it will be nice if my key
 value store and model are both managed by the same akka based API.

 Yes sparksql is to filter/boost recommendation results using business
 logic like user demography for example..
 On Jun 23, 2015 2:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes, and typically needs are 100ms. Now imagine even 10 concurrent
 requests. My experience has been that this approach won't nearly
 scale. The best you could probably do is async mini-batch
 near-real-time scoring, pushing results to some store for retrieval,
 which could be entirely suitable for your use case.

 On Tue, Jun 23, 2015 at 8:52 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  If your recommendation needs are real-time (1s) I am not sure job
 server
  and computing the refs with spark will do the trick (though those new
  BLAS-based methods may have given sufficient speed up).





Re: Velox Model Server

2015-06-22 Thread Debasish Das
Models that I am looking for are mostly factorization based models (which
include both recommendation and topic modeling use-cases).
For recommendation models, I need a combination of Spark SQL and ml model
prediction api...I think spark job server is what I am looking for and it
has fast http rest backend through spray which will scale fine through akka.

Out of curiosity why netty?
What model are you serving?
Velox doesn't look like it is optimized for cases like ALS recs, if that's
what you mean. I think scoring ALS at scale in real time takes a fairly
different approach.
The servlet engine probably doesn't matter at all in comparison.

On Sat, Jun 20, 2015, 9:40 PM Debasish Das debasish.da...@gmail.com wrote:

 After getting used to Scala, writing Java is too much work :-)

 I am looking for scala based project that's using netty at its core (spray
 is one example).

 prediction.io is an option but that also looks quite complicated and not
 using all the ML features that got added in 1.3/1.4

 Velox built on top of ML / Keystone ML pipeline API and that's useful but
 it is still using javax servlets which is not netty based.

 On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server
 component at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or
 it uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build
 upon.

 Thanks.
 Deb



 --
 - Charles







Velox Model Server

2015-06-20 Thread Debasish Das
Hi,

The demo of end-to-end ML pipeline including the model server component at
Spark Summit was really cool.

I was wondering if the Model Server component is based upon Velox or it
uses a completely different architecture.

https://github.com/amplab/velox-modelserver

We are looking for an open source version of model server to build upon.

Thanks.
Deb


Re: Velox Model Server

2015-06-20 Thread Debasish Das
Integration of model server with ML pipeline API.

On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto don...@prediction.io wrote:

 Mind if I ask what 1.3/1.4 ML features that you are looking for?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote:

 After getting used to Scala, writing Java is too much work :-)

 I am looking for scala based project that's using netty at its core
 (spray is one example).

 prediction.io is an option but that also looks quite complicated and not
 using all the ML features that got added in 1.3/1.4

 Velox built on top of ML / Keystone ML pipeline API and that's useful but
 it is still using javax servlets which is not netty based.

 On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server
 component at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or
 it uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build
 upon.

 Thanks.
 Deb



 --
 - Charles






 --
 Donald Szeto
 PredictionIO




Re: Velox Model Server

2015-06-20 Thread Debasish Das
After getting used to Scala, writing Java is too much work :-)

I am looking for scala based project that's using netty at its core (spray
is one example).

prediction.io is an option but that also looks quite complicated and not
using all the ML features that got added in 1.3/1.4

Velox built on top of ML / Keystone ML pipeline API and that's useful but
it is still using javax servlets which is not netty based.

On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server component
 at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or it
 uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build upon.

 Thanks.
 Deb



 --
 - Charles






Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also not sure how threading helps here because Spark assigns a partition to
each core. On each core maybe there are multiple threads if you are using
Intel hyperthreading, but I will let Spark handle the threading.

On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:

 We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
 dgemm based calculation.

 On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine. In
 my case, there can be even more products. You simply broadcast blocks of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there are lots of product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is available on 1.4 is a recommendProducts
 method that blockifies the factors to make use of level 3 BLAS (ie
 matrix-matrix multiply). I am not sure if this is available in the Python
 API yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 I actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper caching
 and use of broadcast variables this will take a while depending on the size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation, I have about 30,000
 products and 90 million users.
 When I try to predict all, it fails.
 I have been trying to formulate the problem as a matrix multiplication where
 I first get the product features, broadcast them and then do a dot product.
 It's still very slow. Any reason why?
 here is a sample code

 def doMultiply(x):
     a = []
     #multiply by
     mylen = len(pf.value)
     for i in range(mylen):
         myprod = numpy.dot(x, pf.value[i][1])
         a.append(myprod)
     return a


 myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
 #I need to select which products to broadcast but lets try all
 m1 = myModel.productFeatures().sample(False, 0.001)
 pf = sc.broadcast(m1.collect())
 uf = myModel.userFeatures()
 f1 = uf.map(lambda x : (x[0], doMultiply(x[1])))



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com
 .

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
dgemm based calculation.

On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine. In
 my case, there can be even more products. You simply broadcast blocks of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there are lots of product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is available on 1.4 is a recommendProducts
 method that blockifies the factors to make use of level 3 BLAS (ie
 matrix-matrix multiply). I am not sure if this is available in the Python
 API yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 I actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper caching
 and use of broadcast variables this will take a while depending on the size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation, I have about 30,000
 products and 90 million users.
 When I try to predict all, it fails.
 I have been trying to formulate the problem as a matrix multiplication where
 I first get the product features, broadcast them and then do a dot product.
 It's still very slow. Any reason why?
 here is a sample code

 def doMultiply(x):
     a = []
     #multiply by
     mylen = len(pf.value)
     for i in range(mylen):
         myprod = numpy.dot(x, pf.value[i][1])
         a.append(myprod)
     return a


 myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
 #I need to select which products to broadcast but lets try all
 m1 = myModel.productFeatures().sample(False, 0.001)
 pf = sc.broadcast(m1.collect())
 uf = myModel.userFeatures()
 f1 = uf.map(lambda x : (x[0], doMultiply(x[1])))



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


 --
 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates and may only be used
 solely in performance of work or services for Capital One. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed. If the reader of this message is not the intended
 recipient, you are hereby notified that any review, retransmission,
 dissemination, distribution, copying or other use of, or taking of any
 action in reliance upon this information is strictly prohibited. If you
 have received this communication in error, please contact the sender and
 delete the material from your computer.





 --

 Architect - Big Data
 Ph: +91 99805 99458

 Manthan Systems | *Company of the year - Analytics (2014 Frost and
 Sullivan India ICT)*
 +++





Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also in my experiments, it's much faster to do blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:

https://issues.apache.org/jira/browse/SPARK-4823
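
A hedged sketch of the blocked idea (not the actual PR code): pack the factors into dense
blocks, here simply one block per partition, take the cartesian of the two block RDDs, and
do one dense multiply per block pair instead of per-pair dot products. Each factor array is
assumed to have length rank:

import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// factors: (id, factor) pairs as produced by ALS userFeatures()/productFeatures()
def toBlocks(factors: RDD[(Int, Array[Double])], rank: Int)
  : RDD[(Array[Int], DenseMatrix[Double])] =
  factors.glom().filter(_.nonEmpty).map { part =>
    val ids  = part.map(_._1)
    val data = part.flatMap(_._2)                     // column-major: one factor per column
    (ids, new DenseMatrix(rank, ids.length, data).t)  // ids.length x rank
  }

def blockScores(users: RDD[(Array[Int], DenseMatrix[Double])],
                items: RDD[(Array[Int], DenseMatrix[Double])]) =
  users.cartesian(items).map { case ((uIds, uMat), (iIds, iMat)) =>
    (uIds, iIds, uMat * iMat.t)   // one level-3 multiply per block pair
  }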

On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Also not sure how threading helps here because Spark assigns a partition to
 each core. On each core maybe there are multiple threads if you are using
 Intel hyperthreading, but I will let Spark handle the threading.

 On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
 dgemm based calculation.

 On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine.
 In my case, there can be even more products. You simply broadcast blocks of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there are lots of product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is available on 1.4 is a recommendProducts
 method that blockifies the factors to make use of level 3 BLAS (ie
 matrix-matrix multiply). I am not sure if this is available in the Python
 API yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 I actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper caching
 and use of broadcast variables this will take a while depending on the size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation, I have about 30,000
 products and 90 million users.
 When I try to predict all, it fails.
 I have been trying to formulate the problem as a matrix multiplication where
 I first get the product features, broadcast them and then do a dot product.
 It's still very slow. Any reason why?
 here is a sample code

 def doMultiply(x):
     a = []
     #multiply by
     mylen = len(pf.value)
     for i in range(mylen):
         myprod = numpy.dot(x, pf.value[i][1])
         a.append(myprod)
     return a


 myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
 #I need to select which products to broadcast but lets try all
 m1 = myModel.productFeatures().sample(False, 0.001)
 pf = sc.broadcast(m1.collect())
 uf = myModel.userFeatures()
 f1 = uf.map(lambda x : (x[0], doMultiply(x[1])))



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Does MLLib has attribute importance?

2015-06-18 Thread Debasish Das
Running L1 and picking the non-zero coefficients gives a good estimate of
interesting features as well...
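
A hedged sketch of that idea with MLlib's L1-regularized linear model; the iteration count,
step size, regParam and the 1e-6 cutoff are illustrative, not tuned values:

import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}
import org.apache.spark.rdd.RDD

def selectFeatures(data: RDD[LabeledPoint], regParam: Double = 0.1): Seq[Int] = {
  val model = LassoWithSGD.train(data, 100, 1.0, regParam)  // iterations, step size, L1 reg
  // Keep the features whose coefficients the L1 penalty did not drive to (near) zero.
  model.weights.toArray.zipWithIndex.collect {
    case (w, idx) if math.abs(w) > 1e-6 => idx
  }
}
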
On Jun 17, 2015 4:51 PM, Xiangrui Meng men...@gmail.com wrote:

 We don't have it in MLlib. The closest would be the ChiSqSelector,
 which works for categorical data. -Xiangrui

 On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov dautkha...@gmail.com
 wrote:
  What would be closest equivalent in MLLib to Oracle Data Miner's
 Attribute
  Importance mining function?
 
 
 http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920
 
  Attribute importance is a supervised function that ranks attributes
  according to their significance in predicting a target.
 
 
  Best regards,
  Ruslan Dautkhanov

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Linear Regression with SGD

2015-06-10 Thread Debasish Das
It's always better to use a quasi-Newton solver if the runtime and problem
scale permit, as there are guarantees on optimization...OWL-QN and BFGS are
both quasi-Newton.

Most single-node code bases will run quasi-Newton solves...if you are
using SGD it is better to use adadelta/adagrad or similar tricks...David added
some of them in breeze recently...
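
For reference, a minimal sketch of the 1.4 ml.regression.LinearRegression mentioned in the
reply below, which solves with L-BFGS/OWL-QN rather than plain SGD; the data path is a
placeholder and sc/sqlContext are assumed to exist:

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_linear_regression_data.txt").toDF()

val lr = new LinearRegression()
  .setMaxIter(100)           // typically far fewer iterations than plain SGD needs
  .setRegParam(0.01)
  .setElasticNetParam(0.5)   // 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)

val model = lr.fit(training)
println(s"weights: ${model.weights} intercept: ${model.intercept}")
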
On Jun 9, 2015 7:25 PM, DB Tsai dbt...@dbtsai.com wrote:

 As Robin suggested, you may try the following new implementation.


 https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef

 Thanks.

 Sincerely,

 DB Tsai
 --
 Blog: https://www.dbtsai.com
 PGP Key ID: 0xAF08DF8D
 https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D

 On Tue, Jun 9, 2015 at 3:22 PM, Robin East robin.e...@xense.co.uk wrote:

 Hi Stephen

 How many is a very large number of iterations? SGD is notorious for
 requiring 100s or 1000s of iterations, also you may need to spend some time
 tweaking the step-size. In 1.4 there is an implementation of ElasticNet
 Linear Regression which is supposed to compare favourably with an
 equivalent R implementation.
  On 9 Jun 2015, at 22:05, Stephen Carman scar...@coldlight.com wrote:
 
  Hi User group,
 
  We are using spark Linear Regression with SGD as the optimization
 technique and we are achieving very sub-optimal results.
 
  Can anyone shed some light on why this implementation seems to produce
 such poor results vs our own implementation?
 
  We are using a very small dataset, but we have to use a very large
 number of iterations to achieve similar results to our implementation,
 we’ve tried normalizing the data
  not normalizing the data and tuning every param. Our implementation is
 a closed form solution so we should be guaranteed convergence but the spark
 one is not, which is
  understandable, but why is it so far off?
 
  Has anyone experienced this?
 
  Steve Carman, M.S.
  Artificial Intelligence Engineer
  Coldlight-PTC
  scar...@coldlight.com
  This e-mail is intended solely for the above-mentioned recipient and it
 may contain confidential or privileged information. If you have received it
 in error, please notify us immediately and delete the e-mail. You must not
 copy, distribute, disclose or take any action in reliance on it. In
 addition, the contents of an attachment to this e-mail may contain software
 viruses which could damage your own computer system. While ColdLight
 Solutions, LLC has taken every reasonable precaution to minimize this risk,
 we cannot accept liability for any damage which you sustain as a result of
 software viruses. You should perform your own virus checks before opening
 the attachment.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list? An in-order traversal (or some other traversal) of a
fitted decision tree?
On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:

 Is there an existing way in SparkML to convert a decision tree to a
 decision list?

 On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh r...@databricks.com wrote:

 The closest algorithm to decision lists that we have is decision trees
 https://spark.apache.org/docs/latest/mllib-decision-tree.html

 On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 I have used weka machine learning library for generating a model for my
 training set. I have used the PART algorithm (decision lists) from weka.

 Now, I would like to use spark ML for the PART algo for my training set
 and could not seem to find a parallel. Could anyone point out the
 corresponding algorithm or even if its available in Spark ML?

 Thanks,
 Sateesh






Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Debasish Das
SparkSQL was built to improve upon Hive on Spark runtime further...

On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk 
guoqing0...@yahoo.com.hk wrote:

 Hive on Spark and SparkSQL: which should be better, and what are the key
 characteristics and the advantages and disadvantages of each?

 --
 guoqing0...@yahoo.com.hk



Re: Find KNN in Spark SQL

2015-05-19 Thread Debasish Das
The batch version of this is part of the rowSimilarities JIRA 4823...if your
query points can fit in memory there is a broadcast version which we are
experimenting with internally...we are using brute force KNN right now in
the PR...based on the FLANN paper, LSH did not work well, but before you go to
approximate KNN you have to make sure your topk precision/recall is not
degrading as compared to brute force in your cv flow...

I have not yet extracted knn model but that will use the IndexedRowMatrix
changes that we put in the PR
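
A hedged sketch of the broadcast brute-force flavour (not the PR code): broadcast the query
points, compute partition-local top-k in one pass, and merge with reduceByKey. The squared
Euclidean metric and the sort-based top-k are illustrative:

import org.apache.spark.rdd.RDD

def squaredDist(a: Array[Double], b: Array[Double]): Double = {
  var s = 0.0; var i = 0
  while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
  s
}

// data: (pointId, vector); the query set must be small enough to broadcast.
def bruteForceKnn(data: RDD[(Long, Array[Double])],
                  queries: Array[(Long, Array[Double])],
                  k: Int): RDD[(Long, Array[(Long, Double)])] = {
  val bcQueries = data.sparkContext.broadcast(queries)
  data.mapPartitions { iter =>
    val points = iter.toArray                 // materialize once per partition
    bcQueries.value.iterator.map { case (qId, q) =>
      (qId, points.map { case (pId, p) => (pId, squaredDist(q, p)) }
                  .sortBy(_._2).take(k))      // partition-local top-k
    }
  }.reduceByKey((a, b) => (a ++ b).sortBy(_._2).take(k))  // merge across partitions
}
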
On May 19, 2015 12:58 PM, Xiangrui Meng men...@gmail.com wrote:

 Spark SQL doesn't provide spatial features. Large-scale KNN is usually
 combined with locality-sensitive hashing (LSH). This Spark package may
 be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash.
 -Xiangrui

 On Sat, May 9, 2015 at 9:25 PM, Dong Li lid...@lidong.net.cn wrote:
  Hello experts,
 
  I’m new to Spark, and want to find K nearest neighbors on huge scale
 high-dimension points dataset in very short time.
 
  The scenario is: the dataset contains more than 10 million points, whose
 dimension is 200d. I’m building a web service, to receive one new point at
 each request and return K nearest points inside that dataset, also need to
 ensure the time-cost not very high. I have a cluster with several
 high-memory nodes for this service.
 
  Currently I only have these ideas here:
  1. To create several ball-tree instances in each node when service
 initializing. This is fast, but not perform well at data scaling ability. I
 cannot insert new nodes to the ball-trees unless I restart the services and
 rebuild them.
  2. To use sql based solution. Some database like PostgreSQL and
 SqlServer have features on spatial search. But these database may not
 perform well in big data environment. (Does SparkSQL have Spatial features
 or spatial index?)
 
  Based on your experience, can I achieve this scenario in Spark SQL? Or
 do you know other projects in Spark stack acting well for this?
  Any ideas are appreciated, thanks very much.
 
  Regards,
  Dong
 
 
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
The full cross-join shuffle space might not be needed since most likely through
application-specific logic (topK etc) you can cut the shuffle space...Also
most likely the brute force approach will be a benchmark tool to see how much
better your clustering-based KNN solution is, since there are several ways
you can find approximate nearest neighbors for your application
(KMeans/KDTree/LSH etc)...

There is a variant that I will bring as a PR for this JIRA and we will of
course look into how to improve it further...the idea is to think about
distributed matrix multiply where both matrices A and B are distributed and
the master coordinates pulling a partition of A and multiplying it with B...

The idea suffices for kernel matrix generation as well if the number of
rows is modest (~10M or so)...

https://issues.apache.org/jira/browse/SPARK-4823


On Wed, Apr 29, 2015 at 3:25 PM, ayan guha guha.a...@gmail.com wrote:

 This is my first thought, please suggest any further improvement:
 1. Create an RDD of your dataset
 2. Do a cross join to generate pairs
 3. Apply reduceByKey and compute the distance. You will get an RDD with
 key pairs and distances

 Best
 Ayan
 On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl wrote:

 Dear Sparkers,

 I am working on an algorithm which requires the pairwise distance between all
 points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points will
 produce an n^2 matrix. If the distance measure is symmetrical, this
 can be reduced to (n^2)/2. What would be the most optimal way of computing
 this?

 The paper *Pairwise Element Computation with MapReduce
 https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf* paper
 describes different approaches to optimize this process within a map-reduce
 model. Although I don't believe this is applicable to Spark. How would you
 guys approach this?

 I first thought about broadcasting the original points to all the
 workers, and then compute the distances across the different workers.
 Although this requires all the points to be distributed across all the
 machines. But this feels rather brute-force, what do you guys think.

 I don't expect full solutions, but some pointers would be great. I think
 a good solution can be re-used for many algorithms.

 Kind regards,
 Fokko Driesprong




Re: Benchmaking col vs row similarities

2015-04-10 Thread Debasish Das
I will increase memory for the job...that will also fix it right ?
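
For the YARN config question in the quoted message below: the 45000 ms figure in the warning
is just a Spark setting that can be raised like any other conf (here on the SparkConf, or
equivalently with --conf on spark-submit); exact property names have shifted across 1.x
releases, so treat these as an assumption to verify against your version:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // raises the 45000 ms heartbeat threshold
  .set("spark.network.timeout", "300")                       // umbrella network timeout in seconds (1.3+)
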
On Apr 10, 2015 12:43 PM, Reza Zadeh r...@databricks.com wrote:

 You should pull in this PR: https://github.com/apache/spark/pull/5364
 It should resolve that. It is in master.
 Best,
 Reza

 On Fri, Apr 10, 2015 at 8:32 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 I am benchmarking row vs col similarity flow on 60M x 10M matrices...

 Details are in this JIRA:

 https://issues.apache.org/jira/browse/SPARK-4823

 For testing I am using Netflix data since the structure is very similar:
 50k x 17K near dense similarities..

 Items are 17K and so I did not activate threshold in colSimilarities yet
 (it's at 1e-4)

 Running Spark on YARN with 20 nodes, 4 cores, 16 gb, shuffle threshold 0.6

 I keep getting these from col similarity code from 1.2 branch. Should I
 use Master ?

 15/04/10 11:08:36 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(5, tblpmidn36adv-hdp.tdc.vzwcorp.com, 44410) with no
 recent heart beats: 50315ms exceeds 45000ms

 15/04/10 11:09:12 ERROR ContextCleaner: Error cleaning broadcast 1012

 java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]

 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)

 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)

 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)

 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)

 at scala.concurrent.Await$.result(package.scala:107)

 at
 org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)

 at
 org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)

 at
 org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)

 at
 org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)

 at
 org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)

 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)

 at org.apache.spark.ContextCleaner.org
 $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)

 at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)

 I knew how to increase the 45000 ms threshold to something higher as it is a
 compute-heavy job, but in YARN I am not sure how to set that config..

 But in any case that's a warning and should not affect the job...

 Any idea how to improve the runtime other than increasing threshold to
 1e-2 ? I will do that next

 Was netflix dataset benchmarked for col based similarity flow before ?
 similarity output from this dataset becomes near dense and so it is
 interesting for stress testing...

 Thanks.

 Deb





RDD union

2015-04-09 Thread Debasish Das
Hi,

I have some code that creates ~80 RDDs and then a sc.union is applied to
combine all 80 into one for the next step (to run topByKey for example)...

While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3
hrs (I am validating these numbers)...

Is there any checkpoint based option to further speed up the union ?

Thanks.
Deb


Re: Using DIMSUM with ids

2015-04-07 Thread Debasish Das
I have a version that works well for Netflix data but now I am validating
on internal datasets..this code will work on matrix factors and sparse
matrices that have rows = 100* columns...if columns are much smaller than
rows then the col-based flow works well...basically we need both flows...

I have not thought about random sampling yet but LSH will work well...the metric
is the key here and so every optimization needs to be validated wrt the raw
flow..
On Apr 6, 2015 10:15 AM, Reza Zadeh r...@databricks.com wrote:

 Right now dimsum is meant to be used for tall and skinny matrices, and so
 columnSimilarities() returns similar columns, not rows. We are working on
 adding an efficient row similarity as well, tracked by this JIRA:
 https://issues.apache.org/jira/browse/SPARK-4823
 Reza

 On Mon, Apr 6, 2015 at 6:08 AM, James alcaid1...@gmail.com wrote:

 The example below illustrates how to use the DIMSUM algorithm to
 calculate the similarity between each pair of rows and output row pairs with
 cosine similarity that is not less than a threshold.


 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala


 But what if I hope to hold an Id of each row, which means the input file
 is:

 id1 vector1
 id2 vector2
 id3 vector3
 ...

 And we hope to output

 id1 id2 sim(id1, id2)
 id1 id3 sim(id1, id3)
 ...


 Alcaid





Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
Cool !

You should also consider contributing it back to spark if you are doing
quantile calculations for example...there is also a topByKey api added in
master by @coderxiang...see if you can use that API to make the code
clean
On Apr 3, 2015 5:20 AM, Aung Htet aung@gmail.com wrote:

 Hi Debasish, Charles,

 I solved the problem by using a BPQ like method, based on your
 suggestions. So thanks very much for that!

 My approach was
 1) Count the population of each segment in the RDD by map/reduce so that I
 get the bound number N equivalent to 10% of each segment. This becomes the
 size of the BPQ.
 2) Associate the bounds N to the corresponding records in the first RDD.
 3) Reduce the RDD from step 2 by merging the values in every two rows,
 basically creating a sorted list (Indexed Seq)
 4) If the size of the sorted list is greater than N (the bound) then,
 create a new sorted list by using a priority queue and dequeuing top N
 values.

 In the end, I get a record for each segment with N max values for each
 segment.

 Regards,
 Aung








 On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 In that case you can directly use count-min-sketch from algebird...they
 work fine with Spark aggregateBy but I have found the java BPQ from Spark
 much faster than say the algebird Heap data structure...

 On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden 
 charles.hay...@atigeo.com wrote:

  ​You could also consider using a count-min data structure such as in
 https://github.com/laserson/dsq

 to get approximate quantiles, then use whatever values you want to
 filter the original sequence.
  --
 *From:* Debasish Das debasish.da...@gmail.com
 *Sent:* Thursday, March 26, 2015 9:45 PM
 *To:* Aung Htet
 *Cc:* user
 *Subject:* Re: How to get a top X percent of a distribution represented
 as RDD

  Idea is to use a heap and get topK elements from every
 partition...then use aggregateBy and for combOp do a merge routine from
 mergeSort...basically get 100 items from partition 1, 100 items from
 partition 2, merge them so that you get sorted 200 items and take 100...for
 merge you can use heap as well...Matei had a BPQ inside Spark which we use
 all the time...Passing arrays over wire is better than passing full heap
 objects and merge routine on array should run faster but needs experiment...

 On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do
 not quite understand how you can use aggregateBy to get 10% top K elements.
 Can you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 You can do it in-memory as well...get 10% topK elements from each
 partition and use merge from any sort algorithm like timsort...basically
 aggregateBy

  Your version uses shuffle but this version is 0 shuffle..assuming
 your data set is cached you will be using in-memory allReduce through
 treeAggregate...

  But this is only good for top 10% or bottom 10%...if you need to do
 it for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

  I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores.
 This seems hard to do in Spark RDD.

  A naive algorithm would be -

  1) Sort RDD by segment  score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10%
 cut off out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num = cut off index

 This does not look like a good algorithm. I would really appreciate
 if someone can suggest a better way to implement this in Spark.

  Regards,
 Aung









Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well...get 10% topK elements from each
partition and use merge from any sort algorithm like timsort...basically
aggregateBy

Your version uses shuffle but this version is 0 shuffle..assuming your data
set is cached you will be using in-memory allReduce through treeAggregate...

But this is only good for top 10% or bottom 10%...if you need to do it for
top 30% then may be the shuffle version will work better...

On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

 I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores. This
 seems hard to do in Spark RDD.

 A naive algorithm would be -

 1) Sort RDD by segment  score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10% cut off
 out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num = cut off index

 This does not look like a good algorithm. I would really appreciate if
 someone can suggest a better way to implement this in Spark.

 Regards,
 Aung



Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
Idea is to use a heap and get topK elements from every partition...then use
aggregateBy and for combOp do a merge routine from mergeSort...basically
get 100 items from partition 1, 100 items from partition 2, merge them so
that you get sorted 200 items and take 100...for merge you can use heap as
well...Matei had a BPQ inside Spark which we use all the time...Passing
arrays over wire is better than passing full heap objects and merge routine
on array should run faster but needs experiment...
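
A hedged sketch of the heap + aggregateByKey idea for the per-segment top-N building block
(getting N itself, e.g. 10% of each segment, still needs a separate counting pass such as
countByKey); names and orderings are illustrative:

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Keep the n largest scores per segment in one aggregateByKey pass:
// a small min-heap per partition, merged pairwise in combOp.
def topNPerSegment(data: RDD[(String, Double)], n: Int): RDD[(String, Array[Double])] = {
  def add(q: mutable.PriorityQueue[Double], x: Double) = {
    q.enqueue(x)
    if (q.size > n) q.dequeue()                // evict the smallest, keeping the n largest
    q
  }
  data.aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
    add,                                       // seqOp: fold one score into the heap
    (a, b) => { b.foreach(x => add(a, x)); a } // combOp: merge two partition heaps
  ).mapValues(_.toArray.sorted(Ordering[Double].reverse))
}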

On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do not
 quite understand how you can use aggregateBy to get 10% top K elements. Can
 you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 You can do it in-memory as well...get 10% topK elements from each
 partition and use merge from any sort algorithm like timsort...basically
 aggregateBy

 Your version uses shuffle but this version is 0 shuffle..assuming your
 data set is cached you will be using in-memory allReduce through
 treeAggregate...

 But this is only good for top 10% or bottom 10%...if you need to do it
 for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

 I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores.
 This seems hard to do in Spark RDD.

 A naive algorithm would be -

 1) Sort RDD by segment  score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10% cut
 off out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num = cut off index

 This does not look like a good algorithm. I would really appreciate if
 someone can suggest a better way to implement this in Spark.

 Regards,
 Aung






Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR
https://github.com/apache/spark/pull/3098

Idea here is what Sean said...don't try to reconstruct the whole matrix,
which will be dense, but pick a set of users and calculate topk
recommendations for them using dense level 3 BLAS...we are going to merge
this for 1.4...this is useful in general for cross validating on the prec@k
measure to tune the model...

Right now it uses level 1 blas but the next extension is to use level 3
blas to further make the compute faster...
 On Mar 18, 2015 6:48 AM, Sean Owen so...@cloudera.com wrote:

 I don't think that you need to put the whole joined data set in
 memory. However memory is unlikely to be the limiting factor, it's the
 massive shuffle.

 OK, you really do have a large recommendation problem if you're
 recommending for at least 7M users per day!

 My hunch is that it won't be fast enough to use the simple predict()
 or recommendProducts() method repeatedly. There was a proposal to make
 a recommendAll() method which you could crib
 (https://issues.apache.org/jira/browse/SPARK-3066) but that looks like
 still a work in progress since the point there was to do more work to
 make it possibly scale.

 You may consider writing a bit of custom code to do the scoring. For
 example cache parts of the item-factor matrix in memory on the workers
 and score user feature vectors in bulk against them.

 There's a different school of though which is to try to compute only
 what you need, on the fly, and cache it if you like. That is good in
 that it doesn't waste effort and makes the result fresh, but, of
 course, means creating or consuming some other system to do the
 scoring and getting *that* to run fast.

 You can also look into techniques like LSH for probabilistically
 guessing which tiny subset of all items are worth considering, but
 that's also something that needs building more code.

 I'm sure a couple people could chime in on that here but it's kind of
 a separate topic.

 On Wed, Mar 18, 2015 at 8:04 AM, Aram Mkrtchyan
 aram.mkrtchyan...@gmail.com wrote:
  Thanks much for your reply.
 
  By saying on the fly, you mean caching the trained model, and querying it
  for each user joined with 30M products when needed?
 
  Our question is more about the general approach, what if we have 7M DAU?
  How the companies deal with that using Spark?
 
 
  On Wed, Mar 18, 2015 at 3:39 PM, Sean Owen so...@cloudera.com wrote:
 
  Not just the join, but this means you're trying to compute 600
  trillion dot products. It will never finish fast. Basically: don't do
  this :) You don't in general compute all recommendations for all
  users, but recompute for a small subset of users that were or are
  likely to be active soon. (Or compute on the fly.) Is anything like
  that an option?
 
  On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan
  aram.mkrtchyan...@gmail.com wrote:
   Trying to build recommendation system using Spark MLLib's ALS.
  
   Currently, we're trying to pre-build recommendations for all users on
   daily
   basis. We're using simple implicit feedbacks and ALS.
  
   The problem is, we have 20M users and 30M products, and to call the
 main
   predict() method, we need to have the cartesian join for users and
   products,
   which is too huge, and it may take days to generate only the join. Is
   there
   a way to avoid cartesian join to make the process faster?
  
   Currently we have 8 nodes with 64Gb of RAM, I think it should be
 enough
   for
   the data.
  
   val users: RDD[Int] = ???   // RDD with 20M userIds
   val products: RDD[Int] = ???// RDD with 30M productIds
   val ratings : RDD[Rating] = ??? // RDD with all user-product
   feedbacks
  
   val model = new ALS().setRank(10).setIterations(10)
 .setLambda(0.0001).setImplicitPrefs(true)
 .setAlpha(40).run(ratings)
  
   val usersProducts = users.cartesian(products)
   val recommendations = model.predict(usersProducts)
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column based similarities work well if the number of columns is modest (10K,
100K; we actually scaled it to 1.5M columns, but that really stress tests the
shuffle and you need to tune the shuffle parameters)...You can either use dimsum
sampling or come up with your own threshold based on your application that
you can apply in reduceByKey (you have to change the code to use
combineByKey and add your filters before shuffling the keys to the reducers)...

The other variant that you are mentioning is the row based similarity flow,
which is tracked in the following JIRA, where I am interested in doing no
shuffle but using broadcast and mapPartitions. I will open up the PR soon but
it is compute intensive and I am experimenting with BLAS optimizations...

https://issues.apache.org/jira/browse/SPARK-4823

Your case of 100 x 5 million (or the transpose of it) is for example very
common in matrix factorization, where you have user factors and product factors
which will typically be a 5 million x 100 dense matrix, and you want to compute
user-user and item-item similarities...

You are right that sparsity helps but you can't apply sparsity (for example
pick topK) before doing the dot products...so it is still a compute
intensive operation...
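For illustration only (the real work is in SPARK-4823), a toy sketch of the
broadcast + mapPartitions row similarity flow, assuming the block of rows being
compared against fits in executor memory:

import org.apache.spark.rdd.RDD

def rowSimilarities(
    rows: RDD[(Long, Array[Double])],      // all rows: id -> dense vector
    block: Array[(Long, Array[Double])],   // rows to compare against (small enough to broadcast)
    topK: Int): RDD[(Long, Array[(Long, Double)])] = {
  // pre-compute the norms of the broadcast block once
  val bc = rows.sparkContext.broadcast(block.map { case (id, v) =>
    (id, v, math.sqrt(v.map(x => x * x).sum))
  })
  rows.mapPartitions { iter =>
    val candidates = bc.value
    iter.map { case (id, v) =>
      val norm = math.sqrt(v.map(x => x * x).sum)
      val sims = candidates.map { case (cid, cv, cnorm) =>
        var dot = 0.0; var i = 0
        while (i < v.length) { dot += v(i) * cv(i); i += 1 }
        (cid, dot / (norm * cnorm + 1e-12))   // cosine similarity
      }
      (id, sims.sortBy(-_._2).take(topK))
    }
  }
}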

On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:

 ​Hi Reza
 ​​
 I see that ((int, int), double) pairs are generated for any combination
 that meets the criteria controlled by the threshold. But assuming a simple
 1x10K matrix that means I would need at least 12GB memory per executor for
 the flat map just for these pairs, excluding any other overhead. Is that
 correct? How can we make this scale for even larger n (when m stays small),
 like 100 x 5 million? One is by using higher thresholds. The other is that
 I use a SparseVector to begin with. Are there any other optimizations I can
 take advantage of?

 ​Thanks
 Sab




Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Is the threshold valid only for tall skinny matrices ? Mine is 6 m x 1.5 m
and I made sparsity pattern 100:1.5M..we would like to increase the
sparsity pattern to 1000:1.5M

I am running 1.1 stable and I get random shuffle failures...may be 1.2 sort
shuffle will help..

I read in Reza paper that oversample works only if cols are skinny so I am
not very keen to oversample...
 On Feb 17, 2015 2:01 PM, Xiangrui Meng men...@gmail.com wrote:

 The complexity of DIMSUM is independent of the number of rows but
 still have quadratic dependency on the number of columns. 1.5M columns
 may be too large to use DIMSUM. Try to increase the threshold and see
 whether it helps. -Xiangrui

 On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am running brute force similarity from RowMatrix on a job with 5M x
 1.5M
  sparse matrix with 800M entries. With 200M entries the job run fine but
 with
  800M I am getting exceptions like too many files open and no space left
 on
  device...
 
  Seems like I need more nodes or use dimsum sampling ?
 
  I am running on 10 nodes where ulimit on each node is set at
 65K...Memory is
  not an issue since I can cache the dataset before similarity computation
  starts.
 
  I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both
 the
  jobs failed with FetchFailed msgs.
 
  Thanks.
  Deb



Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Hi Reza,

With 40 nodes and shuffle space managed by YARN over the HDFS usercache, we
could run the similarity job without doing any thresholding...We used hash
based shuffle, and sort based shuffle will hopefully improve it further...Note
that this job was almost 6M x 1.5M

We will go towards 50 M x ~ 3M columns and increase the sparsity
pattern...Dimsum configurations will definitely help over there...

With a baseline run, it will be easier for me to now run dimsum sampling
and compare the results...I will try the configs that you pointed.

Thanks.
Deb

On Wed, Feb 25, 2015 at 3:52 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you try using higher threshold values as I mentioned in an earlier
 email? Use RowMatrix.columnSimilarities(x) where x is some number ? Try
 the following values for x:
 0.1, 0.9, 10, 100

  And yes, the idea is that the matrix is skinny; you are pushing the
  boundary with 1.5M columns, because the output can potentially have 2.25 x
  10^12 entries (1.5M squared), which is a lot.

 Best,
 Reza


 On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Is the threshold valid only for tall skinny matrices ? Mine is 6 m x 1.5
 m and I made sparsity pattern 100:1.5M..we would like to increase the
 sparsity pattern to 1000:1.5M

 I am running 1.1 stable and I get random shuffle failures...may be 1.2
 sort shuffle will help..

 I read in Reza paper that oversample works only if cols are skinny so I
 am not very keen to oversample...
  On Feb 17, 2015 2:01 PM, Xiangrui Meng men...@gmail.com wrote:

 The complexity of DIMSUM is independent of the number of rows but
 still have quadratic dependency on the number of columns. 1.5M columns
 may be too large to use DIMSUM. Try to increase the threshold and see
 whether it helps. -Xiangrui

 On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am running brute force similarity from RowMatrix on a job with 5M x
 1.5M
  sparse matrix with 800M entries. With 200M entries the job run fine
 but with
  800M I am getting exceptions like too many files open and no space
 left on
  device...
 
  Seems like I need more nodes or use dimsum sampling ?
 
  I am running on 10 nodes where ulimit on each node is set at
 65K...Memory is
  not an issue since I can cache the dataset before similarity
 computation
  starts.
 
  I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable.
 Both the
  jobs failed with FetchFailed msgs.
 
  Thanks.
  Deb





Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi Sean,

This is what I intend to do:

are you saying that you know a key should be filtered based on its value
partway through the merge?

I should use combineByKey...

Thanks.
Deb


On Thu, Feb 19, 2015 at 6:31 AM, Sean Owen so...@cloudera.com wrote:

 You have the keys before and after reduceByKey. You want to do
 something based on the key within reduceByKey? it just calls
 combineByKey, so you can use that method for lower-level control over
 the merging.

 Whether it's possible depends I suppose on what you mean to filter on.
 If it's just a property of the key, can't you just filter before
 reduceByKey? If it's a property of the key's value, don't you need to
 wait for the reduction to finish? or are you saying that you know a
 key should be filtered based on its value partway through the merge?

 I suppose you can use combineByKey to create a mergeValue function
 that changes an input type A into some other Option[B]; you output
 None if your criteria is reached, and your combine function returns
 None if either argument is None? it doesn't save 100% of the work but
 it may mean you only shuffle (key,None) for some keys if the map-side
 combine already worked out that the key would be filtered.

 And then after, run a flatMap or something to make Option[B] into B.
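A minimal sketch of that Option trick, assuming non-negative Int values and a key
being filtered once its running sum reaches a threshold; the names here are made
up for illustration:

import org.apache.spark.rdd.RDD

def sumBelowThreshold(rdd: RDD[(String, Int)], threshold: Int): RDD[(String, Int)] = {
  // once the running sum reaches the threshold the combiner collapses to None,
  // so only (key, None) is shuffled for such keys (values assumed non-negative)
  def clip(s: Option[Int]): Option[Int] = s.filter(_ < threshold)
  rdd.combineByKey[Option[Int]](
    (v: Int) => clip(Some(v)),                           // createCombiner
    (acc: Option[Int], v: Int) => clip(acc.map(_ + v)),  // mergeValue: None stays None
    (a: Option[Int], b: Option[Int]) => clip(for (x <- a; y <- b) yield x + y)
  ).flatMapValues(_.toSeq)                               // drop filtered keys, unwrap Some
}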

 On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  Before I send out the keys for network shuffle, in reduceByKey after map
 +
  combine are done, I would like to filter the keys based on some
 threshold...
 
  Is there a way to get the key, value after map+combine stages so that I
 can
  run a filter on the keys ?
 
  Thanks.
  Deb



Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi,

Before I send out the keys for network shuffle, in reduceByKey after map +
combine are done, I would like to filter the keys based on some threshold...

Is there a way to get the key, value after map+combine stages so that I can
run a filter on the keys ?

Thanks.
Deb


Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
I thought combiner comes from reduceByKey and not mapPartitions right...Let
me dig deeper into the APIs

On Thu, Feb 19, 2015 at 8:29 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 I'm not sure what your use case is, but perhaps you could use
 mapPartitions to reduce across the individual partitions and apply your
 filtering. Then you can finish with a reduceByKey.

 On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 Before I send out the keys for network shuffle, in reduceByKey after map
 + combine are done, I would like to filter the keys based on some
 threshold...

 Is there a way to get the key, value after map+combine stages so that I
 can run a filter on the keys ?

 Thanks.
 Deb




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



Re: WARN from Similarity Calculation

2015-02-18 Thread Debasish Das
I am still debugging it, but I believe that if m% of users have unusually large
columns and the RDD partitioner on RowMatrix is the HashPartitioner, then due to
the basic algorithm without sampling, some partitions can end up with an
unusually large number of keys...

If my debugging confirms that, I will add a custom partitioner for RowMatrix (it
will be useful for sparse vectors; for dense vectors it does not matter)...

Of course, on the feature engineering side, we will see if we can cut off the
users with a large number of columns...

On Tue, Feb 17, 2015 at 1:58 PM, Xiangrui Meng men...@gmail.com wrote:

 It may be caused by GC pause. Did you check the GC time in the Spark
 UI? -Xiangrui

 On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am sometimes getting WARN from running Similarity calculation:
 
  15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager
  BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms
  exceeds 45000ms
 
  Do I need to increase the default 45 s to larger values for cases where
 we
  are doing blocked operation or long compute in the mapPartitions ?
 
  Thanks.
  Deb



Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi,

I am running brute force similarity from RowMatrix on a job with 5M x 1.5M
sparse matrix with 800M entries. With 200M entries the job run fine but
with 800M I am getting exceptions like too many files open and no space
left on device...

Seems like I need more nodes or use dimsum sampling ?

I am running on 10 nodes where ulimit on each node is set at 65K...Memory
is not an issue since I can cache the dataset before similarity computation
starts.

I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both the
jobs failed with FetchFailed msgs.

Thanks.
Deb


WARN from Similarity Calculation

2015-02-15 Thread Debasish Das
Hi,

I am sometimes getting WARN from running Similarity calculation:

15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms
exceeds 45000ms

Do I need to increase the default 45 s to larger values for cases where we
are doing blocked operation or long compute in the mapPartitions ?

Thanks.
Deb


Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
I thought more on it...can we provide access to the IndexedRDD through
thriftserver API and let the mapPartitions query the API ? I am not sure if
ThriftServer is as performant as opening up an API using other akka based
frameworks (like play or spray)...

Any pointers will be really helpful...

Neither play nor spray is being used in Spark right now...so it brings new
dependencies, and we already know about the akka conflicts...thriftserver on
the other hand is already integrated for JDBC access


On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Also I wanted to run get() and set() from mapPartitions (from spark
 workers and not master)...

 To be able to do that I think I have to create a separate spark context
 for the cache...

 But I am not sure how SparkContext from job1 can access SparkContext from
 job2 !


 On Tue, Feb 10, 2015 at 3:25 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Thanks...this is what I was looking for...

 It will be great if Ankur can give brief details about it...Basically how
 does it contrast with memcached for example...

 On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com
  wrote:

 You should look at https://github.com/amplab/spark-indexedrdd

 On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Michael,

 I want to cache a RDD and define get() and set() operators on it.
 Basically like memcached. Is it possible to build a memcached like
 distributed cache using Spark SQL ? If not what do you suggest we should
 use for such operations...

 Thanks.
 Deb

 On Fri, Jul 18, 2014 at 1:00 PM, Michael Armbrust 
 mich...@databricks.com wrote:

 You can do insert into.  As with other SQL on HDFS systems there is no
 updating of data.
 On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Is this what you are looking for?


 https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

 According to the doc, it says Operator that acts as a sink for
 queries on RDDs and can be used to store the output inside a directory of
 Parquet files. This operator is similar to Hive's INSERT INTO TABLE
 operation in the sense that one can choose to either overwrite or append 
 to
 a directory. Note that consecutive insertions to the same table must have
 compatible (source) schemas.

 Thanks
 Best Regards


 On Thu, Jul 17, 2014 at 11:42 AM, Hu, Leo leo.h...@sap.com wrote:

  Hi

As for spark 1.0, can we insert and update a table with SPARK
 SQL, and how?



 Thanks

 Best Regard









Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Hi Michael,

I want to cache a RDD and define get() and set() operators on it. Basically
like memcached. Is it possible to build a memcached like distributed cache
using Spark SQL ? If not what do you suggest we should use for such
operations...

Thanks.
Deb

On Fri, Jul 18, 2014 at 1:00 PM, Michael Armbrust mich...@databricks.com
wrote:

 You can do insert into.  As with other SQL on HDFS systems there is no
 updating of data.
 On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

 Is this what you are looking for?


 https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

 According to the doc, it says Operator that acts as a sink for queries
 on RDDs and can be used to store the output inside a directory of Parquet
 files. This operator is similar to Hive's INSERT INTO TABLE operation in
 the sense that one can choose to either overwrite or append to a directory.
 Note that consecutive insertions to the same table must have compatible
 (source) schemas.

 Thanks
 Best Regards


 On Thu, Jul 17, 2014 at 11:42 AM, Hu, Leo leo.h...@sap.com wrote:

  Hi

As for spark 1.0, can we insert and update a table with SPARK SQL,
 and how?



 Thanks

 Best Regard





Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Thanks...this is what I was looking for...

It will be great if Ankur can give brief details about it...Basically how
does it contrast with memcached for example...

On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com
wrote:

 You should look at https://github.com/amplab/spark-indexedrdd

 On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Michael,

 I want to cache a RDD and define get() and set() operators on it.
 Basically like memcached. Is it possible to build a memcached like
 distributed cache using Spark SQL ? If not what do you suggest we should
 use for such operations...

 Thanks.
 Deb

 On Fri, Jul 18, 2014 at 1:00 PM, Michael Armbrust mich...@databricks.com
  wrote:

 You can do insert into.  As with other SQL on HDFS systems there is no
 updating of data.
 On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

 Is this what you are looking for?


 https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

 According to the doc, it says Operator that acts as a sink for
 queries on RDDs and can be used to store the output inside a directory of
 Parquet files. This operator is similar to Hive's INSERT INTO TABLE
 operation in the sense that one can choose to either overwrite or append to
 a directory. Note that consecutive insertions to the same table must have
 compatible (source) schemas.

 Thanks
 Best Regards


 On Thu, Jul 17, 2014 at 11:42 AM, Hu, Leo leo.h...@sap.com wrote:

  Hi

As for spark 1.0, can we insert and update a table with SPARK SQL,
 and how?



 Thanks

 Best Regard







Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Also I wanted to run get() and set() from mapPartitions (from spark workers
and not master)...

To be able to do that I think I have to create a separate spark context for
the cache...

But I am not sure how SparkContext from job1 can access SparkContext from
job2 !


On Tue, Feb 10, 2015 at 3:25 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Thanks...this is what I was looking for...

 It will be great if Ankur can give brief details about it...Basically how
 does it contrast with memcached for example...

 On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com
 wrote:

 You should look at https://github.com/amplab/spark-indexedrdd

 On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Michael,

 I want to cache a RDD and define get() and set() operators on it.
 Basically like memcached. Is it possible to build a memcached like
 distributed cache using Spark SQL ? If not what do you suggest we should
 use for such operations...

 Thanks.
 Deb

 On Fri, Jul 18, 2014 at 1:00 PM, Michael Armbrust 
 mich...@databricks.com wrote:

 You can do insert into.  As with other SQL on HDFS systems there is no
 updating of data.
 On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Is this what you are looking for?


 https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

 According to the doc, it says Operator that acts as a sink for
 queries on RDDs and can be used to store the output inside a directory of
 Parquet files. This operator is similar to Hive's INSERT INTO TABLE
 operation in the sense that one can choose to either overwrite or append 
 to
 a directory. Note that consecutive insertions to the same table must have
 compatible (source) schemas.

 Thanks
 Best Regards


 On Thu, Jul 17, 2014 at 11:42 AM, Hu, Leo leo.h...@sap.com wrote:

  Hi

As for spark 1.0, can we insert and update a table with SPARK SQL,
 and how?



 Thanks

 Best Regard








Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Debasish Das
Hi Dib,

For our usecase I want my spark job1 to read from hdfs/cache and write to
kafka queues. Similarly spark job2 should read from kafka queues and write
to kafka queues.

Is writing to kafka queues from spark job supported in your code ?

Thanks
Deb
 On Jan 15, 2015 11:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

 There was a simple example
 https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45
 which you can run after changing few lines of configurations.

 Thanks
 Best Regards

 On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya 
 dibyendu.bhattach...@gmail.com wrote:

 Hi Kidong,

 Just now I tested the Low Level Consumer with Spark 1.2 and I did not see
 any issue with Receiver.Store method . It is able to fetch messages form
 Kafka.

 Can you cross check other configurations in your setup like Kafka broker
 IP , topic name, zk host details, consumer id etc.

 Dib

 On Fri, Jan 16, 2015 at 11:50 AM, Dibyendu Bhattacharya 
 dibyendu.bhattach...@gmail.com wrote:

 Hi Kidong,

 No , I have not tried yet with Spark 1.2 yet. I will try this out and
 let you know how this goes.

 By the way, is there any change in Receiver Store method happened in
 Spark 1.2 ?



 Regards,
 Dibyendu



 On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote:

 Hi Dibyendu,

 I am using kafka 0.8.1.1 and spark 1.2.0.
 After modifying these version of your pom, I have rebuilt your codes.
 But I have not got any messages from ssc.receiverStream(new
 KafkaReceiver(_props, i)).

 I have found, in your codes, all the messages are retrieved correctly,
 but
 _receiver.store(_dataBuffer.iterator())  which is spark streaming
 abstract
 class's method does not seem to work correctly.

 Have you tried running your spark streaming kafka consumer with kafka
 0.8.1.1 and spark 1.2.0 ?

 - Kidong.






 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p21180.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Debasish Das
If you have a tall x skinny matrix of m users and n products, column
similarity will give you an n x n matrix (a product x product matrix)...this
is also called the product correlation matrix...it can be cosine, pearson or
other kinds of correlation...Note that if an entry is unobserved (user
Joanary did not rate the movie Top Gun), column similarities will consider it
as an implicit 0...
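A minimal usage sketch, assuming sc is an existing SparkContext and the ratings
are already assembled into rows of the tall-and-skinny user x product matrix:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(5.0, 0.0, 3.0),   // one user's ratings over 3 products (0 = unobserved)
  Vectors.dense(4.0, 1.0, 0.0),
  Vectors.dense(0.0, 2.0, 4.0)))

val mat = new RowMatrix(rows)
val exact = mat.columnSimilarities()        // brute force product-product cosine similarity
val approx = mat.columnSimilarities(0.1)    // DIMSUM sampling with threshold 0.1
approx.entries.take(5).foreach(println)     // MatrixEntry(i, j, similarity), upper triangle only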

If you want similar users, you want to generate an m x m matrix, and you are
going towards a kernel matrix...The general problem is to take an m x n matrix
that has n features and increase it to m features, where m >> n...cosine for a
linear kernel and RBF for a non-linear kernel...

dimsum/column similarity map-reduce is not optimized for kernel matrix
generation...you need to look into map-reduce kernel matrix
generation...this kernel matrix can then help you answer similar users,
spectral clustering and kernel regression/classification/SVM if you have
labels...

A simplification of the problem is to take your m x n matrix and run
k-Means on it, which will produce clusters of users...now for each user you
can compute the closest users within its cluster...that drops the complexity
from O(m*m) to O(m*c) where c is the max number of users in each cluster...
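A rough sketch of that k-means simplification, assuming each user is an
(id, dense feature array) pair; the helper below is illustrative, not an MLlib API:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

def similarUsersByCluster(
    users: RDD[(Long, Array[Double])], k: Int): RDD[(Long, (Long, Double))] = {
  val vecs = users.mapValues(a => Vectors.dense(a)).cache()
  val model = KMeans.train(vecs.values, k, 20)   // k clusters, 20 iterations
  val byCluster = vecs.map { case (id, v) => (model.predict(v), (id, v)) }
  byCluster.groupByKey().flatMap { case (_, members) =>
    val arr = members.toArray
    if (arr.length < 2) Iterator.empty
    else arr.iterator.map { case (id, v) =>
      // nearest other member of the same cluster, by squared Euclidean distance
      val nearest = arr.iterator.filter(_._1 != id).map { case (oid, ov) =>
        var d = 0.0; var i = 0
        while (i < v.size) { val diff = v(i) - ov(i); d += diff * diff; i += 1 }
        (oid, d)
      }.minBy(_._2)
      (id, nearest)
    }
  }
}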


On Wed, Dec 10, 2014 at 7:39 AM, Sean Owen so...@cloudera.com wrote:

 Well, you're computing similarity of your features then. Whether it is
 meaningful depends a bit on the nature of your features and more on
 the similarity algorithm.

 On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:
  Dear all,
 
  I'm trying to understand what is the correct use case of ColumnSimilarity
  implemented in RowMatrix.
 
  As far as I know, this function computes the similarity of a column of a
  given matrix. The DIMSUM paper says that it's efficient for large m
 (rows)
  and small n (columns). In this case the output will be a n by n matrix.
 
  Now, suppose I want to compute similarity of several users, say m =
  billions. Each users is described by a high dimensional feature vector,
 say
  n = 1. In my dataset, one row represent one user. So in that case
  computing the similarity my matrix is not the same as computing the
  similarity of all users. Then, what does it mean computing the
 similarity of
  the columns of my matrix in this case ?
 
  Best regards,
 
  Jao

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Learning rate or stepsize automation

2014-12-08 Thread Debasish Das
Hi Bui,

Please use BFGS based solvers...For BFGS you don't have to specify step
size since the line search will find sufficient decrease each time...

Regularization you still have to do grid search...it's not possible to
automate that but on master you will find nice ways to automate grid
search...

Thanks.
Deb


On Mon, Dec 8, 2014 at 3:04 PM, Bui, Tri 
tri@verizonwireless.com.invalid wrote:

 Hi,



 Is there any way to auto calculate the optimum learning rate or stepsize
 via MLLIB for SGD ?



 Thx

 tri



Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as post-processing on a product similarity graph...I
call it product similarity, but for each product you build a node which
keeps the distinct users visiting the product, and two product nodes are
connected by an edge if the intersection is > 0...you are assuming that if no
user visits a keyword, no one is going to visit it in the future...this
graph is not for prediction but only keeps user visits...

Anyway, once you have built this graph on graphx, you can do interesting
path based analysis...Pick a product and trace its fanout to see, once
people bought this product, which other products they bought, etc...A
first stab at the analysis is to calculate the product similarities...

You can also generate naturally occurring clusters of products, but then you
are partitioning the graph using spectral or other graph partitioners like
METIS...Even the ad hoc analysis of the product graph will give a lot of useful
insights (hopefully deeper than apriori)...
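Not apriori itself, but a small sketch of the co-occurrence counting that
underlies the product graph described above, assuming each basket is a Seq of
product ids; the pair counts become the graph's edges (and the input to LLR or
similarity scoring later):

import org.apache.spark.rdd.RDD

def pairCounts(baskets: RDD[Seq[String]]): RDD[((String, String), Int)] =
  baskets.flatMap { basket =>
    val items = basket.distinct.sorted           // canonical ordering so (a, b) == (b, a)
    for {
      i <- items.indices
      j <- (i + 1) until items.size
    } yield ((items(i), items(j)), 1)
  }.reduceByKey(_ + _)                           // number of baskets each pair co-occurs in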

On Fri, Dec 5, 2014 at 12:25 PM, Sean Owen so...@cloudera.com wrote:

 I doubt Amazon uses a priori for this, but who knows. Usually you want
 also bought functionality, which is a form of similar-item
 computation. But you don't want to favor items that are simply
 frequently purchased in general.

 You probably want to look at pairs of items that co-occur in purchase
 histories unusually frequently by looking at (log) likelihood ratios,
 which is a straightforward item similarity computation.

 On Fri, Dec 5, 2014 at 11:43 AM, Ashic Mahtab as...@live.com wrote:
  This can definitely be useful. Frequently bought together is something
  amazon does, though surprisingly, you don't get a discount. Perhaps it
 can
  lead to offering (or avoiding!) deals on frequent itemsets.
 
  This is a good resource for frequent itemsets implementations:
  http://infolab.stanford.edu/~ullman/mmds/ch6.pdf
 
  
  From: rpuj...@hortonworks.com
  Date: Fri, 5 Dec 2014 10:31:17 -0600
  Subject: Re: Market Basket Analysis
  To: so...@cloudera.com
  CC: t...@preferred.jp; user@spark.apache.org
 
 
  This is a typical use case people who buy electric razors, also tend to
 buy
  batteries and shaving gel along with it. The goal is to build a model
 which
  will look through POS records and find which product categories have
 higher
  likelihood of appearing together in given a transaction.
 
  What would you recommend?
 
  On Fri, Dec 5, 2014 at 7:21 AM, Sean Owen so...@cloudera.com wrote:
 
  Generally I don't think frequent-item-set algorithms are that useful.
  They're simple and not probabilistic; they don't tell you what sets
  occurred unusually frequently. Usually people ask for frequent item
  set algos when they really mean they want to compute item similarity
  or make recommendations. What's your use case?
 
  On Thu, Dec 4, 2014 at 8:23 PM, Rohit Pujari rpuj...@hortonworks.com
  wrote:
  Sure, I’m looking to perform frequent item set analysis on POS data set.
  Apriori is a classic algorithm used for such tasks. Since Apriori
  implementation is not part of MLLib yet, (see
  https://issues.apache.org/jira/browse/SPARK-4001) What are some other
  options/algorithms I could use to perform a similar task? If there’s no
  spoon to spoon substitute,  spoon to fork will suffice too.
 
  Hopefully this provides some clarification.
 
  Thanks,
  Rohit
 
 
 
  From: Tobias Pfeiffer t...@preferred.jp
  Date: Thursday, December 4, 2014 at 7:20 PM
  To: Rohit Pujari rpuj...@hortonworks.com
  Cc: user@spark.apache.org user@spark.apache.org
  Subject: Re: Market Basket Analysis
 
  Hi,
 
  On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com
  wrote:
 
  I'd like to do market basket analysis using spark, what're my options?
 
 
  To do it or not to do it ;-)
 
  Seriously, could you elaborate a bit on what you want to know?
 
  Tobias
 
 
 
  CONFIDENTIALITY NOTICE
  NOTICE: This message is intended for the use of the individual or entity
  to
  which it is addressed and may contain information that is confidential,
  privileged and exempt from disclosure under applicable law. If the
 reader
  of
  this message is not the intended recipient, you are hereby notified that
  any
  printing, copying, dissemination, distribution, disclosure or forwarding
  of
  this communication is strictly prohibited. If you have received this
  communication in error, please contact the sender immediately and delete
  it
  from your system. Thank You.
 
 
 
 
  --
  Rohit Pujari
  Solutions Engineer, Hortonworks
  rpuj...@hortonworks.com
  716-430-6899
 
  CONFIDENTIALITY NOTICE
  NOTICE: This message is intended for the use of the individual or entity
 to
  which it is addressed and may contain information that is confidential,
  privileged and exempt from disclosure under applicable law. If the
 reader of
  this message is not the intended recipient, you are hereby notified that
 any
  printing, copying, dissemination, distribution, disclosure or 

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Debasish Das
rdd.top collects the result on the master...

If you want top-k per key, run map / mapPartitions and use a bounded
priority queue, then reduceByKey the queues.

I experimented with top-k from algebird and a bounded priority queue wrapped
over java.util.PriorityQueue (the Spark default)...the bounded priority queue
is faster

Code example is here:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3066
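A small sketch of the bounded priority queue approach, assuming (key, score)
records: each map-side combiner keeps at most k scores per key in a
java.util.PriorityQueue (which Spark's own private BoundedPriorityQueue also
wraps), so only k values per key ever cross the network:

import java.util.{PriorityQueue => JPriorityQueue}
import org.apache.spark.rdd.RDD

def topKPerKey(rdd: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] = {
  // keep at most k scores in a min-heap: the smallest kept score is evicted first
  def add(h: JPriorityQueue[Double], x: Double): JPriorityQueue[Double] = {
    if (h.size < k) h.add(x)
    else if (h.peek() < x) { h.poll(); h.add(x) }
    h
  }
  rdd.aggregateByKey(new JPriorityQueue[Double]())(
    add,                                                                        // seqOp
    (h1, h2) => { val it = h2.iterator(); while (it.hasNext) add(h1, it.next()); h1 })
    .mapValues { h =>
      val arr = new Array[Double](h.size)
      var i = 0; val it = h.iterator()
      while (it.hasNext) { arr(i) = it.next(); i += 1 }
      arr.sorted(Ordering[Double].reverse)                                      // descending top-k
    }
}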
On Dec 1, 2014 6:46 AM, Xuefeng Wu ben...@gmail.com wrote:

 Hi, I have a problem, it is easy in Scala code, but I can not take the top
 N from RDD as RDD.


 There are 1 Student Score, ask take top 10 age, and then take top 10
 from each age, the result is 100 records.

 The Scala code is here, but how can I do it in RDD,  *for RDD.take return
 is Array, but other RDD.*

 example Scala code:

 import scala.util.Random

 case class StudentScore(age: Int, num: Int, score: Int, name: Int)

 val scores = for {
   i <- 1 to 1
 } yield {
   StudentScore(Random.nextInt(100), Random.nextInt(100), Random.nextInt(),
 Random.nextInt())
 }


 def takeTop(scores: Seq[StudentScore], byKey: StudentScore => Int): Seq[(Int,
 Seq[StudentScore])] = {
   val groupedScore = scores.groupBy(byKey)
     .map { case (_, _scores) =>
       (_scores.foldLeft(0)((acc, v) => acc + v.score), _scores) }.toSeq
   groupedScore.sortBy(_._1).take(10)
 }

 val topScores = for {
   (_, ageScores) <- takeTop(scores, _.age)
   (_, numScores) <- takeTop(ageScores, _.num)
 } yield {
   numScores
 }

 topScores.size


 --

 ~Yours, Xuefeng Wu/吴雪峰  敬上




Re: Using Breeze in the Scala Shell

2014-11-27 Thread Debasish Das
I have used breeze fine with scala shell:

scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.jar:/Users/v606014/.m2/repository/com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar:/Users/v606014/.m2/repository/org/jblas/jblas/1.2.3/jblas-1.2.3.jar:/Users/v606014/.m2/repository/org/scalanlp/breeze_2.10/0.10/breeze_2.10-0.10.jar:/Users/v606014/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/Users/v606014/.m2/repository/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.jar
org.apache.spark.mllib.optimization.QuadraticMinimizer 100 1 1.0 0.99

For spark-shell my assumption is spark-shell -cp option should work fine

On Thu, Nov 27, 2014 at 9:15 AM, Dean Jones dean.m.jo...@gmail.com wrote:

 Hi,

 I'm trying to use the breeze library in the spark scala shell, but I'm
 running into the same problem documented here:

 http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html

 As I'm using the shell, I don't have a pom.xml, so the solution
 suggested in that thread doesn't work for me. I've tried the
 following:

 - adding commons-math3 using the --jars option
 - adding both breeze and commons-math3 using the --jar option
 - using the spark.executor.extraClassPath option on the cmd line as
 follows: --conf spark.executor.extraClassPath=commons-math3-3.2.jar

 None of these are working for me. Any thoughts on how I can get this
 working?

 Thanks,

 Dean.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner so be careful about the
performance...groupByKey does shuffle even for local groups...

reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can broadcast a key-to-closure map and use
it in reduceByKey, provided you have access to the key inside
reduceByKey/aggregateByKey...

I have not needed to access the key in reduceByKey/aggregateByKey yet,
but there should be a way...
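One way to sketch this, assuming the per-key reduce functions are known up front:
broadcast a key-to-function map and carry the key inside the value so the combiner
can see it; map-side combine still runs, unlike with groupByKey:

import org.apache.spark.rdd.RDD

def reduceWithPerKeyFn(
    rdd: RDD[(String, Double)],
    fns: Map[String, (Double, Double) => Double],       // key -> reduce function
    default: (Double, Double) => Double): RDD[(String, Double)] = {
  val bcFns = rdd.sparkContext.broadcast(fns)
  rdd.map { case (k, v) => (k, (k, v)) }                // carry the key inside the value
    .reduceByKey { case ((k, a), (_, b)) =>
      (k, bcFns.value.getOrElse(k, default)(a, b))      // pick the reduce function by key
    }
    .mapValues(_._2)
}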

On Tue, Nov 18, 2014 at 7:24 AM, Yanbo yanboha...@gmail.com wrote:

 First use groupByKey(), you get a tuple RDD with
 (key:K,value:ArrayBuffer[V]).
 Then use map() on this RDD with a function has different operations
 depending on the key which act as a parameter of this function.


 On Nov 18, 2014, at 8:59 PM, jelgh johannes.e...@gmail.com wrote:
 
  Hello everyone,
 
  I'm new to Spark and I have the following problem:
 
  I have this large JavaRDD<MyClass> collection, which I group by
  creating a hashcode from some fields in MyClass:

  JavaRDD<MyClass> collection = ...;
  JavaPairRDD<Integer, Iterable<MyClass>> grouped =
  collection.groupBy(...); // the group-function just creates a hashcode
  from some fields in MyClass.
 
  Now I want to reduce the variable grouped. However, I want to reduce it
 with
  different functions depending on the key in the JavaPairRDD. So
 basically a
  reduceByKey but with multiple functions.
 
  Only solution I've come up with is by filtering grouped for each reduce
  function and apply it on the filtered  subsets. This feels kinda hackish
  though.
 
  Is there a better way?
 
  Best regards,
  Johannes
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...

Normally these operations are used for dictionary building and so I am
hoping you can cache the dictionary of RDD[String] before you can run
zipWithIndex...

indices run from 0 to maxIndex-1...if you want 1-based indices you have to later
map each index to index + 1
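A minimal sketch, assuming sc is an existing SparkContext:

val words = sc.parallelize(Seq("word1", "word2", "word3")).cache()  // cache before zipWithIndex

val indexed = words.zipWithIndex()              // (word, 0-based index)
  .map { case (w, i) => (i + 1, w) }            // shift to 1-based keys: (1, word1), (2, word2), ...

indexed.collect().foreach(println)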

On Tue, Nov 18, 2014 at 8:56 AM, Blind Faith person.of.b...@gmail.com
wrote:

 As it is difficult to explain this, I will show what I want. Let us say
 I have an RDD A with the following value

 A = [word1, word2, word3]

 I want to have an RDD with the following value

 B = [(1, word1), (2, word2), (3, word3)]

 That is, it gives a unique number to each entry as a key value. Can we do
 such thing with Python or Scala?



Re: Spark on YARN

2014-11-18 Thread Debasish Das
I run my Spark on YARN jobs as:

HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit
--master yarn --jars test-job.jar --executor-cores 4 --num-executors 10
--executor-memory 16g --driver-memory 4g --class TestClass test.jar

It uses HADOOP_CONF_DIR to schedule executors and I get the number I ask
for (assuming other MapReduce jobs are not taking the cluster)...

Large memory intensive jobs like ALS still get issues on YARN but simple
jobs run fine...

Mine is also internal CDH cluster...

On Tue, Nov 18, 2014 at 10:03 AM, Alan Prando a...@scanboo.com.br wrote:

 Hi Folks!

 I'm running Spark on YARN cluster installed with Cloudera Manager Express.
 The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G
 RAM.

 My spark's job is working fine, however it seems that just 2 of 3 slaves
 are working (htop shows 2 slaves working 100% on 32 cores, and 1 slaves
 without any processing).

 I'm using this command:
 ./spark-submit --master yarn --num-executors 3 --executor-cores 32
  --executor-memory 32g feature_extractor.py -r 390

 Additionaly, spark's log testify communications with 2 slaves only:
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-180.ec2.internal:33177/user/Executor#-113177469]
 with ID 1
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-180.ec2.internal to /default
 14/11/18 17:19:38 INFO YarnClientSchedulerBackend: Registered executor:
 Actor[akka.tcp://sparkExecutor@ip-172-31-13-179.ec2.internal:51859/user/Executor#-323896724]
 with ID 2
 14/11/18 17:19:38 INFO RackResolver: Resolved
 ip-172-31-13-179.ec2.internal to /default
 14/11/18 17:19:38 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-180.ec2.internal:50959 with 16.6 GB RAM
 14/11/18 17:19:39 INFO BlockManagerMasterActor: Registering block manager
 ip-172-31-13-179.ec2.internal:53557 with 16.6 GB RAM
 14/11/18 17:19:51 INFO YarnClientSchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after waiting
 maxRegisteredResourcesWaitingTime: 3(ms)

 Is there a configuration to call spark's job on YARN cluster with all
 slaves?

 Thanks in advance! =]

 ---
 Regards
 Alan Vidotti Prando.





Re: flatMap followed by mapPartitions

2014-11-14 Thread Debasish Das
mapPartitions tried to hold the data in memory, which did not work for me..

I am doing flatMap followed by groupByKey now with HashPartitioner and
number of blocks is 60 (Based on 120 cores I am running the job on)...

Now when the shuffle size is < 100 GB it works fine...as the flatMap shuffle
grows to 200 GB, 400 GB...I am getting:

FetchFailed(BlockManagerId(1, istgbd013.verizon.com, 44377, 0),
shuffleId=37, mapId=8, reduceId=54)

I have to shuffle because the memory on cluster is less than the shuffle
size of 400 GB..

The job runs fine if I sample and decrease my shuffle size within 100 GB..

Does groupByKey do a combiner similar to reduceByKey and aggregateByKey?
I need a combiner operation to do some work on the map side after the flatMap,
followed by the rest of the work on the reducers..

On Wed, Nov 12, 2014 at 8:35 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

  flatMap would have to shuffle data only if the output RDD is expected to be
  partitioned by some key.
  RDD[X].flatMap(X => RDD[Y])
  If it has to shuffle, it should be local.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi


 On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 I am doing a flatMap followed by mapPartitions to do some blocked
 operation...flatMap is shuffling data but this shuffle is strictly
 shuffling to disk and not over the network right ?

 Thanks.
 Deb





Kryo serialization in examples.streaming.TwitterAlgebirdCMS/HLL

2014-11-14 Thread Debasish Das
Hi,

If I look inside algebird Monoid implementation it uses
java.io.Serializable...

But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't
see a KryoRegistrator for the CMS and HLL monoids...

In these examples, will CMS and HLL run with Kryo serialization, or will
they be Java serialized?

Thanks.
Deb


flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi,

I am doing a flatMap followed by mapPartitions to do some blocked
operation...flatMap is shuffling data but this shuffle is strictly
shuffling to disk and not over the network right ?

Thanks.
Deb


Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
I reproduced the problem in mllib tests ALSSuite.scala using the following
functions:

val arrayPredict = userProductsRDD.map { case (user, product) =>
  val recommendedProducts = model.recommendProducts(user, products)
  val productScore = recommendedProducts.find { x => x.product == product }
  require(productScore != None)
  productScore.get
}.collect

arrayPredict.foreach { elem =>
  if (allRatings.get(elem.user, elem.product) != elem.rating)
    fail("Prediction APIs don't match")
}

If the usage of model.recommendProducts is correct, the test fails with the
same error I sent before...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
316.0 (TID 79, localhost): scala.MatchError: null

org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)

It is a blocker for me and I am debugging it. I will open up a JIRA if this
is indeed a bug...

Do I have to cache the model to make userFeatures.lookup(user).head work?

On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:

 Was user presented in training? We can put a check there and return
 NaN if the user is not included in the model. -Xiangrui

 On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
 but
  the code fails on userFeatures.lookup(user).head
 
  In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
  called and in all the test-cases that API has been used...
 
  I can perhaps refactor my code to do the same but I was wondering whether
  people test the lookup(user) version of the code..
 
  Do I need to cache the model to make it work ? I think right now default
 is
  STORAGE_AND_DISK...
 
  Thanks.
  Deb



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
model.recommendProducts can only be called from the master then? I have a
set of 20% of the users on whom I am performing the test...those users are in an
RDD...if I have to collect them all on the master node and then call
model.recommendProducts, that's an issue...

Any idea how to optimize this so that we can calculate MAP statistics on
large samples of data ?


On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote:

 ALS model contains RDDs. So you cannot put `model.recommendProducts`
 inside a RDD closure `userProductsRDD.map`. -Xiangrui

 On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  I reproduced the problem in mllib tests ALSSuite.scala using the
 following
  functions:
 
  val arrayPredict = userProductsRDD.map{case(user,product) =
 
   val recommendedProducts = model.recommendProducts(user,
 products)
 
   val productScore = recommendedProducts.find{x=x.product ==
  product}
 
require(productScore != None)
 
productScore.get
 
  }.collect
 
  arrayPredict.foreach { elem =
 
if (allRatings.get(elem.user, elem.product) != elem.rating)
 
fail(Prediction APIs don't match)
 
  }
 
  If the usage of model.recommendProducts is correct, the test fails with
 the
  same error I sent before...
 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0 in
  stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
  316.0 (TID 79, localhost): scala.MatchError: null
 
  org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 
 org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
 
  It is a blocker for me and I am debugging it. I will open up a JIRA if
 this
  is indeed a bug...
 
  Do I have to cache the models to make userFeatures.lookup(user).head to
 work
  ?
 
 
  On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Was user presented in training? We can put a check there and return
  NaN if the user is not included in the model. -Xiangrui
 
  On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
  wrote:
   Hi,
  
   I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
   but
   the code fails on userFeatures.lookup(user).head
  
   In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
   been
   called and in all the test-cases that API has been used...
  
   I can perhaps refactor my code to do the same but I was wondering
   whether
   people test the lookup(user) version of the code..
  
   Do I need to cache the model to make it work ? I think right now
 default
   is
   STORAGE_AND_DISK...
  
   Thanks.
   Deb
 
 



Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi,

I just built the master today and I was testing the IR metrics (MAP and
prec@k) on Movielens data to establish a baseline...

I am getting a weird error which I have not seen before:

MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example
mllib.MovieLensALS --kryo --lambda 0.065
hdfs://localhost:8020/sandbox/movielens/

2014-11-04 11:01:17.350 java[56469:1903] Unable to load realm mapping info
from SCDynamicStore

14/11/04 11:01:17 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

14/11/04 11:01:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
192.168.107.125): java.io.InvalidClassException:
org.apache.spark.examples.mllib.MovieLensALS$Params; no valid constructor

For Params there is no valid constructor...Is this due to the
AbstractParam change?

Thanks.

Deb


Re: Spark LIBLINEAR

2014-10-27 Thread Debasish Das
Hi Professor Lin,

It will be great if you could please review the TRON code in breeze and check
whether it is similar to the original TRON implementation...Breeze is
already integrated in mllib (we are using BFGS, and OWLQN is in the works for
mllib LogisticRegression), so comparing with TRON should be quick...

The code is over here:

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/TruncatedNewtonMinimizer.scala

If there are bugs, fixing them in breeze will be very helpful for people
who are building upon it (like me :-)

Thanks.
Deb

On Sun, Oct 26, 2014 at 6:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote:

 Debasish Das writes:
   If the SVM is not already migrated to BFGS, that's the first thing you
 should
   try...Basically following LBFGS Logistic Regression come up with LBFGS
 based
   linear SVM...
  
   About integrating TRON in mllib, David already has a version of TRON in
 breeze
   but someone needs to validate it for linear SVM and do experiment to
 see if it
   can improve upon LBFGS based linear SVM...Based on lib-linear papers,
 it should
   but I don't expect substantial difference...
  
   I am validating TRON for use-cases related to this PR (but I need more
 features
   on top of TRON):
  
   https://github.com/apache/spark/pull/2705
  

 We are also working on integrating TRON into MLlib, though we haven't checked
 the above work.

 We are also doing some serious comparisons between quasi-Newton and Newton,
 though this will take some time

 Chih-Jen
   On Fri, Oct 24, 2014 at 2:09 PM, k.tham kevins...@gmail.com wrote:
  
   Just wondering, any update on this? Is there a plan to integrate
 CJ's work
   with mllib? I'm asking since  SVM impl in MLLib did not give us good
   results
   and we have to resort to training our svm classifier in a serial
 manner on
   the driver node with liblinear.
  
   Also, it looks like CJ Lin is coming to the bay area in the coming
 weeks
   (http://www.meetup.com/sfmachinelearning/events/208078582/) might
 be a good
   time to connect with him.
  
   --
   View this message in context: http://
   apache-spark-user-list.1001560.n3.nabble.com/
   Spark-LIBLINEAR-tp5546p17236.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
  
  -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  



Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
If the SVM is not already migrated to BFGS, that's the first thing you
should try...Basically, following the LBFGS Logistic Regression implementation,
come up with an LBFGS based linear SVM...
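Assuming such an SVMWithLBFGS class does not exist yet, here is a hedged sketch of
what it could look like: wire MLlib's LBFGS optimizer (the runLBFGS developer API)
to the hinge gradient and L2 updater, the same way LogisticRegressionWithLBFGS
wires up the logistic gradient. Labels are assumed to be 0/1 and no intercept is
fitted.

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{HingeGradient, LBFGS, SquaredL2Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainSvmWithLBFGS(
    training: RDD[LabeledPoint], numFeatures: Int, regParam: Double): Vector = {
  val data = training.map(lp => (lp.label, lp.features)).cache()   // labels assumed to be 0/1
  val (weights, _) = LBFGS.runLBFGS(
    data,
    new HingeGradient(),          // hinge loss instead of the logistic gradient
    new SquaredL2Updater(),       // L2 regularization
    10,                           // numCorrections
    1e-6,                         // convergence tolerance
    100,                          // max iterations
    regParam,
    Vectors.zeros(numFeatures))   // initial weights, no intercept
  weights
}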

About integrating TRON in mllib, David already has a version of TRON in
breeze, but someone needs to validate it for linear SVM and run experiments to
see if it can improve upon the LBFGS based linear SVM...Based on the liblinear
papers it should, but I don't expect a substantial difference...

I am validating TRON for use-cases related to this PR (but I need more
features on top of TRON):

https://github.com/apache/spark/pull/2705


On Fri, Oct 24, 2014 at 2:09 PM, k.tham kevins...@gmail.com wrote:

 Just wondering, any update on this? Is there a plan to integrate CJ's work
 with mllib? I'm asking since  SVM impl in MLLib did not give us good
 results
 and we have to resort to training our svm classifier in a serial manner on
 the driver node with liblinear.

 Also, it looks like CJ Lin is coming to the bay area in the coming weeks
 (http://www.meetup.com/sfmachinelearning/events/208078582/) might be a
 good
 time to connect with him.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17236.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
@dbtsai for the condition number, what did you use? Diagonal preconditioning of
the inverse of the B matrix? But then the B matrix keeps changing...did you
condition it every few iterations?

Will it be possible to put that code in Breeze since it will be very useful
to condition other solvers as well...

On Fri, Oct 24, 2014 at 3:02 PM, DB Tsai dbt...@dbtsai.com wrote:

 We don't have SVMWithLBFGS, but you can check out how we implement
 LogisticRegressionWithLBFGS, and we also deal with some condition
 number improving stuff in LogisticRegressionWithLBFGS which improves
 the performance dramatically.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Oct 24, 2014 at 2:39 PM, k.tham kevins...@gmail.com wrote:
  Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented.
 I'll
  try it out when I have time. Thanks!
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Solving linear equations

2014-10-22 Thread Debasish Das
Hi Martin,

This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;]
and x is what you want to find...B is 0 in this case...For mllib, B is normally
the label...basically create a LabeledPoint where the label is always 0...

Use mllib's linear regression and solve the following problem:

min ||Ax - B||_{2}^{2} + lambda||x||_{2}^{2}

Put a small regularization to condition the problem (~1e-4)...and play with
some options for learning rate in linear regression...

The parameter vector that you get out of mllib linear regression is the
answer to your linear equation solver...
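A minimal sketch of this recipe, assuming sc is an existing SparkContext and
using RidgeRegressionWithSGD for the L2-regularized least squares; note that with
every label at 0 the regularized problem also admits x = 0, so in practice you may
want to pin one variable or keep lambda very small.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}

val equations = Seq(              // coefficient rows of A
  Array(2.0, 1.0, 3.0),
  Array(1.0, 0.0, 3.0),
  Array(0.0, 1.0, 0.0))

val data = sc.parallelize(equations)
  .map(row => LabeledPoint(0.0, Vectors.dense(row)))   // label (B) is 0 for every equation
  .cache()

// train(input, numIterations, stepSize, regParam, miniBatchFraction)
val model = RidgeRegressionWithSGD.train(data, 200, 0.1, 1e-4, 1.0)
println(model.weights)            // the candidate solution x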

Thanks.
Deb



On Wed, Oct 22, 2014 at 4:15 PM, Martin Enzinger martin.enzin...@gmail.com
wrote:

 Hi,

 I'm wondering how to use Mllib for solving equation systems following this
 pattern

  2*x1 + x2 + 3*x3 + ... + xn = 0
  x1 + 0*x2 + 3*x3 + ... + xn = 0
  ..
  ..
  0*x1 + x2 + 0*x3 + ... + xn = 0

 I definitely still have some reading to do to really understand the direct
 solving techniques, but at the current state of knowledge SVD could help
 me with this right?

 Can you point me to an example or a tutorial?

 best regards



Re: Oryx + Spark mllib

2014-10-20 Thread Debasish Das
Thanks for the pointers

I will look into the oryx2 design and see whether we need a spray/akka HTTP
based backend...I feel we will, especially when we have a model database for
a number of scenarios (say 100 scenarios, each building a different ALS model)

I am not sure if we really need a full blown database and so the first
version will simply be parquet files...

On Sun, Oct 19, 2014 at 1:14 PM, Jayant Shekhar jay...@cloudera.com wrote:

 Hi Deb,

 Do check out https://github.com/OryxProject/oryx.

 It does integrate with Spark. Sean has put in quite a bit of neat details
 on the page about the architecture. It has all the things you are thinking
 about:)

 Thanks,
 Jayant


 On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 Is someone working on a project on integrating Oryx model serving layer
 with Spark ? Models will be built using either Streaming data / Batch data
 in HDFS and cross validated with mllib APIs but the model serving layer
 will give API endpoints like Oryx
 and read the models may be from hdfs/impala/SparkSQL

 One of the requirement is that the API layer should be scalable and
 elastic...as requests grow we should be able to add more nodes...using play
 and akka clustering module...

 If there is a ongoing project on github please point to it...

 Is there a plan of adding model serving and experimentation layer to
 mllib ?

 Thanks.
 Deb






Fwd: Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi,

Is someone working on a project on integrating Oryx model serving layer
with Spark ? Models will be built using either Streaming data / Batch data
in HDFS and cross validated with mllib APIs but the model serving layer
will give API endpoints like Oryx
and read the models, possibly from hdfs/impala/SparkSQL

One of the requirements is that the API layer should be scalable and
elastic...as requests grow we should be able to add more nodes...using play
and akka clustering module...

If there is an ongoing project on github please point to it...

Is there a plan of adding a model serving and experimentation layer to mllib?

Thanks.
Deb


Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei !

Congratulations to the databricks team and all the community members...

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: protobuf error running spark on hadoop 2.4

2014-10-08 Thread Debasish Das
I have faced this in the past and I had to add the -Phadoop-2.3 profile...

mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install

On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote:

 Hi:

 I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple
 wordcount example using spark_shell on mesos. When I ran my application, I
  got the following error, which looks related to a mismatch of protobuf versions
  between the hadoop cluster (protobuf 2.5) and spark (protobuf 2.4.1). I ran mvn
  dependency:tree -Dincludes=*protobuf*, and found that akka pulled in the shaded
  protobuf 2.4.1. Has anyone seen this problem before? Thanks.

 Error when running spark on hadoop 2.4.0
  java.lang.VerifyError: class
  org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$AppendRequestProto
  overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;

  mvn dependency:tree -Dincludes=*protobuf*
 ...




  [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-core_2.10 ---
  [INFO] org.apache.spark:spark-core_2.10:jar:1.1.0
  [INFO] \- org.spark-project.akka:akka-remote_2.10:jar:2.2.3-shaded-protobuf:compile
  [INFO]    \- org.spark-project.protobuf:protobuf-java:jar:2.4.1-shaded:compile



Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb: definitely cache the RDD over which you need
to do iterative analysis...

For the rest of them, only cache if you have a lot of free memory!
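
Something like this (a sketch; the input path, parseRecord, step and initialModel are placeholders):

val features = sc.textFile("hdfs:///data/input").map(parseRecord).cache()  // reused every iteration
var model = initialModel
for (i <- 1 to 10) {
  model = step(model, features)   // iterative analysis hits the cached RDD
}
features.unpersist()              // free the memory before the next cached stage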

On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote:

 I think you mean that data2 is a function of data1 in the first
 example. I imagine that the second version is a little bit more
 efficient.

 But it is nothing to do with memory or caching. You don't have to
 cache anything here if you don't want to. You can cache what you like.
 Once memory for the cache fills up, some partitions will be dropped
 from the cache. Obviously, if your cache is full of RDD partitions
 that you don't need, that's wasting space that could be used for
 caching data you need. It's a good idea to unpersist RDDs than no
 longer need to be cached, of course.

 If you don't need intermediate RDD data1, then certainly don't cache
 it, but its existence doesn't do much.

 On Mon, Oct 6, 2014 at 9:56 PM, anny9699 anny9...@gmail.com wrote:
  Hi,
 
  I see that this type of question has been asked before, however still a
  little confused about it in practice. Such as there are two ways I could
  deal with a series of RDD transformation before I do a RDD action, which
 way
  is faster:
 
  Way 1:
  val data = sc.textFile()
   val data1 = data.map(x => f1(x))
   val data2 = data.map(x1 => f2(x1))
  println(data2.count())
 
  Way2:
   val data = sc.textFile()
   val data2 = data.map(x => f2(f1(x)))
  println(data2.count())
 
  Since Spark doesn't materialize RDD transformations, so I assume the two
  ways are equal?
 
  I asked this because the memory of my cluster is very limited and I don't
  want to cache a RDD at the very early stage. Is it true that if I cache a
  RDD early and take the space, then I need to unpersist it before I cache
  another in order to save the memory?
 
  Thanks a lot!
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/lazy-evaluation-of-RDD-transformation-tp15811.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Impala comparisons

2014-10-04 Thread Debasish Das
Hi,

We write the output of models and other information as parquet files and
later we let data APIs run SQL queries on the columnar data...

SparkSQL is used to dump the data in parquet format and now we are
considering whether using SparkSQL or Impala to read it back...

I came across this benchmark and I was not sure if this has been validated
by Spark community / databricks...

http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/

Any inputs will be helpful...

Thanks.
Deb


Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of object which can be generic ?

class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] {
  override def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
  override def addInPlace(s1: Seq[B], s2: Seq[B]): Seq[B] = s1 ++ s2
}
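
You can then pass an instance of it explicitly when creating the accumulator, e.g. (a sketch; rdd is assumed to be an existing RDD[Int]):

val acc = sc.accumulator(Seq.empty[Int])(new GenericAccumulator[Int])
rdd.foreach(x => acc += Seq(x))   // tasks add to the accumulator
println(acc.value)                // read the merged Seq on the driver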


On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com
wrote:

 Just realized that, of course, objects can't be generic, but how do I
 create a generic AccumulatorParam?

 2014-10-01 12:33 GMT+02:00 Johan Stenberg johanstenber...@gmail.com:

 Hi,

 I have a problem with using accumulators in Spark. As seen on the Spark
 website, if you want custom accumulators you can simply extend (with an
 object) the AccumulatorParam trait. The problem is that I need to make that
 object generic, such as this:

 object SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {

 override def zero(initialValue: Seq[B]): Seq[B] = Seq[B]()

 override def addInPlace(s1: Seq[B], s2: Seq[B]): Seq[B] = s1 ++ s2

 }

 But this gives me a compile error because objects can't use generic
 parameters. My situation doesn't really allow me to define a
 SeqAccumulatorParam for each given type since that would lead to a lot of
 ugly code duplication.

 I have an alternative method, just placing all of the results in an RDD
 and then later iterating over them with an accumulator, defined for that
 single type, but this would be much nicer.

 My question is: is there any other way to create accumulators or some
 magic for making generics and objects work?

 Cheers,

 Johan





Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit
formulation...

On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote:

 We don't handle missing value imputation in the current version of
 MLlib. In future releases, we can store feature information in the
 dataset metadata, which may store the default value to replace missing
 values. But no one is committed to work on this feature. For now, you
 can filter out examples containing missing values and use the rest for
 training. -Xiangrui

 On Tue, Sep 30, 2014 at 11:26 AM, Sameer Tilak ssti...@live.com wrote:
  Hi All,
  Can someone please me to the documentation that describes how missing
 value
  imputation is done in MLLib. Also, any information of how this fits in
 the
  overall roadmap will be great.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Debasish Das
If the tree is too big build it on graphxbut it will need thorough
analysis so that the partitions are well balanced...

On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote:

 Hi Boromir,

 Assuming the tree fits in memory, and what you want to do is parallelize
 the computation, the 'obvious' way is the following:

 * broadcast the tree T to each worker (ok since it fits in memory)
 * construct an RDD for the deepest level - each element in the RDD is
 (parent,data_at_node)
  * aggregate this by key (=parent) -> RDD[(parent, data)]
  * map each element (p, data) -> (parent(p), data) using T
 * repeat until you have an RDD of size = 1 (assuming T is connected)

 If T cannot fit in memory, or is very deep, then there are more exotic
 techniques, but hopefully this suffices.
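
  In code, a rough level-by-level sketch (parentOf is a hypothetical lookup on the broadcast tree, and leafLevel is the starting RDD of (parent, data_at_node) pairs):

  val bcTree = sc.broadcast(tree)   // tree fits in memory
  var level = leafLevel             // RDD[(String, Array[Double])] for the deepest level
  while (level.count() > 1) {
    level = level
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })      // sum child arrays per parent
      .map { case (node, data) => (bcTree.value.parentOf(node), data) }  // re-key one level up
  }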

 Andy


 --
 http://www.cs.ox.ac.uk/people/andy.twigg/

 On 30 September 2014 14:12, Boromir Widas vcsub...@gmail.com wrote:

 Hello Folks,

 I have been trying to implement a tree reduction algorithm recently in
 spark but could not find suitable parallel operations. Assuming I have a
 general tree like the following -



 I have to do the following -
 1) Do some computation at each leaf node to get an array of doubles.(This
 can be pre computed)
 2) For each non leaf node, starting with the root node compute the sum of
 these arrays for all child nodes. So to get the array for node B, I need to
 get the array for E, which is the sum of G + H.

  // Start Snippet
  // values is a var so the computed sums can be written back into the node
  case class Node(name: String, children: Array[Node], var values: Array[Double])

  // read in the tree here

  def getSumOfChildren(node: Node): Array[Double] = {
    if (node.children.isEmpty) {
      return node.values
    }
    for (child <- node.children) {
      // can use an accumulator here
      node.values = (node.values, getSumOfChildren(child)).zipped.map(_ + _)
    }
    node.values
  }
  // End Snippet

 Any pointers to how this can be done in parallel to use all cores will be
 greatly appreciated.

 Thanks,
 Boromir.





Re: memory vs data_size

2014-09-30 Thread Debasish Das
Only keep in memory the data on which you want to run the iterative
algorithm...

For map-reduce operations, it's better not to cache if you have a memory
crunch...

Also schedule the persist and unpersist such that you utilize the RAM
well...

On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi,

 By default, 60% of JVM memory is reserved for RDD caching, so in your
 case, 72GB memory is available for RDDs which means that your total data
 may fit in memory. You can check the RDD memory statistics via the storage
 tab in web ui.
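
  That fraction is controlled by spark.storage.memoryFraction; a sketch of lowering it if you need more working memory elsewhere (Spark 1.x):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().set("spark.storage.memoryFraction", "0.4")  // default is 0.6
  val sc = new SparkContext(conf)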

 Hope this helps!
 Liquan



 On Tue, Sep 30, 2014 at 4:11 PM, anny9699 anny9...@gmail.com wrote:

 Hi,

 Is there a guidance about for a data of certain data size, how much total
 memory should be needed to achieve a relatively good speed?

 I have a data of around 200 GB and the current total memory for my 8
 machines are around 120 GB. Is that too small to run the data of this big?
 Even the read in and simple initial processing seems to last forever.

 Thanks a lot!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/memory-vs-data-size-tp15443.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst



Re:

2014-09-24 Thread Debasish Das
HBase regionserver needs to be balanced...you might have some skewness in
row keys and one regionserver is under pressure...try finding that key and
replicating it using a random salt
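
For example, a sketch of a hash-based salt prefix (the bucket count is arbitrary):

val numBuckets = 16                                  // roughly match the number of regions
def saltedKey(key: String): String = {
  val salt = math.abs(key.hashCode) % numBuckets     // spreads a hot key across regions
  f"$salt%02d" + "_" + key
}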

On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi Ted,

 It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase
 (in batch).

 BTW, I found batched Put actually faster than generating HFiles...


 Jianshi

 On Wed, Sep 24, 2014 at 11:49 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$
 anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179)

 Can you reveal what HbaseRDDBatch.scala does ?

 Cheers

 On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 One of my big spark program always get stuck at 99% where a few tasks
 never finishes.

 I debugged it by printing out thread stacktraces, and found there're
 workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup.

 Anyone had similar problem? I'm using Spark 1.1.0 built for HDP2.1. The
 parquet files are generated by pig using latest parquet-pig-bundle
 v1.6.0rc1.

 From Spark 1.1.0's pom.xml, Spark is using parquet v1.4.3, will this be
 problematic?

 One of the weird behavior is that another program read and sort data
 read from the same parquet files and it works fine. The only difference
 seems the buggy program uses foreachPartition and the working program uses
 map.

 Here's the full stacktrace:

 Executor task launch worker-3
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
 at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:257)
 at
 sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
 at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
 at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
 at
 org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
 at
 org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:173)
 at
 org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:138)
 at
 org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
 at
 org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
 at
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
 at
 org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
 at java.io.DataInputStream.readFully(DataInputStream.java:195)
 at java.io.DataInputStream.readFully(DataInputStream.java:169)
 at
 parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)
 at
 parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
 at
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
 at
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
 at
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at
 scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913)
 at
 scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
 at
 scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
 at
 scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179)
 at
 

Re: task getting stuck

2014-09-24 Thread Debasish Das
spark SQL reads parquet file fine...did you follow one of these to
read/write parquet from spark ?

http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

On Wed, Sep 24, 2014 at 9:29 AM, Ted Yu yuzhih...@gmail.com wrote:

 Adding a subject.

 bq.   at parquet.hadoop.ParquetFileReader$
 ConsecutiveChunkList.readAll(ParquetFileReader.java:599)

 Looks like there might be some issue reading the Parquet file.

 Cheers

 On Wed, Sep 24, 2014 at 9:10 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi Ted,

 See my previous reply to Debasish, all region servers are idle. I don't
 think it's caused by hotspotting.

 Besides, only 6 out of 3000 tasks were stuck, and their inputs are about
 only 80MB each.

 Jianshi

 On Wed, Sep 24, 2014 at 11:58 PM, Ted Yu yuzhih...@gmail.com wrote:

 I was thinking along the same line.

 Jianshi:
 See
 http://hbase.apache.org/book.html#d0e6369

 On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 HBase regionserver needs to be balanced...you might have some skewness
 in row keys and one regionserver is under pressure...try finding that key
 and replicate it using a random salt

 On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:

 Hi Ted,

 It converts RDD[Edge] to HBase rowkey and columns and insert them to
 HBase (in batch).

 BTW, I found batched Put actually faster than generating HFiles...


 Jianshi

 On Wed, Sep 24, 2014 at 11:49 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. at com.paypal.risk.rds.dragon.
 storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.
 apply(HbaseRDDBatch.scala:179)

 Can you reveal what HbaseRDDBatch.scala does ?

 Cheers

 On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang 
 jianshi.hu...@gmail.com wrote:

 One of my big spark program always get stuck at 99% where a few
 tasks never finishes.

 I debugged it by printing out thread stacktraces, and found there're
 workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup.

 Anyone had similar problem? I'm using Spark 1.1.0 built for HDP2.1.
 The parquet files are generated by pig using latest parquet-pig-bundle
 v1.6.0rc1.

 From Spark 1.1.0's pom.xml, Spark is using parquet v1.4.3, will this
 be problematic?

 One of the weird behavior is that another program read and sort data
 read from the same parquet files and it works fine. The only difference
 seems the buggy program uses foreachPartition and the working program 
 uses
 map.

 Here's the full stacktrace:

 Executor task launch worker-3
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
 at
 sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:257)
 at
 sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
 at
 sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
 at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
 at
 org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
 at
 org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
 at
 org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:173)
 at
 org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:138)
 at
 org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
 at
 org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
 at
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
 at
 org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
 at
 java.io.DataInputStream.readFully(DataInputStream.java:195)
 at
 java.io.DataInputStream.readFully(DataInputStream.java:169)
 at
 parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)
 at
 parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
 at
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
 at
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
 at
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
 at
 org.apache.spark.InterruptibleIterator.hasNext

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
zipWithUniqueId is also affected...

I had to persist the dictionaries to make use of the indices lower down in
the flow...
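
A sketch of the persisted version (products is a hypothetical RDD[String]):

import org.apache.spark.storage.StorageLevel

val dictionary = products.distinct().zipWithIndex()   // RDD[(String, Long)]
  .persist(StorageLevel.MEMORY_AND_DISK)
dictionary.count()   // materialize once so every later action sees the same indices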

On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen so...@cloudera.com wrote:

 Reference - https://issues.apache.org/jira/browse/SPARK-3098
 I imagine zipWithUniqueID is also affected, but may not happen to have
 exhibited in your test.

 On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Some more debug revealed that as Sean said I have to keep the
 dictionaries
  persisted till I am done with the RDD manipulation.
 
  Thanks Sean for the pointer...would it be possible to point me to the
 JIRA
  as well ?
 
  Are there plans to make it more transparent for the users ?
 
  Is it possible for the DAG to speculate such things...similar to branch
  prediction ideas from comp arch...
 
 
 
  On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das debasish.da...@gmail.com
  wrote:
 
  I changed zipWithIndex to zipWithUniqueId and that seems to be
 working...
 
  What's the difference between zipWithIndex vs zipWithUniqueId ?
 
  For zipWithIndex we don't need to run the count to compute the offset
  which is needed for zipWithUniqueId and so zipWithIndex is efficient ?
 It's
  not very clear from docs...
 
 
  On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com
 
  wrote:
 
  I did not persist / cache it as I assumed zipWithIndex will preserve
  order...
 
  There is also zipWithUniqueId...I am trying that...If that also shows
 the
  same issue, we should make it clear in the docs...
 
  On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com wrote:
 
  From offline question - zipWithIndex is being used to assign IDs.
 From a
  recent JIRA discussion I understand this is not deterministic within a
  partition so the index can be different when the RDD is reevaluated.
 If you
  need it fixed, persist the zipped RDD on disk or in memory.
 
  On Sep 20, 2014 8:10 PM, Debasish Das debasish.da...@gmail.com
  wrote:
 
  Hi,
 
  I am building a dictionary of RDD[(String, Long)] and after the
  dictionary is built and cached, I find key almonds at value 5187
 using:
 
  rdd.filter{case(product, index) => product == "almonds"}.collect
 
  Output:
 
  Debug product almonds index 5187
 
  Now I take the same dictionary and write it out as:
 
  dictionary.map{case(product, index) => product + "," + index}
  .saveAsTextFile(outputPath)
 
  Inside the map I also print what's the product at index 5187 and I
 get
  a different product:
 
  Debug Index 5187 userOrProduct cardigans
 
  Is this an expected behavior from map ?
 
  By the way almonds and apparel-cardigans are just one off in the
  index...
 
  I am using spark-1.1 but it's a snapshot..
 
  Thanks.
  Deb
 
 
 
 
 



Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi,

I am building a dictionary of RDD[(String, Long)] and after the dictionary
is built and cached, I find key almonds at value 5187 using:

rdd.filter{case(product, index) => product == "almonds"}.collect

Output:

Debug product almonds index 5187
Now I take the same dictionary and write it out as:

dictionary.map{case(product, index) => product + "," + index}
.saveAsTextFile(outputPath)

Inside the map I also print what's the product at index 5187 and I get a
different product:

Debug Index 5187 userOrProduct cardigans

Is this an expected behavior from map ?

By the way almonds and apparel-cardigans are just one off in the
index...

I am using spark-1.1 but it's a snapshot..

Thanks.
Deb


Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I did not persist / cache it as I assumed zipWithIndex will preserve
order...

There is also zipWithUniqueId...I am trying that...If that also shows the
same issue, we should make it clear in the docs...

On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com wrote:

 From offline question - zipWithIndex is being used to assign IDs. From a
 recent JIRA discussion I understand this is not deterministic within a
 partition so the index can be different when the RDD is reevaluated. If you
 need it fixed, persist the zipped RDD on disk or in memory.
 On Sep 20, 2014 8:10 PM, Debasish Das debasish.da...@gmail.com wrote:

 Hi,

 I am building a dictionary of RDD[(String, Long)] and after the
 dictionary is built and cached, I find key almonds at value 5187 using:

 rdd.filter{case(product, index) => product == "almonds"}.collect

 Output:

 Debug product almonds index 5187
 Now I take the same dictionary and write it out as:

 dictionary.map{case(product, index) => product + "," + index}
 .saveAsTextFile(outputPath)

 Inside the map I also print what's the product at index 5187 and I get a
 different product:

 Debug Index 5187 userOrProduct cardigans

 Is this an expected behavior from map ?

 By the way almonds and apparel-cardigans are just one off in the
 index...

 I am using spark-1.1 but it's a snapshot..

 Thanks.
 Deb





Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I changed zipWithIndex to zipWithUniqueId and that seems to be working...

What's the difference between zipWithIndex vs zipWithUniqueId ?

For zipWithIndex we don't need to run the count to compute the offset which
is needed for zipWithUniqueId and so zipWithIndex is efficient ? It's not
very clear from docs...


On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com
wrote:

 I did not persist / cache it as I assumed zipWithIndex will preserve
 order...

 There is also zipWithUniqueId...I am trying that...If that also shows the
 same issue, we should make it clear in the docs...

 On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com wrote:

 From offline question - zipWithIndex is being used to assign IDs. From a
 recent JIRA discussion I understand this is not deterministic within a
 partition so the index can be different when the RDD is reevaluated. If you
 need it fixed, persist the zipped RDD on disk or in memory.
 On Sep 20, 2014 8:10 PM, Debasish Das debasish.da...@gmail.com wrote:

 Hi,

 I am building a dictionary of RDD[(String, Long)] and after the
 dictionary is built and cached, I find key almonds at value 5187 using:

 rdd.filter{case(product, index) => product == "almonds"}.collect

 Output:

 Debug product almonds index 5187
 Now I take the same dictionary and write it out as:

 dictionary.map{case(product, index) => product + "," + index}
 .saveAsTextFile(outputPath)

 Inside the map I also print what's the product at index 5187 and I get a
 different product:

 Debug Index 5187 userOrProduct cardigans

 Is this an expected behavior from map ?

 By the way almonds and apparel-cardigans are just one off in the
 index...

 I am using spark-1.1 but it's a snapshot..

 Thanks.
 Deb






Re: MLLib: LIBSVM issue

2014-09-18 Thread Debasish Das
We dump fairly big libsvm files to compare against liblinear/libsvm...the
following code dumps out libsvm format from a SparseVector...

def toLibSvm(features: SparseVector): String = {
  val indices = features.indices.map(_ + 1)
  val values = features.values
  indices.zip(values).mkString(" ").replace(',', ':').replace("(", "").replace(")", "")
}
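
A sketch of dumping an RDD[LabeledPoint] with it (points and the output path are hypothetical):

points.map { p =>
  p.label + " " + toLibSvm(p.features.asInstanceOf[SparseVector])
}.saveAsTextFile("libsvm-out")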



On Wed, Sep 17, 2014 at 9:11 PM, Burak Yavuz bya...@stanford.edu wrote:

 Hi,

 The spacing between the inputs should be a single space, not a tab. I feel
 like your inputs have tabs between them instead of a single space.
 Therefore the parser
 cannot parse the input.

 Best,
 Burak

 - Original Message -
 From: Sameer Tilak ssti...@live.com
 To: user@spark.apache.org
 Sent: Wednesday, September 17, 2014 7:25:10 PM
 Subject: MLLib: LIBSVM issue

 Hi All,We have a fairly large amount of sparse data. I was following the
 following instructions in the manual:
 Sparse dataIt is very common in practice to have sparse training data.
 MLlib supports reading training examples stored in LIBSVM format, which is
 the default format used by LIBSVM and LIBLINEAR. It is a text format in
 which each line represents a labeled sparse feature vector using the
 following format:label index1:value1 index2:value2 ...
 import org.apache.spark.mllib.regression.LabeledPointimport
 org.apache.spark.mllib.util.MLUtilsimport org.apache.spark.rdd.RDD
 val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
 data/mllib/sample_libsvm_data.txt)
 I believe that I have formatted my data as per the required Libsvm format.
 Here is a snippet of that:
 1122:11693:11771:11974:12334:1
 2378:12562:1 1118:11389:11413:1
 1454:11780:12562:15051:15417:1
 5548:15798:15862:1 0150:1214:1
 468:11013:11078:11092:11117:1
 1489:11546:11630:11635:11827:1
 2024:12215:12478:12761:15985:1
 6115:16218:1 0251:15578:1
 However,When I use MLUtils.loadLibSVMFile(sc, path-to-data-file)I get
 the following error messages in mt spark-shell. Can someone please point me
 in right direction.
 java.lang.NumberFormatException: For input string: 150:1214:1
 468:11013:11078:11092:11117:1
 1489:11546:11630:11635:11827:1
 2024:12215:12478:12761:15985:1
 6115:16218:1 at
 sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
  at java.lang.Double.parseDouble(Double.java:540) at
 scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Huge matrix

2014-09-18 Thread Debasish Das
Yup that's what I did for now...
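
Roughly something like this (a sketch, assuming the column norms are available; the Jaccard line only holds for 0/1 columns):

def fromCosine(cosine: Double, normI: Double, normJ: Double): (Double, Double) = {
  val dot = cosine * normI * normJ                            // un-normalized dot product
  val jaccard = dot / (normI * normI + normJ * normJ - dot)   // intersection / union for binary columns
  (dot, jaccard)
}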

On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 I am not templating RowMatrix/CoordinateMatrix since that would be a big
 deviation from the PR. We can add jaccard and other similarity measures in
 later PRs.

 In the meantime, you can un-normalize the cosine similarities to get the
 dot product, and then compute the other similarity measures from the dot
 product.

 Best,
 Reza


 On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Reza,

 In similarColumns, it seems with cosine similarity I also need other
 numbers such as intersection, jaccard and other measures...

 Right now I modified the code to generate jaccard but I had to run it
 twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
 modify RowMatrix and CoordinateMatrix to be templated on the value...

 Are you considering this in your design ?

 Thanks.
 Deb


 On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r...@databricks.com wrote:

 Better to do it in a PR of your own, it's not sufficiently related to
 dimsum

 On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian have
 the same cost, so you can do either one. For dense matrices with say, 1m
 columns this won't be computationally feasible and you'll want to start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or just
 use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the
 one available in LabeledPoint...should I add one ? I want to export the
 matrix out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com
 wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard
 is something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity
 turns it into the usual brute force algorithm for cosine similarity, 
 there
 is no sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is generating
 the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the
 exact same result as doing catesian products of RDD[(product, 
 vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the
 weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with 
 appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is
 will depend on your application and domain, it's worth trying if 
 the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of 
 similar points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am
 considering running a matrix factorization to reduce the dimension 
 to say
 ~60M x 50 and then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where
 rows are ~ 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run
 fine...

 But for tall and wide, what do you suggest ? can dimsum handle
 it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh 
 r...@databricks.com wrote:

 You might want to wait until Wednesday

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ?

For logistic regression you might want both positive and negative features...so just
take the absolute value first and then pick top(k)
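
Or keep the feature index and sort locally (a sketch; fine at this dimensionality, where the weight vector fits on the driver):

val topK = model.weights.toArray.zipWithIndex
  .sortBy { case (w, _) => -math.abs(w) }
  .take(10)
topK.foreach { case (w, idx) => println(s"feature $idx -> weight $w") }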

On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak ssti...@live.com wrote:

 Hi All,

 I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB
 Libsvm file of sparse data) with 6700 features.

 val model = LinearRegressionWithSGD.train(examples, numIterations)

 At the end I get a model that

 model.weights.size
 res6: Int = 6699

 I am assuming each entry in the model is weight for the corresponding
 feature/index.  However,, if I want to get the top10 most important
 features or all features with weights higher than certain threshold, is
 that functionality available out-of-box? I can implement that on my own,
 but seems like a common feature that most of the people will need when they
 are working on high-dimensional dataset.






Re: Huge matrix

2014-09-18 Thread Debasish Das
Hi Reza,

Have you tested if different runs of the algorithm produce different
similarities (basically if the algorithm is deterministic) ?

This number does not look like a Monoid aggregation...iVal * jVal /
(math.min(sg, colMags(i)) * math.min(sg, colMags(j))

I am noticing some weird behavior as different runs are changing the
results...

Also can columnMagnitudes produce non-deterministic results ?

Thanks.

Deb

On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 I am not templating RowMatrix/CoordinateMatrix since that would be a big
 deviation from the PR. We can add jaccard and other similarity measures in
 later PRs.

 In the meantime, you can un-normalize the cosine similarities to get the
 dot product, and then compute the other similarity measures from the dot
 product.

 Best,
 Reza


 On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Reza,

 In similarColumns, it seems with cosine similarity I also need other
 numbers such as intersection, jaccard and other measures...

 Right now I modified the code to generate jaccard but I had to run it
 twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
 modify RowMatrix and CoordinateMatrix to be templated on the value...

 Are you considering this in your design ?

 Thanks.
 Deb


 On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r...@databricks.com wrote:

 Better to do it in a PR of your own, it's not sufficiently related to
 dimsum

 On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian have
 the same cost, so you can do either one. For dense matrices with say, 1m
 columns this won't be computationally feasible and you'll want to start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or just
 use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the
 one available in LabeledPoint...should I add one ? I want to export the
 matrix out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com
 wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard
 is something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity
 turns it into the usual brute force algorithm for cosine similarity, 
 there
 is no sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is generating
 the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the
 exact same result as doing catesian products of RDD[(product, 
 vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the
 weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with 
 appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is
 will depend on your application and domain, it's worth trying if 
 the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of 
 similar points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am
 considering running a matrix factorization to reduce the dimension 
 to say
 ~60M x 50 and then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where
 rows are ~ 60M and columns are 10M say with billion data

Re: Huge matrix

2014-09-18 Thread Debasish Das
I am still a bit confused whether numbers like these can be aggregated as
double:

iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))

It should be aggregated using something like List[iVal*jVal, colMags(i),
colMags(j)]

I am not sure Algebird can aggregate deterministically over Double...







On Thu, Sep 18, 2014 at 2:08 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,
 I am currently seeding the algorithm to be pseudo-random, this is an issue
 being addressed in the PR. If you pull the current version it will be
 deterministic, but not potentially not pseudo-random. The PR will updated
 today.
 Best,
 Reza

 On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Reza,

 Have you tested if different runs of the algorithm produce different
 similarities (basically if the algorithm is deterministic) ?

 This number does not look like a Monoid aggregation...iVal * jVal /
 (math.min(sg, colMags(i)) * math.min(sg, colMags(j))

 I am noticing some weird behavior as different runs are changing the
 results...

 Also can columnMagnitudes produce non-deterministic results ?

 Thanks.

 Deb

 On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 I am not templating RowMatrix/CoordinateMatrix since that would be a big
 deviation from the PR. We can add jaccard and other similarity measures in
 later PRs.

 In the meantime, you can un-normalize the cosine similarities to get the
 dot product, and then compute the other similarity measures from the dot
 product.

 Best,
 Reza


 On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Reza,

 In similarColumns, it seems with cosine similarity I also need other
 numbers such as intersection, jaccard and other measures...

 Right now I modified the code to generate jaccard but I had to run it
 twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
 modify RowMatrix and CoordinateMatrix to be templated on the value...

 Are you considering this in your design ?

 Thanks.
 Deb


 On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r...@databricks.com wrote:

 Better to do it in a PR of your own, it's not sufficiently related to
 dimsum

 On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
  wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian
 have the same cost, so you can do either one. For dense matrices with 
 say,
 1m columns this won't be computationally feasible and you'll want to 
 start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for 
 similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or
 just use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the
 one available in LabeledPoint...should I add one ? I want to export the
 matrix out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com
 wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard
 is something that will be useful) ? I guess it makes sense to add 
 some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity
 turns it into the usual brute force algorithm for cosine 
 similarity, there
 is no sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is
 generating the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce
 the exact same result as doing catesian products of RDD[(product, 
 vector)]
 and computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the
 weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
  wrote:

 For 60M x 10K brute force and dimsum thresholding should be
 fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with 
 appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is
 will depend on your application and domain, it's worth trying

Re: Huge matrix

2014-09-17 Thread Debasish Das
Hi Reza,

In similarColumns, it seems with cosine similarity I also need other
numbers such as intersection, jaccard and other measures...

Right now I modified the code to generate jaccard but I had to run it twice
due to the design of RowMatrix / CoordinateMatrix...I feel we should modify
RowMatrix and CoordinateMatrix to be templated on the value...

Are you considering this in your design ?

Thanks.
Deb


On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r...@databricks.com wrote:

 Better to do it in a PR of your own, it's not sufficiently related to
 dimsum

 On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian have
 the same cost, so you can do either one. For dense matrices with say, 1m
 columns this won't be computationally feasible and you'll want to start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or just
 use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the one
 available in LabeledPoint...should I add one ? I want to export the matrix
 out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard is
 something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity
 turns it into the usual brute force algorithm for cosine similarity, 
 there
 is no sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is generating
 the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the
 exact same result as doing catesian products of RDD[(product, vector)] 
 and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is will
 depend on your application and domain, it's worth trying if the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of similar 
 points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am considering
 running a matrix factorization to reduce the dimension to say ~60M x 
 50 and
 then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where rows
 are ~ 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run
 fine...

 But for tall and wide, what do you suggest ? can dimsum handle
 it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com
 wrote:

 You might want to wait until Wednesday since the interface will
 be changing in that PR before Wednesday, probably over the 
 weekend, so that
 you don't have to redo your code. Your call if you need it before 
 a week.
 Reza


 On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ohh coolall-pairs brute force is also part of this PR ?
 Let me pull it in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh 
 r...@databricks.com wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum
 in this PR
