Re: Welcome two new Apache Spark committers

2023-08-06 Thread Debasish Das
Congratulations, Peter and Xiduo.

On Sun, Aug 6, 2023, 7:05 PM Wenchen Fan  wrote:

> Hi all,
>
> The Spark PMC recently voted to add two new committers. Please join me in
> welcoming them to their new role!
>
> - Peter Toth (Spark SQL)
> - Xiduo You (Spark SQL)
>
> They consistently make contributions to the project and clearly showed
> their expertise. We are very excited to have them join as committers.
>


Re: Offline elastic index creation

2022-11-10 Thread Debasish Das
Hi Vibhor,

We worked on a project to create Lucene indexes using Spark, but the project
has not been maintained for some time now. If there is interest, we can
resurrect it:

https://github.com/vsumanth10/trapezium/blob/master/dal/src/test/scala/com/verizon/bda/trapezium/dal/lucene/LuceneIndexerSuite.scala
https://www.databricks.com/session/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building

After the Lucene indexes were created, we uploaded them to Solr for the search
UI. We did not ingest them into Elasticsearch, though.

Our scale was 100M+ rows and 100K+ columns; Spark + Lucene worked fine.
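
Roughly, the per-partition indexing looked like the sketch below (a minimal
illustration, not the actual trapezium code; a DataFrame df with an "id" column
and a per-partition writable indexDir path are assumptions):

import java.nio.file.Paths
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.FSDirectory
import org.apache.spark.TaskContext

// df with an "id" string column and indexDir (writable on every executor)
// are assumptions for this sketch.
df.select("id").rdd.foreachPartition { rows =>
  val partId = TaskContext.getPartitionId()
  val dir = FSDirectory.open(Paths.get(s"$indexDir/part-$partId"))
  val writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))
  rows.foreach { row =>
    val doc = new Document()
    doc.add(new StringField("id", row.getString(0), Field.Store.YES))
    writer.addDocument(doc)
  }
  writer.close()
}

Each partition then owns its own shard, which can be merged or, as in our setup,
pushed to Solr afterwards.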

Thank you.
Deb


On Wed, Nov 9, 2022, 10:13 AM Vibhor Gupta 
wrote:

> Hi Spark Community,
>
> Is there a way to create elastic indexes offline and then import them to
> an elastic cluster ?
> We are trying to load an elastic index with around 10B documents (~1.5 to
> 2 TB data) using spark daily.
>
> I know elastic provides a snapshot restore functionality through
> GCS/S3/Azure, but is there a way to generate this snapshot offline using
> spark ?
>
> Thanks,
> Vibhor Gupta
>


Re: Welcome Xinrong Meng as a Spark committer

2022-08-10 Thread Debasish Das
Congratulations Xinrong !

On Tue, Aug 9, 2022, 10:00 PM Rui Wang  wrote:

> Congrats Xinrong!
>
>
> -Rui
>
> On Tue, Aug 9, 2022 at 8:57 PM Xingbo Jiang  wrote:
>
>> Congratulations!
>>
>> Yuanjian Li wrote on Tue, Aug 9, 2022 at 20:31:
>>
>>> Congratulations, Xinrong!
>>>
>>> XiDuo You wrote on Tue, Aug 9, 2022 at 19:18:
>>>
 Congratulations!

 Haejoon Lee wrote on Wed, Aug 10, 2022 at 09:30:
 >
 > Congrats, Xinrong!!
 >
 > On Tue, Aug 9, 2022 at 5:12 PM Hyukjin Kwon 
 wrote:
 >>
 >> Hi all,
 >>
 >> The Spark PMC recently added Xinrong Meng as a committer on the
 project. Xinrong is the major contributor of PySpark especially Pandas API
 on Spark. She has guided a lot of new contributors enthusiastically. Please
 join me in welcoming Xinrong!
 >>

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: SIGMOD System Award for Apache Spark

2022-05-15 Thread Debasish Das
Congratulations to the whole Spark community! It's a great achievement.

On Sat, May 14, 2022, 2:49 AM Yikun Jiang  wrote:

> Awesome! Congrats to the whole community!
>
> On Fri, May 13, 2022 at 3:44 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> We recently found out that Apache Spark received the SIGMOD System Award
>> this year, given by SIGMOD (the ACM’s data management research
>> organization) to impactful real-world and research systems. This puts Spark
>> in good company with some very impressive previous recipients. This award
>> is really an achievement by the whole community, so I wanted to say
>> congrats to everyone who contributes to Spark, whether through code, issue
>> reports, docs, or other means.
>>
>> Matei
>>
>


[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2018-12-23 Thread Debasish Das (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728020#comment-16728020
 ] 

Debasish Das commented on SPARK-24374:
--

Hi [~mengxr], with barrier mode available, is it not possible to use the native TF 
parameter server in place of MPI? Although we are offloading compute from Spark to 
TF workers/PS, if an exception comes out, tracking it with the native TF API might 
be easier than an MPI exception...great work by the way...I was looking for a 
cloud-ml alternative using Spark over AWS/Azure/GCP, and it looks like barrier 
should help a lot, although I am still not clear on the limitations of the 
TensorFlowOnSpark project from Yahoo [https://github.com/yahoo/TensorFlowOnSpark], 
which put in barrier-like syntax. I am not sure whether, if a few partitions fail 
on tfrecord read / communication exceptions, it can re-run only the failed 
partitions or has to re-run the full job...I guess the exceptions from a few 
partitions can be thrown back to the Spark driver and the driver can take the 
re-run action...when multiple TF training jobs get scheduled on the same Spark 
cluster, I suspect TFoS might have issues as well... 
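
For reference, a minimal sketch of what a barrier stage looks like with the RDD API 
from this SPIP (Spark 2.4+); it assumes an existing SparkContext sc, and the body 
only reports its peers rather than launching real TF workers/PS:

{code:scala}
import org.apache.spark.BarrierTaskContext

// All tasks of a barrier stage are launched together; barrier() blocks until
// every task reaches it, and a single task failure aborts and retries the stage.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val report = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  val peers = ctx.getTaskInfos().map(_.address)  // addresses of all tasks in the stage
  ctx.barrier()                                  // global synchronization point
  // A real job would hand `peers` to the TF worker / parameter-server launcher here.
  Iterator(s"task ${ctx.partitionId()} saw ${iter.size} records and ${peers.length} peers")
}.collect()
{code}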

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: dremel paper example schema

2018-10-29 Thread Debasish Das
The open source impl of Dremel's storage format is Parquet!

On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta 
wrote:

> Hi,
>
> why not just use dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <
> lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the example from dremel paper
>> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark
>> using
>> pyspark and I wonder if it is possible at all?
>>
>> Trying to follow the paper example as close as possible I created this
>> document type:
>>
>> from pyspark.sql.types import *
>>
>> links_type = StructType([
>> StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>> nullable=False),
>> StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>> nullable=False),
>> ])
>>
>> language_type = StructType([
>> StructField("Code", StringType(), nullable=False),
>> StructField("Country", StringType())
>> ])
>>
>> names_type = StructType([
>> StructField("Language", ArrayType(language_type, containsNull=False)),
>> StructField("Url", StringType()),
>> ])
>>
>> document_type = StructType([
>> StructField("DocId", LongType(), nullable=False),
>> StructField("Links", links_type, nullable=True),
>> StructField("Name", ArrayType(names_type, containsNull=False))
>> ])
>>
>> But when I store data in parquet using this type, the resulting parquet
>> schema is different from the described in the paper:
>>
>> message spark_schema {
>>   required int64 DocId;
>>   optional group Links {
>> required group Backward (LIST) {
>>   repeated group list {
>> required int32 element;
>>   }
>> }
>> required group Forward (LIST) {
>>   repeated group list {
>> required int32 element;
>>   }
>> }
>>   }
>>   optional group Name (LIST) {
>> repeated group list {
>>   required group element {
>> optional group Language (LIST) {
>>   repeated group list {
>> required group element {
>>   required binary Code (UTF8);
>>   optional binary Country (UTF8);
>> }
>>   }
>> }
>> optional binary Url (UTF8);
>>   }
>> }
>>   }
>> }
>>
>> Moreover, if I create a parquet file with schema described in the dremel
>> paper using Apache Parquet Java API and try to read it into Apache Spark,
>> I
>> get an exception:
>>
>> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
>> while reading parquet files. One possible cause: Parquet column cannot be
>> converted in the corresponding files
>>
>> Is it possible to create example schema described in the dremel paper
>> using
>> Apache Spark and what is the correct approach to build this example?
>>
>> Regards,
>> Lubomir Chorbadjiev
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


[jira] [Comment Edited] (BEAM-3737) Key-aware batching function

2018-06-09 Thread Debasish Das (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507150#comment-16507150
 ] 

Debasish Das edited comment on BEAM-3737 at 6/9/18 8:21 PM:


I saw this being mentioned in TFMA 
[https://github.com/tensorflow/model-analysis/blob/master/tensorflow_model_analysis/api/impl/evaluate.py]: 
_AggregateCombineFn...I am not clear why BatchElements() is needed...groupByKey 
takes a combiner which should run on both the map and reduce side...Am I missing 
something here? Is it the case that the Beam Combiner does not run on the map 
side? [~robertwb] is that why you mentioned that we should run the combiner 
upfront in a ParDo and then run groupByKey to achieve map- and reduce-side combine?


was (Author: debasish83):
I saw this being mentioned in TFMA...I am also not clear why BatchElements() 
is needed...groupByKey takes a combiner which should run on both the map and 
reduce side...Am I missing something here? Is it the case that the Beam Combiner 
does not run on the map side? [~robertwb] is that why you mentioned that we should 
run the combiner upfront in a ParDo and then run groupByKey to achieve map- and 
reduce-side combine?

> Key-aware batching function
> ---
>
> Key: BEAM-3737
> URL: https://issues.apache.org/jira/browse/BEAM-3737
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chuan Yu Foo
>Priority: Major
>
> I have a CombineFn for which add_input has very large overhead. I would like 
> to batch the incoming elements into a large batch before each call to 
> add_input to reduce this overhead. In other words, I would like to do 
> something like: 
> {{elements | GroupByKey() | BatchElements() | CombineValues(MyCombineFn())}}
> Unfortunately, BatchElements is not key-aware, and can't be used after a 
> GroupByKey to batch elements per key. I'm working around this by doing the 
> batching within CombineValues, which makes the CombineFn rather messy. It 
> would be nice if there were a key-aware BatchElements transform which could 
> be used in this context.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3737) Key-aware batching function

2018-06-09 Thread Debasish Das (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507150#comment-16507150
 ] 

Debasish Das commented on BEAM-3737:


I saw this being mentioned in TFMA...I am also not clear why BatchElements() 
is needed...groupByKey takes a combiner which should run on both the map and 
reduce side...Am I missing something here? Is it the case that the Beam Combiner 
does not run on the map side? [~robertwb] is that why you mentioned that we should 
run the combiner upfront in a ParDo and then run groupByKey to achieve map- and 
reduce-side combine?

> Key-aware batching function
> ---
>
> Key: BEAM-3737
> URL: https://issues.apache.org/jira/browse/BEAM-3737
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chuan Yu Foo
>Priority: Major
>
> I have a CombineFn for which add_input has very large overhead. I would like 
> to batch the incoming elements into a large batch before each call to 
> add_input to reduce this overhead. In other words, I would like to do 
> something like: 
> {{elements | GroupByKey() | BatchElements() | CombineValues(MyCombineFn())}}
> Unfortunately, BatchElements is not key-aware, and can't be used after a 
> GroupByKey to batch elements per key. I'm working around this by doing the 
> batching within CombineValues, which makes the CombineFn rather messy. It 
> would be nice if there were a key-aware BatchElements transform which could 
> be used in this context.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2810) Consider a faster Avro library in Python

2018-03-21 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407543#comment-16407543
 ] 

Debasish Das commented on BEAM-2810:


[~chamikara] did you try fastavro and pyavroc as well? Neither of them satisfied 
the byte-position requirement...I should be able to add the byte-position code if 
you let me know which library is preferred for Beam...pyavroc claims to be 4x 
faster than fastavro... 

> Consider a faster Avro library in Python
> 
>
> Key: BEAM-2810
> URL: https://issues.apache.org/jira/browse/BEAM-2810
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Priority: Major
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2 
> MB/s.
> We use the standard Python "avro" library which is apparently known to be 
> very slow (10x+ slower than Java) 
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
>  and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner

2018-03-21 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407540#comment-16407540
 ] 

Debasish Das commented on BEAM-1442:


Thanks [~robertwb]...I will look into BEAM-2810 to see if we can fix the Avro 
perf...another quick question...do we have a multithreaded local pipeline runner? 
Somehow I missed it in the pipeline configs...

> Performance improvement of the Python DirectRunner
> --
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Pablo Estrada
>Assignee: Charles Chen
>Priority: Major
>  Labels: gsoc2017, mentor, python
> Fix For: 2.4.0
>
>
> The DirectRunner for Python and Java are intended to act as policy enforcers, 
> and correctness checkers for Beam pipelines; but there are users that run 
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2810) Consider a faster Avro library in Python

2018-03-21 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407538#comment-16407538
 ] 

Debasish Das commented on BEAM-2810:


I will try reading BQ from Beam directly, but during iterative processing an 
intermediate format like Avro can help...even better if Parquet is supported, 
but I did not see much support for Parquet in the GCP ecosystem...

> Consider a faster Avro library in Python
> 
>
> Key: BEAM-2810
> URL: https://issues.apache.org/jira/browse/BEAM-2810
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Priority: Major
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2 
> MB/s.
> We use the standard Python "avro" library which is apparently known to be 
> very slow (10x+ slower than Java) 
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
>  and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2810) Consider a faster Avro library in Python

2018-03-21 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407532#comment-16407532
 ] 

Debasish Das commented on BEAM-2810:


Our flow starts from BQ-export / GCS Avro files, and running multi-core (4/8) 
local pipelines on a sizable single-node dataset (say 500 MB - 1 GB) will be 
needed for our use case...Is it possible to expose Java Avro through py4j or some 
trick like that? I should be able to take up the task...

> Consider a faster Avro library in Python
> 
>
> Key: BEAM-2810
> URL: https://issues.apache.org/jira/browse/BEAM-2810
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Priority: Major
>
> https://stackoverflow.com/questions/45870789/bottleneck-on-data-source
> Seems like this job is reading Avro files (exported by BigQuery) at about 2 
> MB/s.
> We use the standard Python "avro" library which is apparently known to be 
> very slow (10x+ slower than Java) 
> http://apache-avro.679487.n3.nabble.com/Avro-decode-very-slow-in-Python-td4034422.html,
>  and there are alternatives e.g. https://pypi.python.org/pypi/fastavro/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner

2018-03-20 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407241#comment-16407241
 ] 

Debasish Das commented on BEAM-1442:


Hi...I am pushing 10 MB Avro files locally, and the idea is to push a sizable 
amount of data in local mode for pipeline validation...Can I get this fix from pip 
to test it out on local files?

> Performance improvement of the Python DirectRunner
> --
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Pablo Estrada
>Assignee: Charles Chen
>Priority: Major
>  Labels: gsoc2017, mentor, python
> Fix For: 2.4.0
>
>
> The DirectRunner for Python and Java are intended to act as policy enforcers, 
> and correctness checkers for Beam pipelines; but there are users that run 
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Lucene, Spark, HDFS question

2018-03-14 Thread Debasish Das
I have written a Spark-Lucene integration as part of the Verizon trapezium/dal
project...you can extract the data stored in the HDFS indices and feed it to
Spark...

https://github.com/Verizon/trapezium/tree/master/dal/src/test/scala/com/verizon/bda/trapezium/dal

I intend to publish it as a Spark package as soon as I get time.

You can use spark-solr or spark-elastic, but I did not want to bring in the
Solr/Elastic dependency, to stay performant...

Thanks.
Deb

On Mar 13, 2018 4:31 PM, "Tom Hirschfeld"  wrote:

Hello!


*Background*: My team is running a machine learning pipeline, and part of
the pipeline is an http scrape of a web based Lucene application via
http calls. The scrape outputs a CSV file that we then upload to HDFS and
use it as input to run a spark ML job.

*Question: *Is there a way for our spark application to read from a lucene
index stored in HDFS? Specifically, I see here that solr-core has an HDFS
directory type that seems to be compatible with our Lucene IndexReader. Is
this compatible? Are we able to store our index in
HDFS and read from a spark job?


Best,
Tom Hirschfeld


ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi,

ECOS is a solver for second-order cone programs (SOCPs), and we showed the Spark
integration at the 2014 Spark Summit:
https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/.
Right now the examples show how to reformulate matrix factorization as a
SOCP and solve each alternating step with ECOS:

https://github.com/embotech/ecos-java-scala
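
For context, a sketch of the reformulation used in each alternating step (my
notation: H is the factor held fixed, r the corresponding ratings slice, x the
factor row being solved for):

  \min_{x \ge 0} \tfrac{1}{2} \| Hx - r \|_2^2
  \quad \Longleftrightarrow \quad
  \min_{x,\,t} \; t \quad \text{s.t.} \quad \| Hx - r \|_2 \le t, \;\; x \ge 0

i.e., the nonnegative least-squares subproblem written in epigraph form, which is
exactly the kind of second-order cone program ECOS accepts.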

For distributed optimization, I expect it will be useful wherever, for each
primary row key (sensor, car, robot :-), we are fitting a constrained
quadratic / cone program. Please try it out and let me know your feedback.

Thanks.
Deb


ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi,

ECOS is a solver for second-order cone programs (SOCPs), and we showed the Spark
integration at the 2014 Spark Summit:
https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/.
Right now the examples show how to reformulate matrix factorization as a
SOCP and solve each alternating step with ECOS:

https://github.com/embotech/ecos-java-scala

For distributed optimization, I expect it will be useful wherever, for each
primary row key (sensor, car, robot :-), we are fitting a constrained
quadratic / cone program. Please try it out and let me know your feedback.

Thanks.
Deb


Re: Hinge Gradient

2017-12-17 Thread Debasish Das
If you can point me to previous benchmarks that were done, I would like to
try smoothing and see if LBFGS convergence improves without impacting the
linear SVC loss.

Thanks.
Deb
On Dec 16, 2017 7:48 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:

Hi Weichen,

Traditionally SVMs are solved using quadratic programming solvers, and that is
most likely why this idea is not so popular, but since in MLlib we are using
smooth methods to optimize linear SVM, the idea of smoothing the SVM loss
becomes relevant.

The paper also mentions kernel SVM using the same idea. In place of the full
kernel, we can use random kitchen sinks.

http://research.cs.wisc.edu/dmi/svm/ssvm/

I will go through Yuhao's work as well.

Thanks.
Deb


On Dec 16, 2017 6:35 PM, "Weichen Xu" <weichen...@databricks.com> wrote:

Hi Deb,

Which library or paper do you find to use this loss function in SVM ?

But I prefer the implementation in LIBLINEAR which use coordinate descent
optimizer.

Thanks.

On Sun, Dec 17, 2017 at 6:52 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> Hello Deb,
>
> To optimize non-smooth function on LBFGS really should be considered
> carefully.
> Is there any literature that proves changing max to soft-max can behave
> well?
> I’m more than happy to see some benchmarks if you can have.
>
> + Yuhao, who did similar effort in this PR: https://github.com/apache/
> spark/pull/17862
>
> Regards
> Yanbo
>
> On Dec 13, 2017, at 12:20 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
> Hi,
>
> I looked into the LinearSVC flow and found the gradient for hinge as
> follows:
>
> Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
> Therefore the gradient is -(2y - 1)*x
>
> max is a non-smooth function.
>
> Did we try using ReLu/Softmax function and use that to smooth the hinge
> loss ?
>
> Loss function will change to SoftMax(0, 1 - (2y-1) (f_w(x)))
>
> Since this function is smooth, gradient will be well defined and
> LBFGS/OWLQN should behave well.
>
> Please let me know if this has been tried already. If not I can run some
> benchmarks.
>
> We have soft-max in multinomial regression and can be reused for LinearSVC
> flow.
>
> Thanks.
> Deb
>
>
>


Re: Hinge Gradient

2017-12-16 Thread Debasish Das
Hi Weichen,

Traditionally SVMs are solved using quadratic programming solvers, and that is
most likely why this idea is not so popular, but since in MLlib we are using
smooth methods to optimize linear SVM, the idea of smoothing the SVM loss
becomes relevant.

The paper also mentions kernel SVM using the same idea. In place of the full
kernel, we can use random kitchen sinks.

http://research.cs.wisc.edu/dmi/svm/ssvm/

I will go through Yuhao's work as well.

Thanks.
Deb


On Dec 16, 2017 6:35 PM, "Weichen Xu" <weichen...@databricks.com> wrote:

Hi Deb,

Which library or paper do you find to use this loss function in SVM ?

But I prefer the implementation in LIBLINEAR which use coordinate descent
optimizer.

Thanks.

On Sun, Dec 17, 2017 at 6:52 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> Hello Deb,
>
> To optimize non-smooth function on LBFGS really should be considered
> carefully.
> Is there any literature that proves changing max to soft-max can behave
> well?
> I’m more than happy to see some benchmarks if you can have.
>
> + Yuhao, who did similar effort in this PR: https://github.com/apache/
> spark/pull/17862
>
> Regards
> Yanbo
>
> On Dec 13, 2017, at 12:20 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
> Hi,
>
> I looked into the LinearSVC flow and found the gradient for hinge as
> follows:
>
> Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
> Therefore the gradient is -(2y - 1)*x
>
> max is a non-smooth function.
>
> Did we try using ReLu/Softmax function and use that to smooth the hinge
> loss ?
>
> Loss function will change to SoftMax(0, 1 - (2y-1) (f_w(x)))
>
> Since this function is smooth, gradient will be well defined and
> LBFGS/OWLQN should behave well.
>
> Please let me know if this has been tried already. If not I can run some
> benchmarks.
>
> We have soft-max in multinomial regression and can be reused for LinearSVC
> flow.
>
> Thanks.
> Deb
>
>
>


Hinge Gradient

2017-12-13 Thread Debasish Das
Hi,

I looked into the LinearSVC flow and found the gradient for hinge as
follows:

Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
Therefore the gradient is -(2y - 1)*x

max is a non-smooth function.

Did we try using a ReLU/soft-max function to smooth the hinge loss?

The loss function will change to SoftMax(0, 1 - (2y-1) (f_w(x))).

Since this function is smooth, the gradient will be well defined and
LBFGS/OWLQN should behave well.
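
To spell out the smoothing (my notation: \beta is a sharpness parameter and
\sigma the logistic function):

  L_\beta(w) = \frac{1}{\beta} \log\left(1 + e^{\beta \, (1 - (2y-1) w^\top x)}\right),
  \qquad
  \nabla_w L_\beta = -(2y-1) \, \sigma\!\left(\beta \, (1 - (2y-1) w^\top x)\right) x

As \beta \to \infty this recovers max(0, 1 - (2y-1) w^\top x); for finite \beta
the gradient is defined everywhere, which is what LBFGS/OWLQN want.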

Please let me know if this has been tried already. If not, I can run some
benchmarks.

We have soft-max in multinomial regression and it can be reused for the
LinearSVC flow.

Thanks.
Deb


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Debasish Das
+1

Is there any design doc related to the API/internal changes? Will CP be the
default in Structured Streaming, or is it a mode that works in conjunction with
the existing behavior?

Thanks.
Deb

On Nov 1, 2017 8:37 AM, "Reynold Xin"  wrote:

Earlier I sent out a discussion thread for CP in Structured Streaming:

https://issues.apache.org/jira/browse/SPARK-20928

It is meant to be a very small, surgical change to Structured Streaming to
enable ultra-low latency. This is great timing because we are also
designing and implementing data source API v2. If designed properly, we can
have the same data source API working for both streaming and batch.


Following the SPIP process, I'm putting this SPIP up for a vote.

+1: Let's go ahead and design / implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.


Re: Restful API Spark Application

2017-05-16 Thread Debasish Das
You can run l
On May 15, 2017 3:29 PM, "Nipun Arora"  wrote:

> Thanks all for your response. I will have a look at them.
>
> Nipun
>
> On Sat, May 13, 2017 at 2:38 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> It's in scala but it should be portable in java
>> https://github.com/vgkowski/akka-spark-experiments
>>
>>
>> Le 12 mai 2017 10:54 PM, "Василец Дмитрий"  a
>> écrit :
>>
>> and livy https://hortonworks.com/blog/livy-a-rest-interface-for-
>> apache-spark/
>>
>> On Fri, May 12, 2017 at 10:51 PM, Sam Elamin 
>> wrote:
>> > Hi Nipun
>> >
>> > Have you checked out the job server
>> >
>> > https://github.com/spark-jobserver/spark-jobserver
>> >
>> > Regards
>> > Sam
>> > On Fri, 12 May 2017 at 21:00, Nipun Arora 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> We have written a java spark application (primarily uses spark sql). We
>> >> want to expand this to provide our application "as a service". For
>> this, we
>> >> are trying to write a REST API. While a simple REST API can be easily
>> made,
>> >> and I can get Spark to run through the launcher. I wonder, how the
>> spark
>> >> context can be used by service requests, to process data.
>> >>
>> >> Are there any simple JAVA examples to illustrate this use-case? I am
>> sure
>> >> people have faced this before.
>> >>
>> >>
>> >> Thanks
>> >> Nipun
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>


Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well...check out SPARK-4823...you can compare
quality with the approximate variant...
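
For reference, the LSH API that shipped in 2.1 follows the Estimator pattern below;
this is a minimal sketch with the built-in BucketedRandomProjectionLSH (the
Euclidean variant -- the RandomSignProjectionLSH gist mentioned in the quoted mail
plugs in the same way), and df with a "features" vector column is an assumption:

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val brp = new BucketedRandomProjectionLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setNumHashTables(3)
  .setBucketLength(2.0)

val model = brp.fit(df)

// Approximate top-10 neighbours of one query vector (dimension must match df).
val key = Vectors.dense(Array.fill(300)(0.1))
model.approxNearestNeighbors(df, key, 10).show()

// Approximate self-join under a distance threshold, for pairwise similarity.
model.approxSimilarityJoin(df, df, 1.5, "distCol").show()
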
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:

> Hi everyone,
> Since spark 2.1.0 introduces LSH (http://spark.apache.org/docs/
> latest/ml-features.html#locality-sensitive-hashing), we want to use LSH
> to find approximately nearest neighbors. Basically, We have dataset with
> about 7M rows. we want to use cosine distance to meassure the similarity
> between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I try to tune some configurations such
> as serialization, memory fraction, executor memory (~6G), number of
> executors ( ~20), memory overhead ..., but nothing works. I often get error
> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
> this implementation is done by engineer at Uber but I don't know right
> configurations,.. to run the algorithm at scale. Do they need very big
> memory to run it?
>
> Any help would be appreciated.
> Thanks
>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-05 Thread Debasish Das
Hi Aseem,

Due to the production deploy, we did not upgrade to 2.0, but that's a critical
item on our list.

For exposing models out of PipelineModel, let me look into the ML
tasks...we should add it, since a DataFrame should not be a must for model
scoring...many times models are scored on API or streaming paths which don't
have micro-batching involved...data lands directly from HTTP or Kafka/message
queues...for such cases raw access to the ML model is essential, similar to
mllib model access...
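
To make the contrast concrete, this is roughly what scoring looks like today
through the DataFrame API (the path we would like to be able to bypass); the model
path and the f1/f2/f3 feature columns are made up for the sketch, and the pipeline
is assumed to end in a predictor that outputs a "prediction" column:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("scoring").getOrCreate()
import spark.implicits._

// Hypothetical saved pipeline whose first stage consumes columns f1, f2, f3.
val model = PipelineModel.load("/models/my-pipeline")

// Even a single record has to be wrapped in a DataFrame before transform().
val single = Seq((0.5, 1.2, -0.3)).toDF("f1", "f2", "f3")
model.transform(single).select("prediction").show()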

Thanks.
Deb
On Feb 4, 2017 9:58 PM, "Aseem Bansal" <asmbans...@gmail.com> wrote:

> @Debasish
>
> I see that the spark version being used in the project that you mentioned
> is 1.6.0. I would suggest that you take a look at some blogs related to
> Spark 2.0 Pipelines, Models in new ml package. The new ml package's API as
> of latest Spark 2.1.0 release has no way to call predict on single vector.
> There is no API exposed. It is WIP but not yet released.
>
> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> If we expose an API to access the raw models out of PipelineModel can't
>> we call predict directly on it from an API ? Is there a task open to expose
>> the model out of PipelineModel so that predict can be called on it...there
>> is no dependency of spark context in ml model...
>> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>
>>>
>>>- In Spark 2.0 there is a class called PipelineModel. I know that
>>>the title says pipeline but it is actually talking about PipelineModel
>>>trained via using a Pipeline.
>>>- Why PipelineModel instead of pipeline? Because usually there is a
>>>series of stuff that needs to be done when doing ML which warrants an
>>>ordered sequence of operations. Read the new spark ml docs or one of the
>>>databricks blogs related to spark pipelines. If you have used python's
>>>sklearn library the concept is inspired from there.
>>>- "once model is deserialized as ml model from the store of choice
>>>within ms" - The timing of loading the model was not what I was
>>>referring to when I was talking about timing.
>>>- "it can be used on incoming features to score through
>>>spark.ml.Model predict API". The predict API is in the old mllib package
>>>not the new ml package.
>>>- "why r we using dataframe and not the ML model directly from API"
>>>- Because as of now the new ml package does not have the direct API.
>>>
>>>
>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> I am not sure why I will use pipeline to do scoring...idea is to build
>>>> a model, use model ser/deser feature to put it in the row or column store
>>>> of choice and provide a api access to the model...we support these
>>>> primitives in github.com/Verizon/trapezium...the api has access to
>>>> spark context in local or distributed mode...once model is deserialized as
>>>> ml model from the store of choice within ms, it can be used on incoming
>>>> features to score through spark.ml.Model predict API...I am not clear on
>>>> 2200x speedup...why r we using dataframe and not the ML model directly from
>>>> API ?
>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>>
>>>>> Does this support Java 7?
>>>>> What is your timezone in case someone wanted to talk?
>>>>>
>>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>>>> wrote:
>>>>>
>>>>>> Hey Aseem,
>>>>>>
>>>>>> We have built pipelines that execute several string indexers, one hot
>>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>>> Execution time for the linear regression was on the order of 11
>>>>>> microseconds, a bit longer for random forest. This can be further 
>>>>>> optimized
>>>>>> by using row-based transformations if your pipeline is simple to around 
>>>>>> 2-3
>>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>>> the time all the processing was done, we had somewhere around 1000 
>>>>>> features
>>>>>> or so going into the linear regression after one hot encoding and
>>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
Except, of course, LDA, ALS and neural net models...for them the model needs to
be either pre-scored and cached in a KV store, or the matrices / graph should
be kept in a KV store and accessed through a REST API to serve the
output...for neural nets it's more fun, since it's a distributed or local graph
over which TensorFlow compute needs to run...

In trapezium we support writing these models to stores like Cassandra and
Lucene, for example, and then provide a config-driven akka-http based API to
add the business logic to access these models from a store and expose the
model serving as a REST endpoint.

Matrix, graph and kernel models we use a lot, and for them it turned out that
mllib-style model predict was useful if we change the underlying store...
On Feb 4, 2017 9:37 AM, "Debasish Das" <debasish.da...@gmail.com> wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on it...there
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>>
>>- In Spark 2.0 there is a class called PipelineModel. I know that the
>>title says pipeline but it is actually talking about PipelineModel trained
>>via using a Pipeline.
>>- Why PipelineModel instead of pipeline? Because usually there is a
>>series of stuff that needs to be done when doing ML which warrants an
>>ordered sequence of operations. Read the new spark ml docs or one of the
>>databricks blogs related to spark pipelines. If you have used python's
>>sklearn library the concept is inspired from there.
>>- "once model is deserialized as ml model from the store of choice
>>within ms" - The timing of loading the model was not what I was
>>referring to when I was talking about timing.
>>- "it can be used on incoming features to score through
>>spark.ml.Model predict API". The predict API is in the old mllib package
>>not the new ml package.
>>- "why r we using dataframe and not the ML model directly from API" -
>>Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>
>>>> Does this support Java 7?
>>>> What is your timezone in case someone wanted to talk?
>>>>
>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>>> wrote:
>>>>
>>>>> Hey Aseem,
>>>>>
>>>>> We have built pipelines that execute several string indexers, one hot
>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>> Execution time for the linear regression was on the order of 11
>>>>> microseconds, a bit longer for random forest. This can be further 
>>>>> optimized
>>>>> by using row-based transformations if your pipeline is simple to around 
>>>>> 2-3
>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>> the time all the processing was done, we had somewhere around 1000 
>>>>> features
>>>>> or so going into the linear regression after one hot encoding and
>>>>> everything else.
>>>>>
>>>>> Hope this helps,
>>>>> Hollin
>>>>>
>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>>>>>> wrote:
>>>>>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
If we expose an API to access the raw models out of PipelineModel, can't we
call predict directly on it from an API? Is there a task open to expose
the model out of PipelineModel so that predict can be called on it...there
is no dependency on the Spark context in the ml model...
On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:

>
>- In Spark 2.0 there is a class called PipelineModel. I know that the
>title says pipeline but it is actually talking about PipelineModel trained
>via using a Pipeline.
>- Why PipelineModel instead of pipeline? Because usually there is a
>series of stuff that needs to be done when doing ML which warrants an
>ordered sequence of operations. Read the new spark ml docs or one of the
>databricks blogs related to spark pipelines. If you have used python's
>sklearn library the concept is inspired from there.
>- "once model is deserialized as ml model from the store of choice
>within ms" - The timing of loading the model was not what I was
>referring to when I was talking about timing.
>- "it can be used on incoming features to score through spark.ml.Model
>predict API". The predict API is in the old mllib package not the new ml
>package.
>- "why r we using dataframe and not the ML model directly from API" -
>Because as of now the new ml package does not have the direct API.
>
>
> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> I am not sure why I will use pipeline to do scoring...idea is to build a
>> model, use model ser/deser feature to put it in the row or column store of
>> choice and provide a api access to the model...we support these primitives
>> in github.com/Verizon/trapezium...the api has access to spark context in
>> local or distributed mode...once model is deserialized as ml model from the
>> store of choice within ms, it can be used on incoming features to score
>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>> r we using dataframe and not the ML model directly from API ?
>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>
>>> Does this support Java 7?
>>> What is your timezone in case someone wanted to talk?
>>>
>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>> wrote:
>>>
>>>> Hey Aseem,
>>>>
>>>> We have built pipelines that execute several string indexers, one hot
>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>> Execution time for the linear regression was on the order of 11
>>>> microseconds, a bit longer for random forest. This can be further optimized
>>>> by using row-based transformations if your pipeline is simple to around 2-3
>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>> the time all the processing was done, we had somewhere around 1000 features
>>>> or so going into the linear regression after one hot encoding and
>>>> everything else.
>>>>
>>>> Hope this helps,
>>>> Hollin
>>>>
>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com>
>>>> wrote:
>>>>
>>>>> Does this support Java 7?
>>>>>
>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Is computational time for predictions on the order of few
>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>
>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>>
>>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>>>>>> about MLeap and how you can use it to build production services from 
>>>>>>> your
>>>>>>> Spark-trained ML pipelines. MLeap is an open-source technology that 
>>>>>>> allows
>>>>>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>>>>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>>>>>> dependencies on a Spark context and the serialization format is entirely
>>>>>>> based on Protobuf 3 and JSON.

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
I am not sure why I would use a pipeline to do scoring...the idea is to build a
model, use the model ser/deser feature to put it in the row or column store of
choice, and provide API access to the model...we support these primitives
in github.com/Verizon/trapezium...the API has access to a Spark context in
local or distributed mode...once the model is deserialized as an ml model from the
store of choice within ms, it can be used on incoming features to score
through the spark.ml.Model predict API...I am not clear on the 2200x speedup...why
are we using a DataFrame and not the ML model directly from the API?
On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:

> Does this support Java 7?
> What is your timezone in case someone wanted to talk?
>
> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins  wrote:
>
>> Hey Aseem,
>>
>> We have built pipelines that execute several string indexers, one hot
>> encoders, scaling, and a random forest or linear regression at the end.
>> Execution time for the linear regression was on the order of 11
>> microseconds, a bit longer for random forest. This can be further optimized
>> by using row-based transformations if your pipeline is simple to around 2-3
>> microseconds. The pipeline operated on roughly 12 input features, and by
>> the time all the processing was done, we had somewhere around 1000 features
>> or so going into the linear regression after one hot encoding and
>> everything else.
>>
>> Hope this helps,
>> Hollin
>>
>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
>> wrote:
>>
>>> Does this support Java 7?
>>>
>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>>> wrote:
>>>
 Is computational time for predictions on the order of few milliseconds
 (< 10 ms) like the old mllib library?

 On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
 wrote:

> Hey everyone,
>
>
> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
> about MLeap and how you can use it to build production services from your
> Spark-trained ML pipelines. MLeap is an open-source technology that allows
> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
> Models to a scoring engine instantly. The MLeap execution engine has no
> dependencies on a Spark context and the serialization format is entirely
> based on Protobuf 3 and JSON.
>
>
> The recent 0.5.0 release provides serialization and inference support
> for close to 100% of Spark transformers (we don’t yet support ALS and 
> LDA).
>
>
> MLeap is open-source, take a look at our Github page:
>
> https://github.com/combust/mleap
>
>
> Or join the conversation on Gitter:
>
> https://gitter.im/combust/mleap
>
>
> We have a set of documentation to help get you started here:
>
> http://mleap-docs.combust.ml/
>
>
> We even have a set of demos, for training ML Pipelines and linear,
> logistic and random forest models:
>
> https://github.com/combust/mleap-demo
>
>
> Check out our latest MLeap-serving Docker image, which allows you to
> expose a REST interface to your Spark ML pipeline models:
>
> http://mleap-docs.combust.ml/mleap-serving/
>
>
> Several companies are using MLeap in production and even more are
> currently evaluating it. Take a look and tell us what you think! We hope 
> to
> talk with you soon and welcome feedback/suggestions!
>
>
> Sincerely,
>
> Hollin and Mikhail
>


>>>
>>
>


Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Debasish Das
You may want to pull the release/1.2 branch and the 1.2.0 tag to build it
yourself in case the packages are not available.
On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim" 
wrote:

> Hi Ayan,
>
> Thanks a million.
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 15 January 2017 at 22:48, ayan guha  wrote:
>
>> archive.apache.org will always have all the releases:
>> http://archive.apache.org/dist/spark/
>>
>> @Spark users: it may be a good idea to have a "To download older
>> versions, click here" link to Spark Download page?
>>
>> On Mon, Jan 16, 2017 at 8:16 AM, Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> Hi,
>>>
>>> I am looking for Spark 1.2.0 version. I tried to download in the Spark
>>> website but it's no longer available.
>>>
>>> Any suggestion?
>>>
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>> _
>>> *Md. Rezaul Karim*, BSc, MSc
>>> PhD Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>> 
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15809876#comment-15809876
 ] 

Debasish Das commented on SPARK-10078:
--

I looked into the code and I see we are replicating Breeze BFGS and OWLQN core 
logic in this PR:
https://github.com/yanboliang/spark-vlbfgs/tree/master/src/main/scala/org/apache/spark/ml/optim.

We can provide a DiffFunction interface that works on feature partitions and add 
the VL-BFGS paper logic as a refactoring of the current Breeze BFGS code...

Now DiffFunction can run with a DistributedVector or a Vector. What that helps 
with is that even with < 100M features, we can run multi-core VL-BFGS by using 
multiple partitions, and an if-else switch is not necessary.

I can provide Breeze interfaces based on your PR if you agree with the idea. 
BFGS and OWLQN are a few variants, but Breeze has several constraint solvers that 
use the BFGS code...  
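
For concreteness, the contract I have in mind is the existing Breeze DiffFunction /
LBFGS pair; a VL-BFGS variant would keep this shape while letting the vector type
be distributed. A minimal local sketch (toy quadratic objective, not VL-BFGS itself):

{code:scala}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy objective f(x) = ||x||^2 with gradient 2x, to show the
// calculate(x) => (value, gradient) contract that a partitioned-feature
// DiffFunction would also have to satisfy.
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) =
    (x dot x, x * 2.0)
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
val xOpt = lbfgs.minimize(f, DenseVector.fill(5)(1.0))
{code}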

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793770#comment-15793770
 ] 

Debasish Das commented on SPARK-10078:
--

[~mengxr] [~dlwh] is it possible to implement VL-BFGS as part of Breeze so that 
OWLQN, LBFGS, LBFGS-B and proximal.NonlinearMinimizer benefit from it? We can 
bring it in the way we bring in LBFGS/OWLQN right now...If it makes sense, I can 
look at the design doc and propose a Breeze interface to abstract the RDD details...

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793760#comment-15793760
 ] 

Debasish Das edited comment on SPARK-10078 at 1/3/17 12:26 AM:
---

Ideally feature partitioning should be automatically tuned...at 100M features 
the master-only processing we do with Breeze LBFGS / OWLQN will also benefit 
from VL-BFGS...Ideally it should be part of Breeze, and a proper interface 
should be defined so that the Breeze VL-BFGS solver can be called in Spark 
ML...There are bounded BFGS variants in Breeze...all of them will benefit from 
this change. A solver can be used in other frameworks as well and may not be 
constrained to RDDs if possible...


was (Author: debasish83):
Ideally feature partitioning should be automatically tuned...at 100M features 
the master-only processing we do with Breeze LBFGS / OWLQN will also benefit 
from VL-BFGS...Ideally it should be part of Breeze, and a proper interface 
should be defined so that the Breeze VL-BFGS solver can be called in 
Spark ML...

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793760#comment-15793760
 ] 

Debasish Das commented on SPARK-10078:
--

Ideally feature partitioning should be automatically tuned...at 100M features 
the master-only processing we do with Breeze LBFGS / OWLQN will also benefit 
from VL-BFGS...Ideally it should be part of Breeze, and a proper interface 
should be defined so that the Breeze VL-BFGS solver can be called in 
Spark ML...

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-12-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15777650#comment-15777650
 ] 

Debasish Das edited comment on SPARK-13857 at 12/26/16 5:57 AM:


item->item and user->user was done in an old PR I had...if there is interest I 
can resend it...nice to see how it compares with approximate nearest neighbor 
work from uber:
https://github.com/apache/spark/pull/6213


was (Author: debasish83):
item->item and user->user was done in an old PR I had...if there is interested 
I can resend it...nice to see how it compares with approximate nearest neighbor 
work from uber:
https://github.com/apache/spark/pull/6213

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).
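
For reference, a minimal sketch of the existing mllib calls the description above 
refers to (ratings is an assumed RDD[Rating]):

{code:scala}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings: RDD[Rating] is assumed to already exist.
val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

// Top-5 items for one user, and top-5 for every user at once --
// the mllib methods the issue proposes to port to ml.
val top5ForOneUser = model.recommendProducts(42, 5)
val top5PerUser = model.recommendProductsForUsers(5)
{code}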



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-12-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15777650#comment-15777650
 ] 

Debasish Das commented on SPARK-13857:
--

item->item and user->user was done in an old PR I had...if there is interested 
I can resend it...nice to see how it compares with approximate nearest neighbor 
work from uber:
https://github.com/apache/spark/pull/6213

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






SortedSetDocValue vs BinaryDocValues

2016-12-19 Thread Debasish Das
Hi,

I need to add col1:Array[String], col2:Array[Int] and col3:Array[Float] to
docvalue.

col1: Array[String] sparse dimension from OLAP world

col2: Array[Int] + Array[Float] represents a sparse vector for sparse
measure from OLAP world with dictionary encoding for col1 mapped to col2

I have few options to implement it:

1. Use SortedSetDocValuesField for each one of them with String, Int and
Float mapped to Byte

2. Generate byte array from Array[String], Array[Int] and Array[Float] and
save them as a byteBlob using BinaryDocValuesField

I know for sure that Array[Int] and Array[Float] will compress better if I
save them using specific encoding but I am confused whether to use 1 or 2
to implement the idea.

1 has a limitation on the number of bytes I can save and I am not sure if
pushing a Set to serialize to disk is a good idea (I am not sure yet if a
Set is being serialized to disk, most likely not).

I am open to coming up with specific encoding for Array data type where it
re-uses the current String, Int and Float encodings that we already have.

It would be great if experts can provide some pointers on using
SortedSetDocValues versus serializing/deserializing with BinaryDocValuesField. The
idea of sparse dimensions and measures comes from Oracle Essbase, and I
believe we may bring in tensors as well in the future.
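
To make option 2 concrete, here is roughly what I am thinking for the encode side
(the field name and byte layout below are just placeholders, not a final design):

import java.nio.ByteBuffer
import org.apache.lucene.document.{BinaryDocValuesField, Document}
import org.apache.lucene.util.BytesRef

// encode col2 (Array[Int]) and col3 (Array[Float]) as one fixed-layout blob:
// [count][indices...][values...]; decoding is the reverse with ByteBuffer.wrap
def encodeSparseVector(indices: Array[Int], values: Array[Float]): BytesRef = {
  val buf = ByteBuffer.allocate(4 + 4 * indices.length + 4 * values.length)
  buf.putInt(indices.length)
  indices.foreach(i => buf.putInt(i))
  values.foreach(v => buf.putFloat(v))
  new BytesRef(buf.array())
}

val doc = new Document()
doc.add(new BinaryDocValuesField("measure",
  encodeSparseVector(Array(1, 5, 9), Array(0.5f, 1.0f, 2.0f))))

This keeps one docvalue lookup per document, at the cost of doing the Int/Float
specific compression myself inside the blob.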

Thanks.
Deb


[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581366#comment-15581366
 ] 

Debasish Das commented on SPARK-5992:
-

Also, do you have a hash function for euclidean distance? We use cosine, jaccard 
and euclidean with SPARK-4823...for the knn comparison we can use an overlap 
metric...pick a k and then compare the overlap between lsh-based approximate knn 
and brute-force knn...let me know if you need help in running the benchmarks...
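
As a sketch of what I mean by the overlap metric (a hypothetical helper, not part of the 
SPARK-4823 PR):

{code}
// overlap@k between the approximate and exact neighbor lists for one query point
def overlapAtK(approx: Seq[Long], exact: Seq[Long], k: Int): Double = {
  val a = approx.take(k).toSet
  val b = exact.take(k).toSet
  a.intersect(b).size.toDouble / k
}
{code}

Averaging this over the query points gives one number per k to compare LSH-based 
approximate knn against brute-force knn.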

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581361#comment-15581361
 ] 

Debasish Das commented on SPARK-5992:
-

Did you compare with brute-force knn? Normally lsh does not work well for nn 
queries, which is why hybrid spill trees and other ideas came along...I can 
run some comparisons using SPARK-4823

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Commented] (SPARK-4823) rowSimilarities

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581359#comment-15581359
 ] 

Debasish Das commented on SPARK-4823:
-

We use it in multiple use cases internally but did not get time to refactor the 
PR into 3 smaller PRs...I will update the PR to 2.0

> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
> Attachments: MovieLensSimilarity Comparisons.pdf, 
> SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
>
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This is JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.






Re: Spark Improvement Proposals

2016-10-16 Thread Debasish Das
Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as
soon as I looked into it since, compared to writing Java map-reduce and
Cascading code, Spark made writing distributed code fun...But now, as we
have gone deeper with Spark and the real-time streaming use-case has become
more prominent, I think it is time to bring a messaging model in conjunction
with the batch/micro-batch API that Spark is good at...akka-streams close
integration with Spark micro-batching APIs looks like a great direction to
stay in the game with Apache Flink...Spark 2.0 integrated streaming with
batch under the assumption that micro-batching is sufficient to run SQL
commands on a stream, but do we really have time to do SQL processing on
streaming data within 1-2 seconds?

After reading the email chain, I started to look into the Flink documentation
and if you compare it with the Spark documentation, I think we have major work
to do detailing out Spark internals so that more people from the community
start to take an active role in improving the issues so that Spark stays
strong compared to Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer an engine that only works for micro-batch and batch...We (and
I am sure many others) are pushing Spark as an engine for stream and query
processing...we need to make it a state-of-the-art engine for high speed
streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda 
wrote:

> Hi everyone,
>
> I'm quite late with my answer, but I think my suggestions may help a
> little bit. :) Many technical and organizational topics were mentioned,
> but I want to focus on these negative posts about Spark and about "haters"
>
> I really like Spark. Ease of use, speed, very good community - it's
> all here. But every project has to "fight" on the "framework market"
> to still be no. 1. I'm following many Spark and Big Data communities,
> maybe my mail will inspire someone :)
>
> You (every Spark developer; so far I didn't have enough time to join
> contributing to Spark) has done excellent job. So why are some people
> saying that Flink (or other framework) is better, like it was posted in
> this mailing list? No, not because that framework is better in all
> cases.. In my opinion, many of these discussions where started after
> Flink marketing-like posts. Please look at StackOverflow "Flink vs "
> posts, almost every post is "won" by Flink. Answers are sometimes
> saying nothing about other frameworks, Flink's users (often PMC's) are
> just posting the same information about real-time streaming, about delta
> iterations, etc. It looks smart and very often it is marked as an answer,
> even if - in my opinion - the whole truth wasn't told.
>
>
> My suggestion: I don't have enough money and knowledge to perform a huge
> performance test. Maybe some company that supports Spark (Databricks,
> Cloudera? - just saying you're most visible in community :) ) could
> perform performance test of:
>
> - streaming engine - probably Spark will loose because of mini-batch
> model, however currently the difference should be much lower that in
> previous versions
>
> - Machine Learning models
>
> - batch jobs
>
> - Graph jobs
>
> - SQL queries
>
> People will see that Spark is evolving and is also a modern framework,
> because after reading the posts mentioned above people may think "it is
> outdated, the future is in framework X".
>
> Matei Zaharia posted excellent blog post about how Spark Structured
> Streaming beats every other framework in terms of ease-of-use and
> reliability. Performance tests, done in various environments (in
> example: laptop, small 2 node cluster, 10-node cluster, 20-node
> cluster), could be also very good marketing stuff to say "hey, you're
> telling that you're better, but Spark is still faster and is still
> getting even faster!". This would be based on facts (just numbers),
> not opinions. It would be good for companies, for marketing purposes and
> for every Spark developer
>
>
> Second: real-time streaming. I've written some time ago about real-time
> streaming support in Spark Structured Streaming. Some work should be
> done to make SSS more low-latency, but I think it's possible. Maybe
> Spark may look at Gearpump, which is also built on top of Akka? I don't
> know yet, it is good topic for SIP. However I think that Spark should
> have real-time streaming support. Currently I see many posts/comments
> that "Spark has too big latency". Spark Streaming is doing very good
> jobs with micro-batches, however I think it is possible to add also more
> real-time processing.
>
> Other people said much more and I agree with proposal of SIP. I'm also
> happy that PMC's are not saying that they will not listen to users, but
> they really want to make Spark better for every user.
>
>
> What do you think about these two topics? Especially I'm looking 

[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2016-08-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411023#comment-15411023
 ] 

Debasish Das commented on SPARK-6932:
-

[~rxin] [~sowen] Do we have any other active parameter server effort going on 
other than the glint project from Rolf? I have started to look into glint to scale 
Spark-as-a-Service to process queries (the idea is to keep the Spark master as a 
coordinator where no compute happens other than coordination through messages; in 
our implementation right now compute is happening on the master, which is a major 
con). More details will be covered in the talk 
https://spark-summit.org/eu-2016/events/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building/
but I believe a parameter server (or something similar) will be needed to scale 
query-processing further, to a Cassandra ring architecture for example...We will 
provide our implementation of the spark-lucene integration as part of our open 
source framework (Trapezium).


> A Prototype of Parameter Server
> ---
>
> Key: SPARK-6932
> URL: https://issues.apache.org/jira/browse/SPARK-6932
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib, Spark Core
>Reporter: Qiping Li
>
>  h2. Introduction
> As specified in 
> [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590],it would be 
> very helpful to integrate parameter server into Spark for machine learning 
> algorithms, especially for those with ultra high dimensions features. 
> After carefully studying the design doc of [Parameter 
> Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing],and
>  the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
> we proposed a prototype of Parameter Server on Spark(Ps-on-Spark), with 
> several key design concerns:
> * *User friendly interface*
>   Careful investigation is done to most existing Parameter Server 
> systems(including:  [petuum|http://petuum.github.io], [parameter 
> server|http://parameterserver.org], 
> [paracel|https://github.com/douban/paracel]) and a user friendly interface is 
> design by absorbing essence from all these system. 
> * *Prototype of distributed array*
> IndexRDD (see 
> [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
> to be a good option for distributed array, because in most case, the #key 
> updates/second is not be very high. 
> So we implement a distributed HashMap to store the parameters, which can 
> be easily extended to get better performance.
> 
> * *Minimal code change*
>   Quite a lot of effort in done to avoid code change of Spark core. Tasks 
> which need parameter server are still created and scheduled by Spark's 
> scheduler. Tasks communicate with parameter server with a client object, 
> through *akka* or *netty*.
> With all these concerns we propose the following architecture:
> h2. Architecture
> !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
> Data is stored in RDD and is partitioned across workers. During each 
> iteration, each worker gets parameters from parameter server then computes 
> new parameters based on old parameters and data in the partition. Finally 
> each worker updates parameters to parameter server.Worker communicates with 
> parameter server through a parameter server client,which is initialized in 
> `TaskContext` of this worker.
> The current implementation is based on YARN cluster mode, 
> but it should not be a problem to transplanted it to other modes. 
> h3. Interface
> We refer to existing parameter server systems(petuum, parameter server, 
> paracel) when design the interface of parameter server. 
> *`PSClient` provides the following interface for workers to use:*
> {code}
> //  get parameter indexed by key from parameter server
> def get[T](key: String): T
> // get multiple parameters from parameter server
> def multiGet[T](keys: Array[String]): Array[T]
> // add parameter indexed by `key` by `delta`, 
> // if multiple `delta` to update on the same parameter,
> // use `reduceFunc` to reduce these `delta`s frist.
> def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
> // update multiple parameters at the same time, use the same `reduceFunc`.
> def multiUpdate(keys: Array[String], delta: Array[T], reduceFunc: (T, T) => 
> T: Unit
> 
> // advance clock to indicate that current iteration is finished.
> def clock(): Unit
>  
> // block until all workers have reached this line of code.
> def sync(): Unit

Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
Hi Manoj,

I have a Spark meetup talk that explains the issues with dimsum when you
have to calculate row similarities. You can still use the PR since it has
all the code you need, but I have not had time to refactor it for the merge.
I believe a few kernels are supported as well.
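
If you need a baseline in the meantime, a brute-force sketch with a per-row topK
cut (to keep the output bounded) looks roughly like this; it is only feasible on a
sample at your scale, and it is not the PR's implementation:

import org.apache.spark.rdd.RDD

def bruteForceTopK(vectors: RDD[(Long, Array[Double])],
                   topK: Int): RDD[(Long, Seq[(Long, Double)])] = {
  val pairs = vectors.cartesian(vectors).filter { case ((i, _), (j, _)) => i != j }
  val distances = pairs.map { case ((i, a), (j, b)) =>
    val d = math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
    (i, (j, d))
  }
  // keep only the topK closest neighbors per row to cut the shuffle/output size
  distances.groupByKey().mapValues(_.toSeq.sortBy(_._2).take(topK))
}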

Thanks.
Deb
On Jul 7, 2016 8:13 PM, "Manoj Awasthi" <awasthi.ma...@gmail.com> wrote:

>
> Hi Debasish, All,
>
> I see the status of SPARK-4823 [0] is "in-progress" still. I couldn't
> gather from the relevant pull request [1] if part of it is already in 1.6.0
> (it's closed now). We are facing the same problem of computing pairwise
> distances between vectors where rows are > 5M and columns in tens (20 to be
> specific). DIMSUM doesn't help because of obvious reasons (transposing the
> matrix infeasible) already discussed in JIRA.
>
> Is there an update on the JIRA ticket above and can I use something to
> compute RowSimilarity in spark 1.6.0 on my dataset? I will be thankful for
> any other ideas too on this.
>
> - Manoj
>
> [0] https://issues.apache.org/jira/browse/SPARK-4823
> [1] https://github.com/apache/spark/pull/6213
>
>
>
> On Thu, Apr 30, 2015 at 6:40 PM, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
>> Thank you guys for the input.
>>
>> Ayan, I am not sure how this can be done using reduceByKey, as far as I
>> can see (but I am not so advanced with Spark), this requires a groupByKey
>> which can be very costly. What would be nice to transform the dataset which
>> contains all the vectors like:
>>
>>
>> val localData = data.zipWithUniqueId().map(_.swap) // Provide some keys
>> val cartesianProduct = localData.cartesian(localData) // Provide the pairs
>> val groupedByKey = cartesianProduct.groupByKey()
>>
>> val neighbourhoods = groupedByKey.map {
>>   case (point: (Long, VectorWithNormAndClass), points: Iterable[(Long,
>> VectorWithNormAndClass)]) => {
>> val distances = points.map {
>>   case (idxB: Long, pointB: VectorWithNormAndClass) =>
>> (idxB, MLUtils.fastSquaredDistance(point._2.vector,
>> point._2.norm, pointB.vector, pointB.norm))
>> }
>>
>> val kthDistance =
>> distances.sortBy(_._2).take(K).max(compareByDistance)
>>
>> (point, distances.filter(_._2 <= kthDistance._2))
>>   }
>> }
>>
>> This is part of my Local Outlier Factor implementation.
>>
>> Of course the distances can be sorted because it is an Iterable, but it
>> gives an idea. Is it possible to make this more efficient? I don't want to
>> use probabilistic functions, and I will cache the matrix because many
>> distances are looked up at the matrix, computing them on demand would
>> require far more computations.​
>>
>> ​​Kind regards,
>> Fokko
>>
>>
>>
>> 2015-04-30 4:39 GMT+02:00 Debasish Das <debasish.da...@gmail.com>:
>>
>>> Cross Join shuffle space might not be needed since most likely through
>>> application specific logic (topK etc) you can cut the shuffle space...Also
>>> most likely the brute force approach will be a benchmark tool to see how
>>> better is your clustering based KNN solution since there are several ways
>>> you can find approximate nearest neighbors for your application
>>> (KMeans/KDTree/LSH etc)...
>>>
>>> There is a variant that I will bring as a PR for this JIRA and we will
>>> of course look into how to improve it further...the idea is to think about
>>> distributed matrix multiply where both matrices A and B are distributed and
>>> master coordinates pulling a partition of A and multiply it with B...
>>>
>>> The idea suffices for kernel matrix generation as well if the number of
>>> rows are modest (~10M or so)...
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4823
>>>
>>>
>>> On Wed, Apr 29, 2015 at 3:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> This is my first thought, please suggest any further improvement:
>>>> 1. Create a rdd of your dataset
>>>> 2. Do an cross join to generate pairs
>>>> 3. Apply reducebykey and compute distance. You will get a rdd with
>>>> keypairs and distance
>>>>
>>>> Best
>>>> Ayan
>>>> On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:
>>>>
>>>>> Dear Sparkers,
>>>>>
>>>>> I am working on an algorithm which requires the pair distance between
>>>>> all points (eg. DBScan

[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares

2016-06-05 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935
 ] 

Debasish Das edited comment on SPARK-9834 at 6/5/16 4:49 PM:
-

Do you have runtime comparisons showing that when features <= 4096, OLS using Normal 
Equations is faster than BFGS? I am extending OLS for sparse features and it 
would be great if you can point to the runtime experiments you have done...


was (Author: debasish83):
Do you have runtime comparisons that when features <= 4096, OLS using Normal 
Equations is faster than BFGS ? 

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares

2016-06-05 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935
 ] 

Debasish Das commented on SPARK-9834:
-

Do you have runtime comparisons showing that when features <= 4096, OLS using Normal 
Equations is faster than BFGS?
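
For context, my mental model of the one-pass normal-equation approach (a rough 
sketch, not the actual Spark implementation, which updates the buffers in place via 
BLAS) is:

{code}
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// one pass to accumulate the Gram matrix AtA and Atb, then a dense solve on the driver
def normalEquationFit(data: RDD[(Double, Array[Double])], numFeatures: Int): DenseVector[Double] = {
  val zero = (DenseMatrix.zeros[Double](numFeatures, numFeatures),
              DenseVector.zeros[Double](numFeatures))
  val (ata, atb) = data.treeAggregate(zero)(
    { case ((m, v), (label, features)) =>
        val x = DenseVector(features)
        (m + x * x.t, v + x * label) },
    { case ((m1, v1), (m2, v2)) => (m1 + m2, v1 + v2) })
  ata \ atb   // Cholesky/LU solve on the driver
}
{code}

My question is at what feature count the Gram accumulation above starts losing to 
BFGS, and how that changes once the features are sparse.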

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent...in
local mode I never paid attention, but the code path should be similar...
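
For example, a minimal sketch of two independent actions on the same cached RDD,
assuming a SparkContext sc as in the shell (the pool names are placeholders and only
matter if you configured FAIR scheduler pools; with the default FIFO scheduler the
two jobs still run from the two threads, just scheduled FIFO):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd = sc.parallelize(1 to 1000000).cache()

val countF = Future { sc.setLocalProperty("spark.scheduler.pool", "poolA"); rdd.count() }
val sumF   = Future { sc.setLocalProperty("spark.scheduler.pool", "poolB"); rdd.sum() }

println(Await.result(countF, Duration.Inf) + " " + Await.result(sumF, Duration.Inf))
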
On Jan 18, 2016 8:00 AM, "Koert Kuipers"  wrote:

> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom 
> wrote:
>
>> Hi,
>>
>> I am running my app in a single machine first before moving it in the
>> cluster; actually simultaneous actions are not working for me now; is this
>> comming from the fact that I am using a single machine ? yet I am using
>> FAIR scheduler.
>>
>> 2016-01-17 21:23 GMT+01:00 Mark Hamstra :
>>
>>> It can be far more than that (e.g.
>>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>>> either unrecognized or a greatly under-appreciated and underused feature of
>>> Spark.
>>>
>>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers 
>>> wrote:
>>>
 the re-use of shuffle files is always a nice surprise to me

 On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
 wrote:

> Same SparkContext means same pool of Workers.  It's up to the
> Scheduler, not the SparkContext, whether the exact same Workers or
> Executors will be used to calculate simultaneous actions against the same
> RDD.  It is likely that many of the same Workers and Executors will be 
> used
> as the Scheduler tries to preserve data locality, but that is not
> guaranteed.  In fact, what is most likely to happen is that the shared
> Stages and Tasks being calculated for the simultaneous actions will not
> actually be run at exactly the same time, which means that shuffle files
> produced for one action will be reused by the other(s), and repeated
> calculations will be avoided even without explicitly caching/persisting 
> the
> RDD.
>
> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
> wrote:
>
>> Same rdd means same sparkcontext means same workers
>>
>> Cache/persist the rdd to avoid repeated jobs
>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" 
>> wrote:
>>
>>> Hi,
>>>
>>> Thank you all for your answers,
>>>
>>> If I correctly understand, actions (in my case foreach) can be run
>>> concurrently and simultaneously on the SAME rdd, (which is logical 
>>> because
>>> they are read only object). however, I want to know if the same workers 
>>> are
>>> used for the concurrent analysis ?
>>>
>>> Thank you
>>>
>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>>
 I stand corrected. How considerable are the benefits though? Will
 the scheduler be able to dispatch jobs from both actions 
 simultaneously (or
 on a when-workers-become-available basis)?

 On 15 January 2016 at 11:44, Koert Kuipers 
 wrote:

> we run multiple actions on the same (cached) rdd all the time, i
> guess in different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use
>> them this way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
>> wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions
>> in parallel? The idea behind RDDs is to provide you with an 
>> abstraction for
>> computing parallel operations on distributed data. Even if you were 
>> to call
>> actions from several threads at once, the individual executors of 
>> your
>> spark environment would still have to perform operations 
>> sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single 
>> operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney > > wrote:
>>
>>> Threads
>>>
>>>
>>> El viernes, 15 de enero de 2016, Kira 
>>> escribió:
>>>
 Hi,

 Can we run *simultaneous* actions on the *same RDD* ?; if yes
 how can this
 be done ?

 Thank you,
 Regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com 

Re: Using spark MLlib without installing Spark

2015-11-26 Thread Debasish Das
Decoupling mllib and core is difficult...it is not intended that you run spark
core 1.5 with a spark mllib 1.6 snapshot...core is more stable, while mllib keeps
changing as new algorithms get added, so sometimes you might be tempted to mix
versions, but it's not recommended.
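
To Bowen's original question, a minimal local-mode sketch (only spark-core and
spark-mllib jars of the same version on the classpath, no cluster install; the data
here is just a toy):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val conf = new SparkConf().setAppName("mllib-local").setMaster("local[*]")
val sc = new SparkContext(conf)

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val model = KMeans.train(data, 2, 20)   // k = 2, maxIterations = 20
model.clusterCenters.foreach(println)
sc.stop()
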
On Nov 21, 2015 8:04 PM, "Reynold Xin"  wrote:

> You can use MLlib and Spark directly without "installing anything". Just
> run Spark in local mode.
>
>
> On Sat, Nov 21, 2015 at 4:05 PM, Rad Gruchalski 
> wrote:
>
>> Bowen,
>>
>> What Andy is doing in the notebook is a slightly different thing. He’s
>> using sbt to bring all spark jars (core, mllib, repl, what have you). You
>> could use maven for that. He then creates a repl and submits all the spark
>> code into it.
>> Pretty sure spark unit tests cover similar uses cases. Maybe not mllib
>> per se but this kind of submission.
>>
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Sunday, 22 November 2015 at 01:01, bowen zhang wrote:
>>
>> Thanks Rad for info. I looked into the repo and see some .snb file using
>> spark mllib. Can you give me a more specific place to look for when
>> invoking the mllib functions? What if I just want to invoke some of the ML
>> functions in my HelloWorld.java?
>>
>> --
>> *From:* Rad Gruchalski 
>> *To:* bowen zhang 
>> *Cc:* "dev@spark.apache.org" 
>> *Sent:* Saturday, November 21, 2015 3:43 PM
>> *Subject:* Re: Using spark MLlib without installing Spark
>>
>> Bowen,
>>
>> One project to look at could be spark-notebook:
>> https://github.com/andypetrella/spark-notebook
>> It uses Spark you in the way you intend to use it.
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>>
>> On Sunday, 22 November 2015 at 00:38, bowen zhang wrote:
>>
>> Hi folks,
>> I am a big fan of Spark's Mllib package. I have a java web app where I
>> want to run some ml jobs inside the web app. My question is: is there a way
>> to just import spark-core and spark-mllib jars to invoke my ML jobs without
>> installing the entire Spark package? All the tutorials related Spark seems
>> to indicate installing Spark is a pre-condition for this.
>>
>> Thanks,
>> Bowen
>>
>>
>>
>>
>>
>>
>


Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Debasish Das
Yeah, for this you can use the breeze quadratic minimizer...it's integrated
with spark in one of my spark PRs.

You have a quadratic objective with equality constraints, which is the primal, and
your proximal operator is positivity, which we already support. I have not given an
API for a linear objective but that should be simple to add. You can open an issue
in breeze for the enhancement.

Alternatively you can use the breeze lpsolver as well, which uses the simplex
implementation from apache commons math.
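
For the LP you describe, a minimal sketch that calls the apache commons math simplex
solver directly (the coefficient arrays below are placeholders for your a, b, c data,
with n = 3 just for illustration):

import org.apache.commons.math3.optim.MaxIter
import org.apache.commons.math3.optim.linear.{LinearConstraint, LinearConstraintSet,
  LinearObjectiveFunction, NonNegativeConstraint, Relationship, SimplexSolver}
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType

val a = Array(1.0, 2.0, 3.0)                      // objective coefficients
val b = Array(0.5, 0.2, 0.3); val bRhs = 0.4      // constraint 2)
val c = Array(0.1, 0.7, 0.2); val cRhs = 0.3      // constraint 3)

val objective = new LinearObjectiveFunction(a, 0.0)
val constraints = Seq(
  new LinearConstraint(Array(1.0, 1.0, 1.0), Relationship.EQ, 1.0),  // constraint 1)
  new LinearConstraint(b, Relationship.EQ, bRhs),
  new LinearConstraint(c, Relationship.EQ, cRhs))

val solution = new SimplexSolver().optimize(
  new MaxIter(200), objective, new LinearConstraintSet(constraints: _*),
  GoalType.MAXIMIZE, new NonNegativeConstraint(true))

println(solution.getPoint.mkString(",") + " -> " + solution.getValue)
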
On Nov 4, 2015 1:05 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com> wrote:

> Hi Debasish Das,
>
> Firstly I must show my deep appreciation towards you kind help.
>
> Yes, my issue is some typical LP related, it is as:
> Objective function:
> f(x1, x2, ..., xn) = a1 * x1 + a2 * x2 + ... + an * xn,   (n would be some
> number bigger than 100)
>
> There are only 4 constraint functions,
> x1 + x2 + ... + xn = 1, 1)
> b1 * x1 + b2 * x2 + ... + bn * xn = b, 2)
> c1 * x1 + c2 * x2 + ... + cn * xn = c, 3)
> x1, x2, ..., xn >= 0 .
>
> To find the solution of x which lets objective function the biggest.
>
> Since simplex method may not be supported by spark. Then I may switch to
> the way as, since the likely solution x must be on the boundary of 1), 2)
> and 3) geometry,
> that is to say, only three xi may be >= 0, all the others must be 0.
> Just look for all that kinds of solutions of 1), 2) and 3), the number
> would be C(n, 3) + C(n, 2) + C(n, 1), at last to select the most optimized
> one.
>
> Since the constraint number is not that large, I think this might be some
> way.
>
> Thank you,
> Zhiliang
>
>
> On Wednesday, November 4, 2015 2:25 AM, Debasish Das <
> debasish.da...@gmail.com> wrote:
>
>
> Spark has nnls in mllib optimization. I have refactored nnls to breeze as
> well but we could not move out nnls from mllib due to some runtime issues
> from breeze.
> Issue in spark or breeze nnls is that it takes dense gram matrix which
> does not scale if rank is high but it has been working fine for nnmf till
> 400 rank.
> I agree with Sean that you need to see if really simplex is needed. Many
> constraints can be formulated as proximal operator and then you can use
> breeze nonlinearminimizer or spark-tfocs package if it is stable.
> On Nov 2, 2015 10:13 AM, "Sean Owen" <so...@cloudera.com> wrote:
>
> I might be steering this a bit off topic: does this need the simplex
> method? this is just an instance of nonnegative least squares. I don't
> think it relates to LDA either.
>
> Spark doesn't have any particular support for NNLS (right?) or simplex
> though.
>
> On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Use breeze simplex which inturn uses apache maths simplex...if you want
> to
> > use interior point method you can use ecos
> > https://github.com/embotech/ecos-java-scala ...spark summit 2014 talk on
> > quadratic solver in matrix factorization will show you example
> integration
> > with spark. ecos runs as jni process in every executor.
> >
> > On Nov 1, 2015 9:52 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com.invalid>
> wrote:
> >>
> >> Hi Ted Yu,
> >>
> >> Thanks very much for your kind reply.
> >> Do you just mean that in spark there is no specific package for simplex
> >> method?
> >>
> >> Then I may try to fix it by myself, do not decide whether it is
> convenient
> >> to finish by spark, before finally fix it.
> >>
> >> Thank you,
> >> Zhiliang
> >>
> >>
> >>
> >>
> >> On Monday, November 2, 2015 1:43 AM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> >>
> >>
> >> A brief search in code base shows the following:
> >>
> >> TODO: Add simplex constraints to allow alpha in (0,1).
> >> ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
> >>
> >> I guess the answer to your question is no.
> >>
> >> FYI
> >>
> >> On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu
> <zchl.j...@yahoo.com.invalid>
> >> wrote:
> >>
> >> Dear All,
> >>
> >> As I am facing some typical linear programming issue, and I know simplex
> >> method is specific in solving LP question,
> >> I am very sorry that whether there is already some mature package in
> spark
> >> about simplex method...
> >>
> >> Thank you very much~
> >> Best Wishes!
> >> Zhiliang
> >>
> >>
> >>
> >>
> >>
> >
>
>
>
>


Re: apply simplex method to fix linear programming in spark

2015-11-03 Thread Debasish Das
Spark has nnls in mllib optimization. I have refactored nnls into breeze as
well, but we could not move nnls out of mllib due to some runtime issues
from breeze.

The issue with the spark or breeze nnls is that it takes a dense gram matrix,
which does not scale if the rank is high, but it has been working fine for
nnmf up to rank 400.

I agree with Sean that you need to see if simplex is really needed. Many
constraints can be formulated as a proximal operator and then you can use
breeze's nonlinearminimizer or the spark-tfocs package if it is stable.
On Nov 2, 2015 10:13 AM, "Sean Owen" <so...@cloudera.com> wrote:

> I might be steering this a bit off topic: does this need the simplex
> method? this is just an instance of nonnegative least squares. I don't
> think it relates to LDA either.
>
> Spark doesn't have any particular support for NNLS (right?) or simplex
> though.
>
> On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Use breeze simplex which inturn uses apache maths simplex...if you want
> to
> > use interior point method you can use ecos
> > https://github.com/embotech/ecos-java-scala ...spark summit 2014 talk on
> > quadratic solver in matrix factorization will show you example
> integration
> > with spark. ecos runs as jni process in every executor.
> >
> > On Nov 1, 2015 9:52 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com.invalid>
> wrote:
> >>
> >> Hi Ted Yu,
> >>
> >> Thanks very much for your kind reply.
> >> Do you just mean that in spark there is no specific package for simplex
> >> method?
> >>
> >> Then I may try to fix it by myself, do not decide whether it is
> convenient
> >> to finish by spark, before finally fix it.
> >>
> >> Thank you,
> >> Zhiliang
> >>
> >>
> >>
> >>
> >> On Monday, November 2, 2015 1:43 AM, Ted Yu <yuzhih...@gmail.com>
> wrote:
> >>
> >>
> >> A brief search in code base shows the following:
> >>
> >> TODO: Add simplex constraints to allow alpha in (0,1).
> >> ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
> >>
> >> I guess the answer to your question is no.
> >>
> >> FYI
> >>
> >> On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu
> <zchl.j...@yahoo.com.invalid>
> >> wrote:
> >>
> >> Dear All,
> >>
> >> As I am facing some typical linear programming issue, and I know simplex
> >> method is specific in solving LP question,
> >> I am very sorry that whether there is already some mature package in
> spark
> >> about simplex method...
> >>
> >> Thank you very much~
> >> Best Wishes!
> >> Zhiliang
> >>
> >>
> >>
> >>
> >>
> >
>


Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Debasish Das
Use the breeze simplex, which in turn uses the apache math simplex...if you want to
use an interior point method you can use ecos
https://github.com/embotech/ecos-java-scala ...the spark summit 2014 talk on the
quadratic solver in matrix factorization shows an example integration
with spark. ecos runs as a jni process in every executor.
On Nov 1, 2015 9:52 AM, "Zhiliang Zhu"  wrote:

> Hi Ted Yu,
>
> Thanks very much for your kind reply.
> Do you just mean that in spark there is no specific package for simplex
> method?
>
> Then I may try to fix it by myself, do not decide whether it is convenient
> to finish by spark, before finally fix it.
>
> Thank you,
> Zhiliang
>
>
>
>
> On Monday, November 2, 2015 1:43 AM, Ted Yu  wrote:
>
>
> A brief search in code base shows the following:
>
> TODO: Add simplex constraints to allow alpha in (0,1).
> ./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
>
> I guess the answer to your question is no.
>
> FYI
>
> On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu 
> wrote:
>
> Dear All,
>
> As I am facing some typical linear programming issue, and I know simplex
> method is specific in solving LP question,
> I am very sorry that whether there is already some mature package in spark
> about simplex method...
>
> Thank you very much~
> Best Wishes!
> Zhiliang
>
>
>
>
>
>


Re: Running 2 spark application in parallel

2015-10-23 Thread Debasish Das
You can run 2 threads in the driver and spark will FIFO-schedule the 2 jobs on
the same spark context you created (same executors and cores)...the same idea is
used for the spark sql thriftserver flow...

For streaming I think it lets you run only one stream at a time even if you
run them on multiple threads in the driver...I have to double check...
On Oct 22, 2015 11:41 AM, "Simon Elliston Ball" 
wrote:

> If yarn has capacity to run both simultaneously it will. You should ensure
> you are not allocating too many executors for the first app and leave some
> space for the second)
>
> You may want to run the application on different yarn queues to control
> resource allocation. If you run as a different user within the same queue
> you should also get an even split between the applications, however you may
> need to enable preemption to ensure the first doesn't just hog the queue.
>
> Simon
>
> On 22 Oct 2015, at 19:20, Suman Somasundar 
> wrote:
>
> Hi all,
>
>
>
> Is there a way to run 2 spark applications in parallel under Yarn in the
> same cluster?
>
>
>
> Currently, if I submit 2 applications, one of them waits till the other
> one is completed.
>
>
>
> I want both of them to start and run at the same time.
>
>
>
> Thanks,
> Suman.
>
>


Re: RDD API patterns

2015-09-17 Thread Debasish Das
RDD nesting can lead to recursive nesting...I would like to know the
use case and why join can't support it...you can always expose an API over an
RDD and access that in another RDD's mapPartitions...use an external data
source like HBase, Cassandra or Redis to back the API...

For your case, group by key and then pass in the logic...collect each group's
sample into a seq and then look it up if you are doing one key at a time...if
doing all of them, try joining it...the pattern is common when every key is
i.i.d. and you are cross-validating a model for each key on an 80% train / 20%
test split (see the sketch below)...

We are looking to fit it into the pipeline flow...with minor mods it will fit...
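
Something like this minimal sketch of the per-key 80/20 pattern (assuming a
SparkContext sc as in the shell; the data and the fitAndScore "model" are toy
placeholders):

import org.apache.spark.rdd.RDD

val data: RDD[(String, Double)] = sc.parallelize(
  Seq.tabulate(1000)(i => (s"key${i % 10}", i.toDouble)))

// placeholder per-key logic: score = |mean(train) - mean(test)|
def fitAndScore(train: Seq[Double], test: Seq[Double]): Double =
  math.abs(train.sum / train.size - test.sum / test.size)

// groupByKey is fine here because each group's sample fits in executor memory
val perKeyMetric = data.groupByKey().map { case (key, values) =>
  val shuffled = scala.util.Random.shuffle(values.toSeq)
  val cut = (shuffled.size * 0.8).toInt
  val (train, test) = shuffled.splitAt(cut)
  (key, fitAndScore(train, test))
}
perKeyMetric.collect().foreach(println)
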
On Sep 16, 2015 6:39 AM, "robineast"  wrote:

> I'm not sure the problem is quite as bad as you state. Both sampleByKey and
> sampleByKeyExact are implemented using a function from
> StratifiedSamplingUtils which does one of two things depending on whether
> the exact implementation is needed. The exact version requires double the
> number of lines of code (17) than the non-exact and has to do extra passes
> over the data to get, for example, the counts per key.
>
> As far as I can see your problem 2 and sampleByKeyExact are very similar
> and
> could be solved the same way. It has been decided that sampleByKeyExact is
> a
> widely useful function and so is provided out of the box as part of the
> PairRDD API. I don't see any reason why your problem 2 couldn't be provided
> in the same way as part of the API if there was the demand for it.
>
> An alternative design would perhaps be something like an extension to
> PairRDD, let's call it TwoPassPairRDD, where certain information for the
> key
> could be provided along with an Iterable e.g. the counts for the key. Both
> sampleByKeyExact and your problem 2 could be implemented in a few less
> lines
> of code.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14148.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[jira] [Commented] (SPARK-10408) Autoencoder

2015-09-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735706#comment-14735706
 ] 

Debasish Das commented on SPARK-10408:
--

[~avulanov] In the MLP can we change BFGS to OWLQN and get L1 regularization? That 
way I can get sparse weights and prune the network to avoid 
overfitting...For the autoencoder, did you experiment with a graphx-based design? 
I would like to work on it. Basically the idea is to come up with an N-layer 
deep autoencoder that can support prediction APIs similar to matrix 
factorization.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf






[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares

2015-09-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734170#comment-14734170
 ] 

Debasish Das edited comment on SPARK-9834 at 9/8/15 3:18 PM:
-

[~mengxr] If you are open to use breeze.proximal.QuadraticMinimizer we can 
support elastic net in this variant as well...The flow will be very similar to 
QuadraticMinimizer integration to ALS...I have done runtime benchmarks compared 
to OWLQN and if we can afford to do dense cholesky QuadraticMinimizer converges 
faster than OWLQN.

There are two new QuadraticMinimizer features I am working on which will 
further improve the solver:
1. sparse ldl through tim davis lgpl code and using breeze sparse matrix for 
sparse gram and conic formulations. Plan is to add it in breeze-native under 
lgpl similar to netlib-java integration.
2. admm acceleration using nesterov method. ADMM can be run in the same 
complexity as FISTA (implemented in TFOCS). Reference: 
http://www.optimization-online.org/DB_FILE/2009/12/2502.pdf

Although in practice I found even the ADMM implemented right now in 
QuadraticMinimizer converges faster than OWLQN. Tom in his paper demonstrated 
faster ADMM convergence compared to FISTA for quadratic problems: 
ftp://ftp.math.ucla.edu/pub/camreport/cam12-35.pdf. 

Due to the X^TX availability in these problems (ALS and linear regression) I 
also compute the min and max eigen values using power iteration 
(breeze.optimize.linear.PowerMethod) in the code which gives the Lipschitz 
estimator L and there is no line search overhead. This trick did not work for 
the nonlinear variant as the hessian estimates are not close to gram matrix !

QuadraticMinimizer is optimized to run at par with blas dposv when there are no 
constraints while BFGS/OWLQN both still have lot of overhead from iterators 
etc. That might also be the reason that I see QuadraticMinimizer is faster than 
BFGS/OWLQN.

It might be the right time to do the micro-benchmark as well that you asked for 
QuadraticMinimizer. Let me know what you think. I can finish up the 
micro-benchmark, bring the runtime of QuadraticMinimizer to ALS 
NormalEquationSolver and then start the L1 experiments.


was (Author: debasish83):
If you are open to use breeze.proximal.QuadraticMinimizer we can support 
elastic net in this variant as well...I can add it on top of your PR...it will 
be very similar to quadraticminimizer integration to ALS...I have done runtime 
benchmarks compared to OWLQN and if we can afford to do dense cholesky 
QuadraticMinimizer converges faster than OWLQN...there are two new features I 
am working on...sparse ldl through tim davis lgpl code and using breeze sparse 
matrix for sparse gram and conic formulations and admm acceleration using 
nesterov method...admm can also be run in the same complexity as FISTA...david 
goldferb proved it.

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares

2015-09-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734170#comment-14734170
 ] 

Debasish Das commented on SPARK-9834:
-

If you are open to use breeze.proximal.QuadraticMinimizer we can support 
elastic net in this variant as well...I can add it on top of your PR...it will 
be very similar to quadraticminimizer integration to ALS...I have done runtime 
benchmarks compared to OWLQN and if we can afford to do dense cholesky 
QuadraticMinimizer converges faster than OWLQN...there are two new features I 
am working on...sparse ldl through tim davis lgpl code and using breeze sparse 
matrix for sparse gram and conic formulations and admm acceleration using 
nesterov method...admm can also be run in the same complexity as FISTA...david 
goldferb proved it.

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






Re: Spark ANN

2015-09-07 Thread Debasish Das
Not sure about dropout, but if you change the solver from breeze BFGS to breeze
OWLQN or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1
regularization, which will yield elastic-net-style sparse solutions...using
that you can clean up edges which have 0.0 as weight...
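
A minimal sketch of the OWLQN side of that idea on a toy quadratic loss (the ANN
loss/gradient would replace the toy DiffFunction; the L1 weight 0.1 is only an
illustration, and the exact OWLQN constructor depends a bit on the breeze version
on your classpath):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

val target = DenseVector(1.0, 0.0, 0.0, 2.0)
val f = new DiffFunction[DenseVector[Double]] {
  // toy smooth loss ||x - target||^2; OWLQN adds the L1 term itself
  def calculate(x: DenseVector[Double]) = {
    val diff = x - target
    (diff.dot(diff), diff * 2.0)
  }
}

// maxIter = 100, memory m = 7, per-coordinate L1 weight = 0.1, tolerance = 1e-6
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 7, (_: Int) => 0.1, 1e-6)
val x = owlqn.minimize(f, DenseVector.zeros[Double](4))
println(x)   // small coordinates get driven exactly to 0.0
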
On Sep 7, 2015 7:35 PM, "Feynman Liang"  wrote:

> BTW thanks for pointing out the typos, I've included them in my MLP
> cleanup PR 
>
> On Mon, Sep 7, 2015 at 7:34 PM, Feynman Liang 
> wrote:
>
>> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is
>> on the roadmap for 1.6
>>  though, and there is
>> a spark package
>>  for
>> dropout regularized logistic regression.
>>
>>
>> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov 
>> wrote:
>>
>>> Thanks!
>>>
>>> It does not look Spark ANN yet supports dropout/dropconnect or any other
>>> techniques that help avoiding overfitting?
>>> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>>> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>>>
>>> ps. There is a small copy-paste typo in
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>>> should read B :)
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
>>> wrote:
>>>
 Backprop is used to compute the gradient here
 ,
 which is then optimized by SGD or LBFGS here
 

 On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> Haven't checked the actual code but that doc says "MLPC employes
> backpropagation for learning the model. .."?
>
>
>
> —
> Sent from Mailbox 
>
>
> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov <
> dautkha...@gmail.com> wrote:
>
>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>
>> Implementation seems missing backpropagation?
>> Was there is a good reason to omit BP?
>> What are the drawbacks of a pure feedforward-only ANN?
>>
>> Thanks!
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>
>

>>>
>>
>


[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2015-09-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734130#comment-14734130
 ] 

Debasish Das commented on SPARK-10078:
--

[~mengxr] will it be Breeze LBFGS modification or part of mllib.optimization ? 
Is  someone looking into it ?

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).






[jira] [Updated] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-4823:

Attachment: SparkMeetup2015-Experiments2.pdf
SparkMeetup2015-Experiments1.pdf

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
 Attachments: MovieLensSimilarity Comparisons.pdf, 
 SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf


 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This is JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






[jira] [Commented] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648340#comment-14648340
 ] 

Debasish Das commented on SPARK-4823:
-

We did more detailed experiment for July 2015 Spark Meetup to understand the 
shuffle effects on runtime. I attached the data for experiments in the JIRA. I 
will update the PR as discussed with Reza. I am targeting 1 PR for Spark 1.5.


 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
 Attachments: MovieLensSimilarity Comparisons.pdf


 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This is JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-28 Thread Debasish Das
That's awesome Yan. I was considering Phoenix for SQL calls to HBase, since
Cassandra supports CQL but HBase was lacking a SQL layer. I will get back to
you as I start using it on our workloads.

I am assuming the latencies won't be much different from accessing HBase
through tsdb's asynchbase, as that's one more option I am looking into.
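
For reference, what I have in mind on my side is a plain JDBC client against the
Spark SQL thrift server; a minimal sketch (host, port, credentials and the
Astro-registered table name "hbase_sales" are all placeholders):

import java.sql.DriverManager

// the Spark SQL thrift server speaks the HiveServer2 protocol
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT key, SUM(amount) FROM hbase_sales GROUP BY key")
while (rs.next()) {
  println(rs.getString(1) + "\t" + rs.getDouble(2))
}
rs.close(); stmt.close(); conn.close()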

On Mon, Jul 27, 2015 at 10:12 PM, Yan Zhou.sc yan.zhou...@huawei.com
wrote:

  HBase in this case is no different from any other Spark SQL data
 sources, so yes you should be able to access HBase data through Astro from
 Spark SQL’s JDBC interface.



 Graphically, the access path is as follows:



 Spark SQL JDBC Interface - Spark SQL Parser/Analyzer/Optimizer-Astro
 Optimizer- HBase Scans/Gets - … - HBase Region server





 Regards,



 Yan



 *From:* Debasish Das [mailto:debasish.da...@gmail.com]
 *Sent:* Monday, July 27, 2015 10:02 PM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev; user
 *Subject:* RE: Package Release Annoucement: Spark SQL on HBase Astro



 Hi Yan,

 Is it possible to access the hbase table through spark sql jdbc layer ?

 Thanks.
 Deb

 On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Yes, but not all SQL-standard insert variants .



 *From:* Debasish Das [mailto:debasish.da...@gmail.com]
 *Sent:* Wednesday, July 22, 2015 7:36 PM
 *To:* Bing Xiao (Bing)
 *Cc:* user; dev; Yan Zhou.sc
 *Subject:* Re: Package Release Annoucement: Spark SQL on HBase Astro



 Does it also support insert operations ?

 On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcomed to try out or deploy the package, and help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union test and extensive performance tests.
 Please use the “Issues” “Pull Requests” links at this package homepage, if
 you want to report bugs, improvement or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package will be useful for ultra low
 latency* query and analytics of large scale data sets in vertical
 enterprises**.* We will continue to work with the community to develop
 new features and improve code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-27 Thread Debasish Das
Hi Yan,

Is it possible to access the hbase table through spark sql jdbc layer ?

Thanks.
Deb
On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

  Yes, but not all SQL-standard insert variants .



 *From:* Debasish Das [mailto:debasish.da...@gmail.com]
 *Sent:* Wednesday, July 22, 2015 7:36 PM
 *To:* Bing Xiao (Bing)
 *Cc:* user; dev; Yan Zhou.sc
 *Subject:* Re: Package Release Annoucement: Spark SQL on HBase Astro



 Does it also support insert operations ?

 On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcomed to try out or deploy the package, and help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union test and extensive performance tests.
 Please use the “Issues” “Pull Requests” links at this package homepage, if
 you want to report bugs, improvement or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package will be useful for ultra low
 latency* query and analytics of large scale data sets in vertical
 enterprises**.* We will continue to work with the community to develop
 new features and improve code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
In your experience with using implicit factorization for document
clustering, how did you tune alpha? Using perplexity measures, or just
something simple like 1 + rating, since the ratings are always positive in
this case?

On Sun, Jul 26, 2015 at 1:23 AM, Sean Owen so...@cloudera.com wrote:

 It sounds like you're describing the explicit case, or any matrix
 decomposition. Are you sure that's best for count-like data? It
 depends, but my experience is that the implicit formulation is
 better. In a way, the difference between 10,000 and 1,000 count is
 less significant than the difference between 1 and 10. However if your
 loss function penalizes the square of the error, then the former case
 not only matters more for the same relative error, it matters 10x more
 than the latter. It's very heavily skewed to pay attention to the
 high-count instances.


 On Sun, Jul 26, 2015 at 9:19 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Yeah, I think the idea of confidence is a bit different than what I am
  looking for using implicit factorization to do document clustering.
 
  I basically need (r_ij - w_ih_j)^2 for all observed ratings and (0 -
  w_ih_j)^2 for all the unobserved ratings...Think about the document x
 word
  matrix where r_ij is the count that's observed, 0 are the word counts
 that
  are not in particular document.
 
  The broadcasted value of gram matrix w_i'wi or h_j'h_j will also count
 the
  r_ij those are observed...So I might be fine using the broadcasted gram
  matrix and use the linear term as \sum (-r_ijw_i) or \sum (-rijh_j)...
 
  I will think further but in the current implicit formulation with
  confidence, looks like I am really factorizing a 0/1 matrix with weights
 1 +
  alpha*rating for  . It's a bit different from LSA model.
 
 
 
 
 
  On Sun, Jul 26, 2015 at 12:34 AM, Sean Owen so...@cloudera.com wrote:
 
  confidence = 1 + alpha * |rating| here (so, c1 means confidence - 1),
  so alpha = 1 doesn't specially mean high confidence. The loss function
  is computed over the whole input matrix, including all missing 0
  entries. These have a minimal confidence of 1 according to this
  formula. alpha controls how much more confident you are in what the
  entries that do exist in the input mean. So alpha = 1 is low-ish and
  means you don't think the existence of ratings means a lot more than
  their absence.
 
  I think the explicit case is similar, but not identical -- here. The
  cost function for the explicit case is not the same, which is the more
  substantial difference between the two. There, ratings aren't inputs
  to a confidence value that becomes a weight in the loss function,
  during this factorization of a 0/1 matrix. Instead the rating matrix
  is the thing being factorized directly.
 
  On Sun, Jul 26, 2015 at 6:45 AM, Debasish Das debasish.da...@gmail.com
 
  wrote:
   Hi,
  
   Implicit factorization is important for us since it drives
   recommendation
   when modeling user click/no-click and also topic modeling to handle 0
   counts
   in document x word matrices through NMF and Sparse Coding.
  
   I am a bit confused on this code:
  
   val c1 = alpha * math.abs(rating)
    if (rating > 0) ls.add(srcFactor, (c1 + 1.0)/c1, c1)
   
    When the alpha = 1.0 (high confidence) and rating is > 0 (true for
 word
   counts), why this formula does not become same as explicit formula:
  
   ls.add(srcFactor, rating, 1.0)
  
   For modeling document, I believe implicit Y'Y needs to stay but we
 need
   explicit ls.add(srcFactor, rating, 1.0)
  
   I am understanding confidence code further. Please let me know if the
   idea
   of mapping implicit to handle 0 counts in document word matrix makes
   sense.
  
   Thanks.
   Deb
  
 
 



Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
I will think further but in the current implicit formulation with
confidence, looks like I am factorizing a 0/1 matrix with weights 1 +
alpha*rating for observed (1) values and 1 for unobserved (0) values. It's
a bit different from LSA model.


 On Sun, Jul 26, 2015 at 6:45 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  Implicit factorization is important for us since it drives
 recommendation
  when modeling user click/no-click and also topic modeling to handle 0
 counts
  in document x word matrices through NMF and Sparse Coding.
 
  I am a bit confused on this code:
 
  val c1 = alpha * math.abs(rating)
   if (rating > 0) ls.add(srcFactor, (c1 + 1.0)/c1, c1)
  
   When the alpha = 1.0 (high confidence) and rating is > 0 (true for word
  counts), why this formula does not become same as explicit formula:
 
  ls.add(srcFactor, rating, 1.0)
 
  For modeling document, I believe implicit Y'Y needs to stay but we need
  explicit ls.add(srcFactor, rating, 1.0)
 
  I am understanding confidence code further. Please let me know if the
 idea
  of mapping implicit to handle 0 counts in document word matrix makes
 sense.
 
  Thanks.
  Deb
 





Re: Confidence in implicit factorization

2015-07-26 Thread Debasish Das
We got good clustering results from implicit factorization using alpha =
1.0, since I wanted a confidence of 1 + rating on observed entries
and 1 on unobserved entries. I used positivity / sparse coding basically to
force sparsity on the document / topic matrix...but then I got confused because
I am modifying the real counts from the dataset (it does not matter much in a
practical sense, since we don't really have true documents).

I mean the gram matrix is the key here, but how much weight to give to the real
counts also matters...I have not yet started looking into perplexity, but
that will give me further insights...

On Sun, Jul 26, 2015 at 1:23 AM, Sean Owen so...@cloudera.com wrote:

 It sounds like you're describing the explicit case, or any matrix
 decomposition. Are you sure that's best for count-like data? It
 depends, but my experience is that the implicit formulation is
 better. In a way, the difference between 10,000 and 1,000 count is
 less significant than the difference between 1 and 10. However if your
 loss function penalizes the square of the error, then the former case
 not only matters more for the same relative error, it matters 10x more
 than the latter. It's very heavily skewed to pay attention to the
 high-count instances.


 On Sun, Jul 26, 2015 at 9:19 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Yeah, I think the idea of confidence is a bit different than what I am
  looking for using implicit factorization to do document clustering.
 
  I basically need (r_ij - w_ih_j)^2 for all observed ratings and (0 -
  w_ih_j)^2 for all the unobserved ratings...Think about the document x
 word
  matrix where r_ij is the count that's observed, 0 are the word counts
 that
  are not in particular document.
 
  The broadcasted value of gram matrix w_i'wi or h_j'h_j will also count
 the
  r_ij those are observed...So I might be fine using the broadcasted gram
  matrix and use the linear term as \sum (-r_ijw_i) or \sum (-rijh_j)...
 
  I will think further but in the current implicit formulation with
  confidence, looks like I am really factorizing a 0/1 matrix with weights
 1 +
    alpha*rating for observed (1) values and 1 for unobserved (0) values. It's a bit different from the LSA model.
 
 
 
 
 
  On Sun, Jul 26, 2015 at 12:34 AM, Sean Owen so...@cloudera.com wrote:
 
  confidence = 1 + alpha * |rating| here (so, c1 means confidence - 1),
  so alpha = 1 doesn't specially mean high confidence. The loss function
  is computed over the whole input matrix, including all missing 0
  entries. These have a minimal confidence of 1 according to this
  formula. alpha controls how much more confident you are in what the
  entries that do exist in the input mean. So alpha = 1 is low-ish and
  means you don't think the existence of ratings means a lot more than
  their absence.
 
  I think the explicit case is similar, but not identical -- here. The
  cost function for the explicit case is not the same, which is the more
  substantial difference between the two. There, ratings aren't inputs
  to a confidence value that becomes a weight in the loss function,
  during this factorization of a 0/1 matrix. Instead the rating matrix
  is the thing being factorized directly.
 
  On Sun, Jul 26, 2015 at 6:45 AM, Debasish Das debasish.da...@gmail.com
 
  wrote:
   Hi,
  
   Implicit factorization is important for us since it drives
   recommendation
   when modeling user click/no-click and also topic modeling to handle 0
   counts
   in document x word matrices through NMF and Sparse Coding.
  
   I am a bit confused on this code:
  
   val c1 = alpha * math.abs(rating)
    if (rating > 0) ls.add(srcFactor, (c1 + 1.0)/c1, c1)
   
    When the alpha = 1.0 (high confidence) and rating is > 0 (true for
 word
   counts), why this formula does not become same as explicit formula:
  
   ls.add(srcFactor, rating, 1.0)
  
   For modeling document, I believe implicit Y'Y needs to stay but we
 need
   explicit ls.add(srcFactor, rating, 1.0)
  
   I am understanding confidence code further. Please let me know if the
   idea
   of mapping implicit to handle 0 counts in document word matrix makes
   sense.
  
   Thanks.
   Deb
  
 
 



Confidence in implicit factorization

2015-07-25 Thread Debasish Das
Hi,

Implicit factorization is important for us since it drives recommendation
when modeling user click/no-click and also topic modeling to handle 0
counts in document x word matrices through NMF and Sparse Coding.

I am a bit confused on this code:

val c1 = alpha * math.abs(rating)
if (rating > 0) ls.add(srcFactor, (c1 + 1.0)/c1, c1)

When the alpha = 1.0 (high confidence) and rating is > 0 (true for word
counts), why does this formula not become the same as the explicit formula:

ls.add(srcFactor, rating, 1.0)

For modeling document, I believe implicit Y'Y needs to stay but we need
explicit ls.add(srcFactor, rating, 1.0)
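
For completeness, the objective I have in mind when I talk about factorizing the full 0/1 matrix with confidence weights is the standard implicit-feedback formulation (Hu/Koren/Volinsky style). The sketch below is my own restatement with made-up names, not the actual MLlib code path:

// Illustrative only: evaluate the implicit-ALS loss on dense toy factors.
// p(i,j) = 1 if r(i,j) > 0 else 0; confidence c(i,j) = 1 + alpha * |r(i,j)|.
def implicitLoss(r: Array[Array[Double]],
                 w: Array[Array[Double]],   // row (user/document) factors
                 h: Array[Array[Double]],   // column (item/word) factors
                 alpha: Double,
                 lambda: Double): Double = {
  var loss = 0.0
  for (i <- r.indices; j <- r(i).indices) {
    val p = if (r(i)(j) > 0) 1.0 else 0.0
    val c = 1.0 + alpha * math.abs(r(i)(j))
    val pred = (w(i), h(j)).zipped.map(_ * _).sum
    loss += c * (p - pred) * (p - pred)   // every cell contributes, including the zeros
  }
  loss + lambda * (w.flatten.map(x => x * x).sum + h.flatten.map(x => x * x).sum)
}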

I am still working through the confidence code. Please let me know if the idea
of mapping implicit feedback to handle 0 counts in the document x word matrix makes sense.

Thanks.
Deb


Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-22 Thread Debasish Das
Does it also support insert operations ?
On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:

  We are happy to announce the availability of the Spark SQL on HBase
 1.0.0 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcomed to try out or deploy the package, and help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union test and extensive performance tests.
 Please use the “Issues” “Pull Requests” links at this package homepage, if
 you want to report bugs, improvement or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package will be useful for ultra low
 latency* query and analytics of large scale data sets in vertical
 enterprises**.* We will continue to work with the community to develop
 new features and improve code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





[akka-user] Re: ANNOUNCE: Akka Streams HTTP 1.0

2015-07-15 Thread Debasish Das
Hi,

First of all congratulations on the release of akka-streams and akka-http !

I am writing a service and Spray was my initial choice, but with the akka-http 
and Spray merge I am more inclined to start learning and using akka-http.

This service needs to manage a SparkContext and most likely Cassandra and 
Elasticsearch. SparkContext needs a dedicated master to start the job over a 
cluster manager like YARN / Mesos that manages the workers, while Cassandra 
works over a gossip protocol on a pool of nodes (I think it can be run on 
Mesos as well). Akka clustering also uses a gossip protocol.

What's a good approach to design such a service? I could run akka-http with 
clustering so that as load grows the service moves from running on one node 
to running on 10 nodes, but then each node will maintain a separate Spark 
context. I am not sure how I can share a SparkContext across an akka-http 
service running with clustering. Most likely each node will maintain a separate 
SparkContext, and that makes sense since the load grew from 1 users to 
10 users for example. The underlying data access can be shared through 
off-heap storage like Tachyon.

Also, are there examples of using akka-http with clustering? Spray had 
examples of running a Spray service on multiple nodes. Will the same example 
be valid for akka-http as well? Pointers on using akka-http with 
clustering will be really helpful.

For Akka Streams, we would like to know how it compares with Kafka and 
Storm, for example. Are there any use cases where people have used Akka 
Streams in place of Kafka? Akka Streams is Scala, and most likely we can 
score a Spark MLlib model directly on features from an akka-stream, with 
no need for the DStream API as such. Is a connector available from 
akka-stream to Spark? It would be great if we could get Storm-style 
streaming latency using akka-stream!
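
To make the last point concrete, the kind of scoring flow I am imagining looks roughly like the sketch below. It is only a sketch: the akka-streams API differs a bit across versions, and the dot-product scorer simply stands in for a real MLlib model.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system = ActorSystem("scoring")
implicit val materializer = ActorMaterializer()

// Stand-in scorer; in practice this would call predict() on a loaded MLlib model.
val weights = Array(0.5, -1.2)
def predict(features: Array[Double]): Double =
  (weights, features).zipped.map(_ * _).sum

val score = Flow[Array[Double]].map(predict)
Source(List(Array(0.1, 0.2), Array(0.3, 0.4)))
  .via(score)
  .runWith(Sink.foreach(println))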

Thanks.
Deb

On Wednesday, July 15, 2015 at 5:40:25 AM UTC-7, Konrad Malawski wrote:

 Dear hakkers,

 we—the Akka committers—are very pleased to announce the final release of 
 Akka Streams  HTTP 1.0. After countless hours and many months of work we 
 now consider Streams  HTTP good enough for evaluation and production use, 
 subject to the caveat on performance below. We will continue to improve the 
 implementation as well as to add features over the coming months, which 
 will be marked as 1.x releases—in particular concerning HTTPS support 
 (exposing certificate information per request and allowing session 
 renegotiation) and websocket client features—before we finally add these 
 new modules to the 2.4 development branch. In the meantime both Streams and 
 HTTP can be used with Akka 2.4 artifacts since these are binary backwards 
 compatibility with Akka 2.3.
 A Note on Performance

 Version 1.0 is fully functional but not yet optimized for performance. To 
 make it very clear: Spray currently is a lot faster at serving HTTP 
 responses than Akka HTTP is. We are aware of this and we know that a lot of 
 you are waiting to use it in anger for high-performance applications, but 
 we follow a “correctness first” approach. After 1.0 is released we will 
 start working on performance benchmarking and optimization, the focus of 
 the 1.1 release will be on closing the gap to Spray.
 What Changed since 1.0–RC4

- 

Plenty documentation improvements on advanced stages 
https://github.com/akka/akka/pull/17966, modularity 
https://github.com/akka/akka/issues/17337 and Http javadsl 
https://github.com/akka/akka/pull/17965,
- 

Improvements to Http stability under high load 
https://github.com/akka/akka/issues/17854,
- 

The streams cook-book translated to Java 
https://github.com/akka/akka/issues/16787,
- 

A number of new stream operators: recover 
https://github.com/akka/akka/pull/17998 and generalized UnzipWith 
https://github.com/akka/akka/pull/17998 contributed by Alexander 
Golubev,
- 

The javadsl for Akka Http https://github.com/akka/akka/pull/17988 is 
now nicer to use from Java 8 and when returning Futures,
- 

also Akka Streams and Http should now be properly packaged for OSGi 
https://github.com/akka/akka/pull/17979, thanks to Rafał Krzewski.

 The complete list of closed tickets can be found in the 1.0 milestones of 
 streams https://github.com/akka/akka/issues?q=milestone%3Astreams-1.0 
 and http https://github.com/akka/akka/issues?q=milestone%3Ahttp-1.0 on 
 github.
 Release Statistics

 Since the RC4 release:

- 

32 tickets closed
- 

252 files changed, 16861 insertions (+), 1834 deletions(-),
- 

… and a total of 9 contributors!

 commits added removed

   262342 335 Johannes Rudolph
   11   10112  97 Endre Sándor Varga
9 757 173 Martynas Mickevičius
82821 487 Konrad Malawski
3  28  49 2beaucoup
3 701 636 Viktor Klang
2  43  

Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in SparkR that mllib / ml don't have? Most of the basic
analysis that you need on a stream can be done through mllib components...
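
For example, the READER/ANALYZER/OUTPUT flow described below can be sketched directly in Spark Streaming with an MLlib summary in the analyze step. This is a rough sketch: the socket source stands in for the custom RabbitMQ receiver and all names are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("reader"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)                  // READER stand-in
val vectors = lines.map(l => Vectors.dense(l.split(",").map(_.toDouble)))
vectors.window(Seconds(60)).foreachRDD { rdd =>                      // ANALYZER over a 60s window
  if (!rdd.isEmpty()) {
    val summary = Statistics.colStats(rdd)
    println(s"mean=${summary.mean} variance=${summary.variance}")    // OUTPUT stand-in
  }
}
ssc.start()
ssc.awaitTermination()
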
On Jul 13, 2015 2:35 PM, Feynman Liang fli...@databricks.com wrote:

 Sorry; I think I may have used poor wording. SparkR will let you use R to
 analyze the data, but it has to be loaded into memory using SparkR (see SparkR
 DataSources
 http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html).
 You will still have to write a Java receiver to store the data into some
 tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
 performing the analysis.

 R specific questions such as windowing in R should go to R-help@; you
 won't be able to use window since that is a Spark Streaming method.

 On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon o...@scene53.com wrote:

 You are helping me understanding stuff here a lot.

 I believe I have 3 last questions..

 If is use java receiver to get the data, how should I save it in memory?
 Using store command or other command?

 Once stored, how R can read that data?

 Can I use window command in R? I guess not because it is a streaming
 command, right? Any other way to window the data?

 Sent from IPhone




 On Mon, Jul 13, 2015 at 2:07 PM -0700, Feynman Liang 
 fli...@databricks.com wrote:

  If you use SparkR then you can analyze the data that's currently in
 memory with R; otherwise you will have to write to disk (eg HDFS).

 On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon o...@scene53.com wrote:

 Thanks again.
 What I'm missing is where can I store the data? Can I store it in spark
 memory and then use R to analyze it? Or should I use hdfs? Any other places
 that I can save the data?

 What would you suggest?

 Thanks...

 Sent from IPhone




 On Mon, Jul 13, 2015 at 1:41 PM -0700, Feynman Liang 
 fli...@databricks.com wrote:

  If you don't require true streaming processing and need to use R for
 analysis, SparkR on a custom data source seems to fit your use case.

 On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon o...@scene53.com wrote:

 Hi, thanks for replying!
 I want to do the entire process in stages. Get the data using Java or
 scala because they are the only Langs that supports custom receivers, 
 keep
 the data somewhere, use R to analyze it, keep the results somewhere,
 output the data to different systems.

 I thought that somewhere can be spark memory using rdd or
 dstreams.. But could it be that I need to keep it in hdfs to make the
 entire process in stages?

 Sent from IPhone




 On Mon, Jul 13, 2015 at 12:07 PM -0700, Feynman Liang 
 fli...@databricks.com wrote:

  Hi Oded,

 I'm not sure I completely understand your question, but it sounds
 like you could have the READER receiver produce a DStream which is
 windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT.
 However, streaming in SparkR is not currently supported (SPARK-6803
 https://issues.apache.org/jira/browse/SPARK-6803) so I'm not too
 sure how ANALYZER would fit in.

 Feynman

 On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon o...@scene53.com
 wrote:

 any help / idea will be appreciated :)
 thanks


 Regards,
 Oded Maimon
 Scene53.

 On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon o...@scene53.com
 wrote:

 Hi All,
 we are evaluating spark for real-time analytic. what we are trying
 to do is the following:

- READER APP- use custom receiver to get data from rabbitmq
(written in scala)
- ANALYZER APP - use spark R application to read the data
(windowed), analyze it every minute and save the results inside 
 spark
- OUTPUT APP - user spark application (scala/java/python) to
read the results from R every X minutes and send the data to few 
 external
systems

 basically at the end i would like to have the READER COMPONENT as
 an app that always consumes the data and keeps it in spark,
 have as many ANALYZER COMPONENTS as my data scientists wants, and
 have one OUTPUT APP that will read the ANALYZER results and send it 
 to any
 relevant system.

 what is the right way to do it?

 Thanks,
 Oded.





 *This email and any files transmitted with it are confidential and
 intended solely for the use of the individual or entity to whom they 
 are
 addressed. Please note that any disclosure, copying or distribution of 
 the
 content of this information is strictly forbidden. If you have received
 this email message in error, please destroy it immediately and notify 
 its
 sender.*



 *This email and any files transmitted with it are confidential and
 intended solely for the use of the individual or entity to whom they are
 addressed. Please note that any disclosure, copying or distribution of 
 the
 content of this information is strictly forbidden. If you have received
 this email message in error, please destroy it immediately and notify its
 sender.*



 *This email and any files transmitted with it are confidential and
 intended solely for the use of the individual or entity to whom 

Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from
1000 users to 1 users ?

On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:

 I have almost the same case. I will tell you what I am actually doing, if
 it
 is according to your requirement, then I will love to help you.

 1. my database is aerospike. I get data from it.
 2. written standalone spark app (it does not run in standalone mode, but
 with simple java command or maven command), even I make its JAR file with
 simple java command and it runs spark without interactive shell.
 3. Write different methods based on different queries in standalone spark
 app.
 4. my restful API is based on NodeJS, user sends his request through
 NodeJS,
 request pass to my simple java maven project, execute particular spark
 method based on query, returns result to NodeJS in JSON format.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-application-with-a-RESTful-API-tp23654p23831.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
If you take bitmap indices out of Sybase, then I am guessing Spark SQL will
be on par with Sybase?

On that note, are there plans to integrate the IndexedRDD ideas into Spark SQL
to build indices? Is there a JIRA tracking it?
On Jun 30, 2015 7:29 PM, Eric Pederson eric...@gmail.com wrote:

 Hi Debasish:

 We have the same dataset running on SybaseIQ and after the caches are warm
 the queries come back in about 300ms.  We're looking at options to relieve
 overutilization and to bring down licensing costs.  I realize that Spark
 may not be the best fit for this use case but I'm interested to see how far
 it can be pushed.

 Thanks for your help!


 -- Eric

 On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das debasish.da...@gmail.com
 wrote:

  I got good runtime improvement from Hive partitioning, caching the
 dataset and increasing the cores through repartition...I think for your
 case generating mysql style indexing will help further..it is not supported
 in spark sql yet...

 I know the dataset might be too big for 1 node mysql but do you have a
 runtime estimate from running the same query on mysql with appropriate
 column indexing ? That should give us a good baseline number...

 For my case at least I could not put the data on 1 node mysql as it was
 big...

 If you can write the problem in a document view you can use a document
 store like solr/elastisearch to boost runtime...the reverse indices can get
 you subsecond latencies...again the schema design matters for that and you
 might have to let go some of sql expressiveness (like balance in a
 predefined bucket might be fine but looking for the exact number might be
 slow)





Re: Subsecond queries possible?

2015-06-30 Thread Debasish Das
I got a good runtime improvement from Hive partitioning, caching the dataset,
and increasing the cores through repartition...I think for your case
generating MySQL-style indexing will help further...it is not supported in
Spark SQL yet...

I know the dataset might be too big for single-node MySQL, but do you have a
runtime estimate from running the same query on MySQL with appropriate
column indexing? That should give us a good baseline number...

For my case at least I could not put the data on single-node MySQL as it was
big...

If you can cast the problem in a document view you can use a document
store like Solr/Elasticsearch to boost runtime...the inverted indices can get
you subsecond latencies...again the schema design matters for that and you
might have to let go of some SQL expressiveness (a balance in a
predefined bucket might be fine, but looking for the exact number might be
slow)
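
To make that recipe concrete, the partition-prune + cache + repartition combination I mean looks roughly like this (table and column names are made up, and it assumes a HiveContext bound to sqlContext):

val df = sqlContext.table("trades")                  // Hive-partitioned source table
  .filter("trade_date = '2015-06-30'")               // partition pruning
  .repartition(64)                                   // more partitions -> more cores in play
df.registerTempTable("trades_hot")
sqlContext.cacheTable("trades_hot")                  // in-memory columnar cache
sqlContext.sql("SELECT account, SUM(balance) FROM trades_hot GROUP BY account").show()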


Gossip protocol in Master selection

2015-06-28 Thread Debasish Das
Hi,

Akka cluster uses gossip protocol for Master election. The approach in
Spark right now is to use Zookeeper for high availability.

Interestingly Cassandra and Redis clusters are both using Gossip protocol.

I am not sure what the default behavior is right now. If the master dies
and ZooKeeper selects a new master, will the whole dependency graph be
re-executed, or will only the unfinished stages be restarted?

Also why the zookeeper based HA was preferred in Spark ? I was wondering if
there is JIRA to add gossip protocol for Spark Master election ?

In the code I see zookeeper, filesystem, custom and default is
MonarchyLeader. So looks like Spark is designed to add new
leaderElectionAgent.

Thanks.
Deb


Re: Velox Model Server

2015-06-24 Thread Debasish Das
Model sizes are 10m x rank, 100k x rank range.

For recommendation/topic modeling I can run batch recommendAll and then
keep serving the model using a distributed cache, but then I can't
incorporate a per-user re-predict when user feedback makes the
current top-k stale. I have to wait for the next batch refresh, which might be
an hour away.

Spark Job Server + Spark SQL can get me fresh updates, but running a predict
each time might be slow.

I am guessing the better idea might be to start with batch recommendAll and
then update the per-user model when it gets stale, but that needs access to the
key-value store and the model over an API like Spark Job Server. I am
running experiments with Job Server. In general it will be nice if my key-value
store and model are both managed by the same Akka-based API.

Yes, Spark SQL is to filter/boost recommendation results using business logic
like user demographics, for example.
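
For concreteness, the per-user re-predict I keep referring to is essentially scoring one user vector against the in-memory item factors; a minimal sketch where ids, factors, and k are illustrative:

// Score one user vector against all item factors and keep the top k.
def topK(userVec: Array[Float],
         itemFactors: Array[(Int, Array[Float])],
         k: Int): Seq[(Int, Float)] = {
  itemFactors.map { case (itemId, itemVec) =>
    var s = 0f
    var i = 0
    while (i < userVec.length) { s += userVec(i) * itemVec(i); i += 1 }
    (itemId, s)
  }.sortBy(-_._2).take(k).toSeq
}
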
On Jun 23, 2015 2:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes, and typically needs are 100ms. Now imagine even 10 concurrent
 requests. My experience has been that this approach won't nearly
 scale. The best you could probably do is async mini-batch
 near-real-time scoring, pushing results to some store for retrieval,
 which could be entirely suitable for your use case.

 On Tue, Jun 23, 2015 at 8:52 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  If your recommendation needs are real-time (1s) I am not sure job server
  and computing the refs with spark will do the trick (though those new
  BLAS-based methods may have given sufficient speed up).



Re: Velox Model Server

2015-06-24 Thread Debasish Das
Thanks Nick, Sean for the great suggestions...

Since you guys have already hit these issues before, I think it will be
great if we can add that learning to Spark Job Server and enhance it for the
community.

Nick, do you see any major issues in using Spray over Scalatra?

It looks like the model server API layer needs access to a performant KV store
(Redis/Memcached), Elasticsearch (we used Solr before for item-item serving,
but I liked the Spark-Elasticsearch integration: the REST layer is Netty based
unlike Solr's Jetty, and the YARN client looks more stable, so it is worthwhile
to see if it improves over Solr-based serving) and ML models (which are moving
towards the Spark SQL style in 1.3/1.4 with the introduction of the Pipeline API).

An initial version of the KV store might be a simple LRU cache.

For the KV store, are there any comparisons available between IndexedRDD and
Redis/Memcached?

Velox is using CoreOS's etcd client (which is Go based) but I am not sure if
it is used as a full-fledged distributed cache or not. Maybe it is being
used as a ZooKeeper alternative.
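
On the simple LRU cache point, a minimal single-process sketch (capacity is illustrative, and this is obviously not a distributed store) would be something like:

import java.util.Map.Entry
import java.util.{LinkedHashMap => JLinkedHashMap}

// Access-ordered LinkedHashMap that evicts the eldest entry once `capacity` is exceeded.
class LruCache[K, V](capacity: Int) {
  private val map = new JLinkedHashMap[K, V](capacity, 0.75f, true) {
    override def removeEldestEntry(eldest: Entry[K, V]): Boolean = size() > capacity
  }
  def get(k: K): Option[V] = Option(map.get(k))
  def put(k: K, v: V): Unit = map.put(k, v)
}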


On Wed, Jun 24, 2015 at 2:02 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Ok

 My view is with only 100k items, you are better off serving in-memory
 for items vectors. i.e. store all item vectors in memory, and compute user
 * item score on-demand. In most applications only a small proportion of
 users are active, so really you don't need all 10m user vectors in memory.
 They could be looked up from a K-V store and have an LRU cache in memory
 for say 1m of those. Optionally also update them as feedback comes in.

 As far as I can see, this is pretty much what velox does except it
 partitions all user vectors across nodes to scale.

 Oryx does almost the same but Oryx1 kept all user and item vectors in
 memory (though I am not sure about whether Oryx2 still stores all user and
 item vectors in memory or partitions in some way).

 Deb, we are using a custom Akka-based model server (with Scalatra
 frontend). It is more focused on many small models in-memory (largest of
 these is around 5m user vectors, 100k item vectors, with factor size
 20-50). We use Akka cluster sharding to allow scale-out across nodes if
 required. We have a few hundred models comfortably powered by m3.xlarge AWS
 instances. Using floats you could probably have all of your factors in
 memory on one 64GB machine (depending on how many models you have).

 Our solution is not that generic and a little hacked-together - but I'd be
 happy to chat offline about sharing what we've done. I think it still has a
 basic client to the Spark JobServer which would allow triggering
 re-computation jobs periodically. We currently just run batch
 re-computation and reload factors from S3 periodically.

 We then use Elasticsearch to post-filter results and blend content-based
 stuff - which I think might be more efficient than SparkSQL for this
 particular purpose.

 On Wed, Jun 24, 2015 at 8:59 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Model sizes are 10m x rank, 100k x rank range.

 For recommendation/topic modeling I can run batch recommendAll and then
 keep serving the model using a distributed cache but then I can't
 incorporate per user model re-predict if user feedback is making the
 current topk stale. I have to wait for next batch refresh which might be 1
 hr away.

 spark job server + spark sql can get me fresh updates but each time
 running a predict might be slow.

 I am guessing the better idea might be to start with batch recommendAll
 and then update the per user model if it get stale but that needs acess to
 the key value store and the model over a API like spark job server. I am
 running experiments with job server. In general it will be nice if my key
 value store and model are both managed by same akka based API.

 Yes sparksql is to filter/boost recommendation results using business
 logic like user demography for example..
 On Jun 23, 2015 2:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes, and typically needs are 100ms. Now imagine even 10 concurrent
 requests. My experience has been that this approach won't nearly
 scale. The best you could probably do is async mini-batch
 near-real-time scoring, pushing results to some store for retrieval,
 which could be entirely suitable for your use case.

 On Tue, Jun 23, 2015 at 8:52 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  If your recommendation needs are real-time (1s) I am not sure job
 server
  and computing the refs with spark will do the trick (though those new
  BLAS-based methods may have given sufficient speed up).





Spark SQL 1.3 Exception

2015-06-24 Thread Debasish Das
Hi,

I have an Impala-created table with the following IO format and serde:

inputFormat:parquet.hive.DeprecatedParquetInputFormat,
outputFormat:parquet.hive.DeprecatedParquetOutputFormat,
serdeInfo:SerDeInfo(name:null,
serializationLib:parquet.hive.serde.ParquetHiveSerDe, parameters:{})
I am trying to read this table on Spark SQL 1.3 and see if caching improves
my query latency but I am getting exception:

java.lang.ClassNotFoundException: Class parquet.hive.serde.ParquetHiveSerDe
not found
I understand that in hive 0.13 (which I am using)
parquet.hive.serde.ParquetHiveSerDe is deprecated but it seems Impala still
used it to write the table.

I also tried to provide the bundle jar (which has
org.apache.parquet.hive.serde.ParquetHiveSerDe) with the --jars option to the
Spark 1.3 shell / SQL, but I am confused about how to configure the serde in SQLContext.
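
Concretely, is something along these lines the right way to make a serde jar visible to the session? This is only a hedged sketch: the path is illustrative, and I am not even sure the bundle jar contains the old parquet.hive.serde.ParquetHiveSerDe class.

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("ADD JAR /path/to/parquet-hive-bundle.jar")           // illustrative path
hc.sql("SELECT COUNT(*) FROM impala_created_table").show()   // illustrative table name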

The table which has the following io format and serde can be read fine by
Spark SQL 1.3:

inputFormat=org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,
outputFormat=org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,
serializationLib=org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

Thanks.
Deb

On Sat, Jun 20, 2015 at 12:21 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi,

 I have some impala created parquet tables which hive 0.13.2 can read fine.

 Now the same table when I want to read using Spark SQL 1.3 I am getting
 exception class exception that parquet.hive.serde.ParquetHiveSerde not
 found.

 I am assuming that hive somewhere is putting the parquet-hive-bundle.jar
 in hive classpath but I tried putting the parquet-hive-bundle.jar in
 spark-1.3/conf/hive-site.xml through auxillary jar but even that did not
 work.

 Any input on fixing this will be really helpful.

 Thanks.
 Deb



Re: Velox Model Server

2015-06-22 Thread Debasish Das
The models I am looking for are mostly factorization-based models (which
include both recommendation and topic modeling use cases).
For recommendation models, I need a combination of Spark SQL and the ML model
prediction API...I think Spark Job Server is what I am looking for; it
has a fast HTTP REST backend through Spray which will scale fine through Akka.
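
For context, the kind of Spray endpoint I have in mind is roughly the following minimal sketch; the route, port, and scoring stub are all made up:

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object ModelServer extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("model-server")

  def score(user: Int): Double = 0.42   // stand-in for a real model lookup

  startServer(interface = "0.0.0.0", port = 8080) {
    path("score" / IntNumber) { user =>
      get {
        complete(score(user).toString)
      }
    }
  }
}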

Out of curiosity why netty?
What model are you serving?
Velox doesn't look like it is optimized for cases like ALS recs, if that's
what you mean. I think scoring ALS at scale in real time takes a fairly
different approach.
The servlet engine probably doesn't matter at all in comparison.

On Sat, Jun 20, 2015, 9:40 PM Debasish Das debasish.da...@gmail.com wrote:

 After getting used to Scala, writing Java is too much work :-)

 I am looking for scala based project that's using netty at its core (spray
 is one example).

 prediction.io is an option but that also looks quite complicated and not
 using all the ML features that got added in 1.3/1.4

 Velox built on top of ML / Keystone ML pipeline API and that's useful but
 it is still using javax servlets which is not netty based.

 On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server
 component at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or
 it uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build
 upon.

 Thanks.
 Deb



 --
 - Charles







Impala created parquet tables

2015-06-20 Thread Debasish Das
Hi,

I have some Impala-created Parquet tables which Hive 0.13.2 can read fine.

Now when I want to read the same table using Spark SQL 1.3, I am getting a
class-not-found exception: parquet.hive.serde.ParquetHiveSerDe is not
found.

I am assuming that Hive somewhere puts the parquet-hive-bundle.jar on the
Hive classpath, but I tried adding the parquet-hive-bundle.jar in
spark-1.3/conf/hive-site.xml as an auxiliary jar and even that did not
work.

Any input on fixing this will be really helpful.

Thanks.
Deb


Velox Model Server

2015-06-20 Thread Debasish Das
Hi,

The demo of end-to-end ML pipeline including the model server component at
Spark Summit was really cool.

I was wondering if the Model Server component is based upon Velox or it
uses a completely different architecture.

https://github.com/amplab/velox-modelserver

We are looking for an open source version of model server to build upon.

Thanks.
Deb


Re: Welcoming some new committers

2015-06-20 Thread Debasish Das
Congratulations to All.

DB, great work in bringing quasi-Newton methods to Spark!

On Wed, Jun 17, 2015 at 3:18 PM, Chester Chen ches...@alpinenow.com wrote:

 Congratulations to All.

 DB and Sandy, great works !


 On Wed, Jun 17, 2015 at 3:12 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hey all,

 Over the past 1.5 months we added a number of new committers to the
 project, and I wanted to welcome them now that all of their respective
 forms, accounts, etc are in. Join me in welcoming the following new
 committers:

 - Davies Liu
 - DB Tsai
 - Kousuke Saruta
 - Sandy Ryza
 - Yin Huai

 Looking forward to more great contributions from all of these folks.

 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Velox Model Server

2015-06-20 Thread Debasish Das
Integration of model server with ML pipeline API.

On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto don...@prediction.io wrote:

 Mind if I ask what 1.3/1.4 ML features that you are looking for?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote:

 After getting used to Scala, writing Java is too much work :-)

 I am looking for scala based project that's using netty at its core
 (spray is one example).

 prediction.io is an option but that also looks quite complicated and not
 using all the ML features that got added in 1.3/1.4

 Velox built on top of ML / Keystone ML pipeline API and that's useful but
 it is still using javax servlets which is not netty based.

 On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server
 component at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or
 it uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build
 upon.

 Thanks.
 Deb



 --
 - Charles






 --
 Donald Szeto
 PredictionIO




Re: Velox Model Server

2015-06-20 Thread Debasish Das
After getting used to Scala, writing Java is too much work :-)

I am looking for a Scala-based project that's using Netty at its core (Spray
is one example).

prediction.io is an option, but that also looks quite complicated and does not
use all the ML features that were added in 1.3/1.4.

Velox is built on top of the ML / KeystoneML pipeline API and that's useful, but
it is still using javax servlets, which are not Netty based.

On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 Oops, that link was for Oryx 1. Here's the repo for Oryx 2:
 https://github.com/OryxProject/oryx

 On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Debasish,

 The Oryx project (https://github.com/cloudera/oryx), which is Apache 2
 licensed, contains a model server that can serve models built with MLlib.

 -Sandy

 On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com
 wrote:

 Is velox NOT open source?


 On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 The demo of end-to-end ML pipeline including the model server component
 at Spark Summit was really cool.

 I was wondering if the Model Server component is based upon Velox or it
 uses a completely different architecture.

 https://github.com/amplab/velox-modelserver

 We are looking for an open source version of model server to build upon.

 Thanks.
 Deb



 --
 - Charles






Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also, I am not sure how threading helps here, because Spark puts one partition on
each core. On each core there may be multiple hardware threads if you are using
Intel hyper-threading, but I would let Spark handle the threading.
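
For reference, the blockified scoring that Nick and Sab describe below comes down to one matrix-matrix multiply per (user block, item block) pair; a hedged sketch using Breeze, where block construction and sizes are illustrative:

import breeze.linalg.DenseMatrix

// Score one block of users against one block of items with a single multiply.
def scoreBlock(userIds: Array[Int],
               userBlock: DenseMatrix[Double],   // numUsersInBlock x rank
               itemIds: Array[Int],
               itemBlock: DenseMatrix[Double],   // rank x numItemsInBlock
               k: Int): Seq[(Int, Seq[(Int, Double)])] = {
  val scores = userBlock * itemBlock              // level-3 BLAS under the hood
  userIds.indices.map { u =>
    val row = scores(u, ::).t.toArray
    (userIds(u), row.zip(itemIds).sortBy(-_._1).take(k).map(_.swap).toSeq)
  }
}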

On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:

 We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
 dgemm based calculation.

 On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine. In
 my case, there can be even more products. You simply broadcast blocks of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there lots of product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is avalailable on 1.4 is a recommendProducts
 method that blockifies the factors to make use of level 3 BLAS (ie
 matrix-matrix multiply). I am not sure if this is available in The Python
 api yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 Actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper caching
 and use of broadcast variables this will take a while defending on the size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation , I have about 30
 ,000 products and 90 Millions users.
 When i try the predict all it fails.
 I have been trying to formulate the problem as a Matrix multiplication
 where
 I first get the product features, broadcast them and then do a dot
 product.
 Its still very slow. Any reason why
 here is a sample code

 import numpy
 from pyspark.mllib.recommendation import MatrixFactorizationModel

 def doMultiply(x):
     a = []
     # dot the user vector x against every broadcast product factor
     mylen = len(pf.value)
     for i in range(mylen):
         myprod = numpy.dot(x, pf.value[i][1])
         a.append(myprod)
     return a

 myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
 # I need to select which products to broadcast but lets try all
 m1 = myModel.productFeatures().sample(False, 0.001)
 pf = sc.broadcast(m1.collect())
 uf = myModel.userFeatures()
 f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com
 .

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


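A rough PySpark sketch of the blocked scoring approach described above (an illustration, not code from the thread; modelPath and topK are placeholders, and it assumes the ~30K item factors fit in memory on each executor):

import numpy as np
from pyspark.mllib.recommendation import MatrixFactorizationModel

model = MatrixFactorizationModel.load(sc, modelPath)

# Stack the item factors into one (numItems x rank) matrix and broadcast it.
itemFactors = model.productFeatures().collect()
itemIds = np.array([i for (i, _) in itemFactors])
itemMat = np.array([list(f) for (_, f) in itemFactors])
itemIdsBc = sc.broadcast(itemIds)
itemMatBc = sc.broadcast(itemMat)

def scorePartition(userIter, topK=10):
    users = list(userIter)
    if not users:
        return
    # (usersInPartition x rank) block of user factors
    userMat = np.array([list(f) for (_, f) in users])
    # one matrix-matrix multiply (BLAS gemm) per partition, not a per-user Python loop
    scores = userMat.dot(itemMatBc.value.T)
    for row, (userId, _) in zip(scores, users):
        top = np.argsort(-row)[:topK]
        yield (userId, [(int(itemIdsBc.value[i]), float(row[i])) for i in top])

recs = model.userFeatures().mapPartitions(scorePartition)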

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
dgemm based calculation.

On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine. In
 my case, there can be even more products. You simply broadcast blocks of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there lots of product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is avalailable on 1.4 is a recommendProducts
 method that blockifies the factors to make use of level 3 BLAS (ie
 matrix-matrix multiply). I am not sure if this is available in The Python
 api yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 Actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper caching
 and use of broadcast variables this will take a while defending on the size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation , I have about 30
 ,000 products and 90 Millions users.
 When i try the predict all it fails.
 I have been trying to formulate the problem as a Matrix multiplication
 where
 I first get the product features, broadcast them and then do a dot
 product.
 Its still very slow. Any reason why
 here is a sample code

 import numpy
 from pyspark.mllib.recommendation import MatrixFactorizationModel

 def doMultiply(x):
     a = []
     # dot the user vector x against every broadcast product factor
     mylen = len(pf.value)
     for i in range(mylen):
         myprod = numpy.dot(x, pf.value[i][1])
         a.append(myprod)
     return a

 myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
 # I need to select which products to broadcast but lets try all
 m1 = myModel.productFeatures().sample(False, 0.001)
 pf = sc.broadcast(m1.collect())
 uf = myModel.userFeatures()
 f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







 --

 Architect - Big Data
 Ph: +91 99805 99458

 Manthan Systems | *Company of the year - Analytics (2014 Frost and
 Sullivan India ICT)*
 +++





Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also, in my experiments, it's much faster to do blocked BLAS through cartesian
rather than doing sc.union. Here are the details of the experiments:

https://issues.apache.org/jira/browse/SPARK-4823

On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Also, I am not sure how threading helps here because Spark assigns a partition
 to each core. On each core there may be multiple hardware threads if you are
 using Intel hyperthreading, but I would let Spark handle the threading.

 On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
 dgemm based calculation.

 On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine.
 In my case, there can be even more products. You simply broadcast blocks of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there lots of product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is avalailable on 1.4 is a recommendProducts
 method that blockifies the factors to make use of level 3 BLAS (ie
 matrix-matrix multiply). I am not sure if this is available in The Python
 api yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 Actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper caching
 and use of broadcast variables this will take a while depending on the size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation , I have about
 30
 ,000 products and 90 Millions users.
 When i try the predict all it fails.
 I have been trying to formulate the problem as a Matrix multiplication
 where
 I first get the product features, broadcast them and then do a dot
 product.
 Its still very slow. Any reason why
 here is a sample code

 import numpy
 from pyspark.mllib.recommendation import MatrixFactorizationModel

 def doMultiply(x):
     a = []
     # dot the user vector x against every broadcast product factor
     mylen = len(pf.value)
     for i in range(mylen):
         myprod = numpy.dot(x, pf.value[i][1])
         a.append(myprod)
     return a

 myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
 # I need to select which products to broadcast but lets try all
 m1 = myModel.productFeatures().sample(False, 0.001)
 pf = sc.broadcast(m1.collect())
 uf = myModel.userFeatures()
 f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


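A rough sketch of the blocked-cartesian variant (again an illustration, not the code attached to SPARK-4823; modelPath and blockSize are placeholders): pack each side's factors into dense blocks and run one gemm per block pair produced by cartesian, instead of unioning per-block results.

import numpy as np
from pyspark.mllib.recommendation import MatrixFactorizationModel

model = MatrixFactorizationModel.load(sc, modelPath)

def toBlocks(factors, blockSize=1000):
    # factors: RDD[(id, factor)] -> RDD[(ids, matrix)], up to blockSize rows per block
    def pack(part):
        part = list(part)
        for start in range(0, len(part), blockSize):
            chunk = part[start:start + blockSize]
            yield ([i for (i, _) in chunk],
                   np.array([list(f) for (_, f) in chunk]))
    return factors.mapPartitions(pack)

userBlocks = toBlocks(model.userFeatures())
itemBlocks = toBlocks(model.productFeatures())

def gemmBlockPair(pair):
    (userIds, userMat), (itemIds, itemMat) = pair
    # level-3 BLAS on every (user block, item block) pair
    return (userIds, itemIds, userMat.dot(itemMat.T))

blockScores = userBlocks.cartesian(itemBlocks).map(gemmBlockPair)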

Re: Does MLLib has attribute importance?

2015-06-18 Thread Debasish Das
Running L1 and picking the non-zero coefficients gives a good estimate of the
interesting features as well...
On Jun 17, 2015 4:51 PM, Xiangrui Meng men...@gmail.com wrote:

 We don't have it in MLlib. The closest would be the ChiSqSelector,
 which works for categorical data. -Xiangrui

 On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov dautkha...@gmail.com
 wrote:
  What would be closest equivalent in MLLib to Oracle Data Miner's
 Attribute
  Importance mining function?
 
 
 http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920
 
  Attribute importance is a supervised function that ranks attributes
  according to their significance in predicting a target.
 
 
  Best regards,
  Ruslan Dautkhanov

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
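To make the L1 suggestion concrete, a small sketch (an illustration, not from the thread; data is assumed to be an RDD of LabeledPoint and featureNames a list naming each feature index): fit with an L1 penalty and treat the attributes whose coefficients survive as the important ones.

import numpy as np
from pyspark.mllib.classification import LogisticRegressionWithSGD

model = LogisticRegressionWithSGD.train(data, iterations=200,
                                        regType="l1", regParam=0.1)

weights = np.array(model.weights.toArray())
selected = [(featureNames[i], w) for i, w in enumerate(weights) if abs(w) > 1e-6]
# a larger surviving |coefficient| roughly means a more important attribute
selected.sort(key=lambda t: -abs(t[1]))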




[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-06-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583886#comment-14583886
 ] 

Debasish Das commented on SPARK-2336:
-

Very cool idea Sen. Did you also look into FLANN for randomized KDTree and 
KMeansTree. We have a PR for rowSimilarities which we will use to compare the 
QoR of your PR as soon as you open up a stable version.


 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within from each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2336) Approximate k-NN Models for MLLib

2015-06-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583886#comment-14583886
 ] 

Debasish Das edited comment on SPARK-2336 at 6/12/15 6:51 PM:
--

Very cool idea Sen. Did you also look into FLANN for randomized KDTree and 
KMeansTree. We have a PR for rowSimilarities 
https://github.com/apache/spark/pull/6213 for brute force KNN generation which 
we will use to compare the QoR of your PR as soon as you open up a stable 
version.



was (Author: debasish83):
Very cool idea Sen. Did you also look into FLANN for randomized KDTree and 
KMeansTree. We have a PR for rowSimilarities which we will use to compare the 
QoR of your PR as soon as you open up a stable version.


 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within from each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Linear Regression with SGD

2015-06-10 Thread Debasish Das
It's always better to use a quasi-Newton solver if the runtime and problem
scale permit, as there are guarantees on optimization...OWL-QN and BFGS are
both quasi-Newton.

Most single-node code bases will run quasi-Newton solves...if you are
using SGD, it is better to use AdaDelta/AdaGrad or similar tricks...David added
some of them in breeze recently...
On Jun 9, 2015 7:25 PM, DB Tsai dbt...@dbtsai.com wrote:

 As Robin suggested, you may try the following new implementation.


 https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef

 Thanks.

 Sincerely,

 DB Tsai
 --
 Blog: https://www.dbtsai.com
 PGP Key ID: 0xAF08DF8D
 https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D

 On Tue, Jun 9, 2015 at 3:22 PM, Robin East robin.e...@xense.co.uk wrote:

 Hi Stephen

 How many is a very large number of iterations? SGD is notorious for
 requiring 100s or 1000s of iterations, also you may need to spend some time
 tweaking the step-size. In 1.4 there is an implementation of ElasticNet
 Linear Regression which is supposed to compare favourably with an
 equivalent R implementation.
  On 9 Jun 2015, at 22:05, Stephen Carman scar...@coldlight.com wrote:
 
  Hi User group,
 
  We are using spark Linear Regression with SGD as the optimization
 technique and we are achieving very sub-optimal results.
 
  Can anyone shed some light on why this implementation seems to produce
 such poor results vs our own implementation?
 
  We are using a very small dataset, but we have to use a very large
 number of iterations to achieve similar results to our implementation,
 we’ve tried normalizing the data
  not normalizing the data and tuning every param. Our implementation is
 a closed form solution so we should be guaranteed convergence but the spark
 one is not, which is
  understandable, but why is it so far off?
 
  Has anyone experienced this?
 
  Steve Carman, M.S.
  Artificial Intelligence Engineer
  Coldlight-PTC
  scar...@coldlight.com
  This e-mail is intended solely for the above-mentioned recipient and it
 may contain confidential or privileged information. If you have received it
 in error, please notify us immediately and delete the e-mail. You must not
 copy, distribute, disclose or take any action in reliance on it. In
 addition, the contents of an attachment to this e-mail may contain software
 viruses which could damage your own computer system. While ColdLight
 Solutions, LLC has taken every reasonable precaution to minimize this risk,
 we cannot accept liability for any damage which you sustain as a result of
 software viruses. You should perform your own virus checks before opening
 the attachment.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
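For reference, a minimal sketch of the 1.4 ElasticNet route mentioned above (a toy two-row DataFrame just to show the shape of the API, not a real training set):

from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
train = sqlContext.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0]))],
    ["label", "features"])

# elasticNetParam mixes the L1/L2 penalties; this path uses a quasi-Newton solver
# under the hood rather than plain SGD
lr = LinearRegression(maxIter=100, regParam=0.01, elasticNetParam=0.5)
model = lr.fit(train)
model.transform(train).show()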





Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list? An in-order traversal (or some other traversal) of a
fitted decision tree?
On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:

 Is there an existing way in SparkML to convert a decision tree to a
 decision list?

 On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh r...@databricks.com wrote:

 The closest algorithm to decision lists that we have is decision trees
 https://spark.apache.org/docs/latest/mllib-decision-tree.html

 On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 I have used weka machine learning library for generating a model for my
 training set. I have used the PART algorithm (decision lists) from weka.

 Now, I would like to use spark ML for the PART algo for my training set
 and could not seem to find a parallel. Could anyone point out the
 corresponding algorithm or even if its available in Spark ML?

 Thanks,
 Sateesh






Streaming data + Blocked Model

2015-05-28 Thread Debasish Das
Hi,

We want to keep the model created and loaded in memory through the Spark batch
context, since blocked matrix operations are required to optimize the runtime.

The data is streamed in through Kafka / raw sockets and the Spark Streaming
Context. We want to run some prediction operations with the streaming data
and the model loaded in memory through the batch context.

Do I need to open up an API on top of the batch context, or is it possible to
use an RDD created by the batch context through the streaming context?

Most likely not, since the streaming context and the batch context can't exist
in the same Spark job, but I am curious.

If I have to open up an API, does it make sense to come up with a generic
serving API for MLlib and let all MLlib algorithms expose a serving API?
The API can be spawned using Spark's actor system itself, especially since
Spray is merging into akka-http and Akka is a dependency in Spark already.

Maybe it's not a good idea since it needs maintaining another actor system
for the API.

Thanks.
Deb
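One pattern that may answer part of this (a sketch under the assumption that the StreamingContext is created from the same SparkContext that loaded the model; modelPath, the host and the port are placeholders): since a StreamingContext wraps a SparkContext, a model loaded on the batch side can be referenced directly inside DStream operations.

from pyspark.streaming import StreamingContext
from pyspark.mllib.recommendation import MatrixFactorizationModel

model = MatrixFactorizationModel.load(sc, modelPath)   # loaded via the batch context

ssc = StreamingContext(sc, 5)                          # same SparkContext underneath
lines = ssc.socketTextStream("localhost", 9999)        # "user item" pairs, one per line

def score(rdd):
    pairs = rdd.map(lambda l: tuple(int(x) for x in l.split()))
    if not pairs.isEmpty():
        print(model.predictAll(pairs).take(10))

lines.foreachRDD(score)
ssc.start()
ssc.awaitTermination()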


[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-05-28 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Affects Version/s: (was: 1.4.0)

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Debasish Das
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
 likely we will re-use the Gradient interfaces already defined and implement 
 LoglikelihoodLoss
 2. Constraints g(z) supported are the same as above, except that we don't 
 support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
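Restated in symbols (a paraphrase of the description above), the per-block subproblem has the form below, where g can be, for example, a box-constraint indicator, an L1 penalty, or a probability-simplex indicator:

\min_{x}\;\; \tfrac{1}{2}\, x^\top H x + c^\top x + g(x),
\qquad
g(x) \in \Big\{\, \mathbb{I}_{[lb,\, ub]}(x),\;\; \lambda \lVert x \rVert_1,\;\;
\mathbb{I}_{\{x \ge 0,\ \mathbf{1}^\top x = 1\}}(x) \,\Big\}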



Re: spark packages

2015-05-24 Thread Debasish Das
Yup, netlib-lgpl right now is activated through a profile... if we can reuse
the same idea then CSparse can also be added to Spark with an LGPL flag. But
again, as Sean said, it's tricky. Better to keep it on spark packages for
users to try.
On May 24, 2015 1:36 AM, Sean Owen so...@cloudera.com wrote:

 I dont believe we are talking about adding things to the Apache project,
 but incidentally LGPL is not OK in Apache projects either.
 On May 24, 2015 6:12 AM, DB Tsai dbt...@dbtsai.com wrote:

 I thought LGPL is okay but GPL is not okay for Apache project.

 On Saturday, May 23, 2015, Patrick Wendell pwend...@gmail.com wrote:

 Yes - spark packages can include non ASF licenses.

 On Sat, May 23, 2015 at 6:16 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  Is it possible to add GPL/LGPL code on spark packages or it must be
 licensed
  under Apache as well ?
 
  I want to expose Professor Tim Davis's LGPL library for sparse algebra
 and
  ECOS GPL library through the package.
 
  Thanks.
  Deb

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



 --
 Sent from my iPhone




Re: Kryo option changed

2015-05-24 Thread Debasish Das
I am on the May 3rd commit:

commit 49549d5a1a867c3ba25f5e4aec351d4102444bc0

Author: WangTaoTheTonic wangtao...@huawei.com

Date:   Sun May 3 00:47:47 2015 +0100


[SPARK-7031] [THRIFTSERVER] let thrift server take SPARK_DAEMON_MEMORY
and SPARK_DAEMON_JAVA_OPTS

On Sat, May 23, 2015 at 7:54 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which commit of master are you building off?  It looks like there was a
 bugfix for an issue related to KryoSerializer buffer configuration:
 https://github.com/apache/spark/pull/5934

 That patch was committed two weeks ago, but you mentioned that you're
 building off a newer version of master.  Could you confirm the commit that
 you're running?  If this used to work but now throws an error, then this is
 a regression that should be fixed; we shouldn't require you to perform a mb
 - kb conversion to work around this.

 On Sat, May 23, 2015 at 6:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 Pardon me.

 Please use '8192k'

 Cheers

 On Sat, May 23, 2015 at 6:24 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Tried 8mb...still I am failing on the same error...

 On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. it shuld be 8mb

 Please use the above syntax.

 Cheers

 On Sat, May 23, 2015 at 6:04 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 Hi,

 I am on last week's master but all the examples that set up the
 following

 .set("spark.kryoserializer.buffer", "8m")

 are failing with the following error:

 Exception in thread "main" java.lang.IllegalArgumentException:
 spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
 It looks like buffer.mb is deprecated... Is "8m" not the right syntax
 to get an 8mb kryo buffer, or should it be "8mb"?

 Thanks.
 Deb








Kryo option changed

2015-05-23 Thread Debasish Das
Hi,

I am on last week's master but all the examples that set up the following

.set("spark.kryoserializer.buffer", "8m")

are failing with the following error:

Exception in thread "main" java.lang.IllegalArgumentException:
spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
It looks like buffer.mb is deprecated... Is "8m" not the right syntax to get
an 8mb kryo buffer, or should it be "8mb"?

Thanks.
Deb
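For reference, a minimal conf sketch of the setting in question (the app name is a placeholder); the intent is an 8 MB initial Kryo buffer, expressed with the size-string syntax:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-example")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer", "8m")        # initial buffer, 8 MB intended
        .set("spark.kryoserializer.buffer.max", "64m"))  # optional upper bound
sc = SparkContext(conf=conf)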


spark packages

2015-05-23 Thread Debasish Das
Hi,

Is it possible to add GPL/LGPL code on spark packages, or must it be
licensed under Apache as well?

I want to expose Professor Tim Davis's LGPL library for sparse algebra and
the ECOS GPL library through the package.

Thanks.
Deb


Re: Kryo option changed

2015-05-23 Thread Debasish Das
Tried 8mb...still I am failing on the same error...

On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. it shuld be 8mb

 Please use the above syntax.

 Cheers

 On Sat, May 23, 2015 at 6:04 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 I am on last week's master but all the examples that set up the following

 .set("spark.kryoserializer.buffer", "8m")

 are failing with the following error:

 Exception in thread "main" java.lang.IllegalArgumentException:
 spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
 It looks like buffer.mb is deprecated... Is "8m" not the right syntax to
 get an 8mb kryo buffer, or should it be "8mb"?

 Thanks.
 Deb





[jira] [Updated] (SPARK-4823) rowSimilarities

2015-05-23 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-4823:

Attachment: MovieLensSimilarity Comparisons.pdf

The attached file shows the runtime comparison of the row- and column-based flows on 
all items from the MovieLens dataset on my local MacBook with 8 cores, 1 GB driver, 
and 4 GB executor memory.

1e-2 is the threshold that is set for both the row-based kernel flow and the 
column-based DIMSUM flow.

Stages 24 - 35 are the row similarity flow. Total runtime ~ 20 s

Stage 64 is the column similarity mapPartitions. Total runtime ~ 4.6 mins

This shows the power of blocking in Spark, and I have not yet gone to gemv, which 
will decrease the runtime further.

I updated the driver code in examples.mllib.MovieLensSimilarity.




 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
 Attachments: MovieLensSimilarity Comparisons.pdf


 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This is JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows ( 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-05-23 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557416#comment-14557416
 ] 

Debasish Das commented on SPARK-2426:
-

[~mengxr] Should I add the PR to spark packages and close the JIRA? The main 
contribution was to add sparsity constraints (L1 and probability simplex) to 
user and product factors in implicit and explicit feedback factorization, and 
interested users can use the features from spark packages if they need them... 
Later, if there is community interest, we can pull it into master ALS?

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Power iteration clustering

2015-05-23 Thread Debasish Das
Hi,

What was the motivation to write power iteration clustering using GraphX
and not a vector-matrix multiplication over the similarity matrix represented
as, say, a coordinate matrix?

We can use gemv in that flow to block the computation.

With GraphX, can we do the computation of all k eigenvectors together? I don't
see that in a vector-matrix multiply flow. On the other side, the vector-matrix
multiply flow is generic for kernel regression or classification flows.

Thanks.
Deb
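To sketch the vector-matrix multiply flow referred to above (a rough illustration of plain power iteration, not PIC itself; entries is assumed to hold the normalized affinity matrix as (i, j, w) triples and v0 a small initial vector that fits on the driver):

import numpy as np

def powerIterate(entries, v0, iterations=20):
    # entries: RDD[(i, j, w)]; v0: dict {index: value}
    v = v0
    for _ in range(iterations):
        vBc = sc.broadcast(v)
        # (W v)_i = sum_j w_ij * v_j, computed as one map + reduceByKey pass
        newV = (entries
                .map(lambda e: (e[0], e[2] * vBc.value.get(e[1], 0.0)))
                .reduceByKey(lambda a, b: a + b)
                .collectAsMap())
        norm = np.sqrt(sum(x * x for x in newV.values()))
        v = {i: x / norm for i, x in newV.items()}
    return v

Stacking k vectors column-wise would turn the broadcast value into a small matrix and the per-entry multiply into gemv/gemm-sized blocks, which is the blocking the question refers to.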

