Hi Vibhor,
We worked on a project to create Lucene indexes using Spark, but the project
has not been maintained for some time now. If there is interest we can
resurrect it.
The open source implementation of Dremel is Parquet!
On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta
wrote:
> Hi,
>
> why not just use dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <
> lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the
Hi,
ECOS is a solver for second-order cone programs (SOCPs) and we showed the Spark
integration at the 2014 Spark Summit:
https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/.
Right now the examples show how to reformulate matrix factorization as a
SOCP and solve
You can run l
On May 15, 2017 3:29 PM, "Nipun Arora" wrote:
> Thanks all for your response. I will have a look at them.
>
> Nipun
>
> On Sat, May 13, 2017 at 2:38 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> It's in scala but it should be
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well...check out SPARK-4823...you can compare
quality with the approximate variant...
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote:
> Hi everyone,
> Since spark 2.1.0
y to call predict on single vector.
> There is no API exposed. It is WIP but not yet released.
>
> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> If we expose an API to access the raw models out of PipelineModel can't
>> we call predict direc
, graph and kernel models we use a lot, and for them it turned out that
mllib-style model predict was useful if we change the underlying store...
On Feb 4, 2017 9:37 AM, "Debasish Das" <debasish.da...@gmail.com> wrote:
> If we expose an API to access the raw models out of PipelineMo
res to score through spark.ml.Model
>predict API". The predict API is in the old mllib package not the new ml
>package.
>- "why r we using dataframe and not the ML model directly from API" -
>Because as of now the new ml package does not have the direct API.
I am not sure why I would use a pipeline to do scoring...the idea is to build a
model, use the model ser/deser feature to put it in the row or column store of
choice, and provide API access to the model...we support these primitives
in github.com/Verizon/trapezium...the api has access to spark context in
You may want to pull up the release/1.2 branch and 1.2.0 tag to build it
yourself in case the packages are not available.
On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim"
wrote:
> Hi Ayan,
>
> Thanks a million.
>
> Regards,
> _
> *Md. Rezaul
> gives an idea. Is it possible to make this more efficient? I don't want to
>> use probabilistic functions, and I will cache the matrix because many
>> distances are looked up at the matrix, computing them on demand would
>> require far more computations.
>>
>>
Simultaneous actions work fine on a cluster if they are independent...on
local mode I never paid attention, but the code path should be similar...
On Jan 18, 2016 8:00 AM, "Koert Kuipers" wrote:
> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom
to add. You can add an issue in
breeze for the enhancement.
Alternatively you can use the breeze lpsolver as well, which uses the simplex
from apache math.
On Nov 4, 2015 1:05 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com> wrote:
> Hi Debasish Das,
>
> Firstly I must show my deep appreciat
>
> On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Use breeze simplex which in turn uses apache maths simplex...if you want
> to
> > use interior point method you can use ecos
> > https://github.com/embotech/ecos-java-scala ...
Use breeze simplex, which in turn uses Apache Math's simplex...if you want to
use an interior point method you can use ecos
https://github.com/embotech/ecos-java-scala ...the Spark Summit 2014 talk on the
quadratic solver in matrix factorization will show you an example integration
with Spark. ecos runs as JNI
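For reference, a minimal sketch of calling Apache Commons Math's simplex solver directly (the solver that breeze wraps, per the note above); the objective and constraints below are a made-up toy problem:

import org.apache.commons.math3.optim.MaxIter
import org.apache.commons.math3.optim.linear.{LinearConstraint, LinearConstraintSet, LinearObjectiveFunction, NonNegativeConstraint, Relationship, SimplexSolver}
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType

// Toy LP: minimize 2x + 3y subject to x + y >= 1 and x - y <= 2, with x, y >= 0
val objective = new LinearObjectiveFunction(Array(2.0, 3.0), 0.0)
val constraints = new LinearConstraintSet(
  new LinearConstraint(Array(1.0, 1.0), Relationship.GEQ, 1.0),
  new LinearConstraint(Array(1.0, -1.0), Relationship.LEQ, 2.0)
)
val solution = new SimplexSolver().optimize(
  new MaxIter(100), objective, constraints, GoalType.MINIMIZE, new NonNegativeConstraint(true))
println(solution.getPoint.mkString(", ") + " objective = " + solution.getValue)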
You can run 2 threads in the driver and Spark will FIFO schedule the 2 jobs on
the same spark context you created (executors and cores)...the same idea is
used for the Spark SQL thriftserver flow...
For streaming I think it lets you run only one stream at a time even if you
run them on multiple threads on
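A minimal sketch of the two-threads-on-one-SparkContext idea mentioned above (the input paths are hypothetical); each thread triggers its own action and the shared context schedules both jobs:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("two-jobs"))

val rddA = sc.textFile("/data/a").cache()   // hypothetical inputs
val rddB = sc.textFile("/data/b").cache()

// Each thread submits an independent action; the driver runs two jobs on the
// same SparkContext and the scheduler (FIFO by default) shares the executors.
val t1 = new Thread(new Runnable { def run(): Unit = println("A count: " + rddA.count()) })
val t2 = new Thread(new Runnable { def run(): Unit = println("B count: " + rddB.count()) })
t1.start(); t2.start()
t1.join(); t2.join()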
Not sure about dropout, but if you change the solver from breeze BFGS to breeze
OWLQN or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1
regularization, which will yield elastic-net-style sparse solutions...using
that you can clean up edges which have 0.0 as weight...
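A rough sketch of swapping in breeze's OWLQN for an L1-regularized objective; the quadratic loss below is only a placeholder for the actual ANN loss, and all constants are arbitrary:

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

// Placeholder objective 0.5 * ||w - 1||^2 standing in for the real ANN loss/gradient.
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = w - DenseVector.ones[Double](w.length)
    (0.5 * (diff dot diff), diff)
  }
}

// maxIter = 100, memory = 10, L1 regularization = 0.1 (arbitrary values)
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, 0.1)
val wOpt = owlqn.minimize(f, DenseVector.zeros[Double](5))
// Weights pushed to exactly 0.0 by the L1 penalty can then be pruned.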
On Sep 7, 2015 7:35
, the access path is as follows:
Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer -> Astro
Optimizer -> HBase Scans/Gets -> … -> HBase Region server
Regards,
Yan
*From:* Debasish Das [mailto:debasish.da...@gmail.com]
*Sent:* Monday, July 27, 2015 10:02 PM
*To:* Yan Zhou.sc
Hi Yan,
Is it possible to access the hbase table through spark sql jdbc layer ?
Thanks.
Deb
On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Yes, but not all SQL-standard insert variants .
*From:* Debasish Das [mailto:debasish.da...@gmail.com]
*Sent:* Wednesday, July 22
Does it also support insert operations ?
On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:
We are happy to announce the availability of the Spark SQL on HBase
1.0.0 release.
http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this
What do you need in SparkR that mllib / ml don't have...most of the basic
analysis that you need on a stream can be done through mllib components...
On Jul 13, 2015 2:35 PM, Feynman Liang fli...@databricks.com wrote:
Sorry; I think I may have used poor wording. SparkR will let you use R to
How do you manage the spark context elastically when your load grows from
1000 users to 1 users ?
On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:
I have almost the same case. I will tell you what I am actually doing, if
it
is according to your requirement,
how far
it can be pushed.
Thanks for your help!
-- Eric
On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das debasish.da...@gmail.com
wrote:
I got good runtime improvement from hive partitioning, caching the
dataset and increasing the cores through repartition...I think for your
case
I got good runtime improvement from hive partitioning, caching the dataset
and increasing the cores through repartition...I think for your case
generating mysql-style indexing will help further...it is not supported in
spark sql yet...
I know the dataset might be too big for 1 node mysql but do
Model sizes are 10m x rank, 100k x rank range.
For recommendation/topic modeling I can run batch recommendAll and then
keep serving the model using a distributed cache but then I can't
incorporate per user model re-predict if user feedback is making the
current topk stale. I have to wait for next
and reload factors from S3 periodically.
We then use Elasticsearch to post-filter results and blend content-based
stuff - which I think might be more efficient than SparkSQL for this
particular purpose.
On Wed, Jun 24, 2015 at 8:59 AM, Debasish Das debasish.da...@gmail.com
wrote:
Model sizes
engine probably doesn't matter at all in comparison.
On Sat, Jun 20, 2015, 9:40 PM Debasish Das debasish.da...@gmail.com wrote:
After getting used to Scala, writing Java is too much work :-)
I am looking for a scala-based project that's using netty at its core (spray
is one example
Hi,
The demo of end-to-end ML pipeline including the model server component at
Spark Summit was really cool.
I was wondering if the Model Server component is based upon Velox or if it
uses a completely different architecture.
https://github.com/amplab/velox-modelserver
We are looking for an open
Integration of model server with ML pipeline API.
On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto don...@prediction.io wrote:
Mind if I ask what 1.3/1.4 ML features that you are looking for?
On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote:
After getting used to Scala
charles.ce...@gmail.com
wrote:
Is velox NOT open source?
On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
The demo of end-to-end ML pipeline including the model server component
at Spark Summit was really cool.
I was wondering if the Model Server component is based
Also not sure how threading helps here because Spark puts a partition on
each core. On each core maybe there are multiple threads if you are using
Intel hyperthreading, but I will let Spark handle the threading.
On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:
We
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
dgemm based calculation.
On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:
Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share.
Best
Ayman
On Jun
Also in my experiments, it's much faster to blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:
https://issues.apache.org/jira/browse/SPARK-4823
On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:
Also not sure how
Running L1 and picking the non-zero coefficients gives a good estimate of
interesting features as well...
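As an illustration of that L1 route, a minimal sketch using the spark.ml LogisticRegression (the regularization values are arbitrary, and trainingDf is an assumed DataFrame with "label" and "features" columns):

import org.apache.spark.ml.classification.LogisticRegression

// elasticNetParam = 1.0 is a pure L1 penalty; regParam controls how sparse the fit is.
val lr = new LogisticRegression()
  .setElasticNetParam(1.0)
  .setRegParam(0.01)

val model = lr.fit(trainingDf)

// Non-zero coefficients (called weights in older releases) mark the interesting features.
val selected = model.coefficients.toArray.zipWithIndex.filter { case (w, _) => w != 0.0 }
selected.foreach { case (w, i) => println(s"feature $i weight $w") }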
On Jun 17, 2015 4:51 PM, Xiangrui Meng men...@gmail.com wrote:
We don't have it in MLlib. The closest would be the ChiSqSelector,
which works for categorical data. -Xiangrui
On Thu, Jun 11,
It's always better to use a quasi-Newton solver if the runtime and problem
scale permit, as there are guarantees on optimization...OWLQN and BFGS are
both quasi-Newton.
Most single-node code bases will run quasi-Newton solves...if you are
using SGD it is better to use AdaDelta/AdaGrad or similar
What is a decision list? An in-order traversal (or some other traversal) of a
fitted decision tree?
On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:
Is there an existing way in SparkML to convert a decision tree to a
decision list?
On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh
SparkSQL was built to further improve upon the Hive on Spark runtime...
On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk
guoqing0...@yahoo.com.hk wrote:
Hive on Spark and SparkSQL which should be better , and what are the key
characteristics and the advantages and the disadvantages
The batch version of this is part of the rowSimilarities JIRA SPARK-4823...if your
query points can fit in memory there is a broadcast version which we are
experimenting with internally...we are using brute-force KNN right now in
the PR...based on the flann paper LSH did not work well but before you go to
Cross join shuffle space might not be needed since most likely through
application-specific logic (topK etc) you can cut the shuffle space...Also
most likely the brute-force approach will be a benchmark tool to see how much
better your clustering-based KNN solution is, since there are several ways
you
, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am benchmarking row vs col similarity flow on 60M x 10M matrices...
Details are in this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823
For testing I am using Netflix data since the structure is very similar:
50k x 17K near dense
Hi,
I have some code that creates ~80 RDDs and then a sc.union is applied to
combine all 80 into one for the next step (to run topByKey for example)...
While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3
hrs (I am validating these numbers)...
Is there any checkpoint
I have a version that works well for Netflix data but now I am validating
on internal datasets...this code will work on matrix factors and sparse
matrices that have rows = 100*columns...if columns are much smaller than
rows then the col-based flow works well...basically we need both flows...
I did
sorted list by using a priority queue and dequeuing top N
values.
In the end, I get a record for each segment with N max values for each
segment.
Regards,
Aung
On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com
wrote:
In that case you can directly use count-min
You can do it in-memory as well...get the top 10% of elements from each
partition and merge them with any sort algorithm like timsort...basically
aggregateBy.
Your version uses shuffle but this version is 0 shuffle...assuming your data
set is cached you will be using in-memory allReduce through
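A minimal sketch of the per-partition top-K idea on an RDD[Double] (k is arbitrary): each partition keeps only its own top slice, and the small per-partition results are merged on the driver with no shuffle of the full data set:

// Keep the top k of each partition, then merge the small per-partition results
// on the driver; the full data set is never shuffled.
def topK(rdd: org.apache.spark.rdd.RDD[Double], k: Int): Array[Double] =
  rdd.mapPartitions { it =>
    it.toArray.sorted(Ordering[Double].reverse).take(k).iterator
  }.collect().sorted(Ordering[Double].reverse).take(k)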
for your suggestions. In-memory version is quite useful. I do not
quite understand how you can use aggregateBy to get 10% top K elements. Can
you please give an example?
Thanks,
Aung
On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
wrote:
You can do it in-memory as well
There is also a batch prediction API in PR
https://github.com/apache/spark/pull/3098
The idea here is what Sean said...don't try to reconstruct the whole matrix,
which will be dense, but pick a set of users and calculate topk
recommendations for them using dense level-3 BLAS...we are going to merge
Column-based similarities work well if the number of columns is mild (10K, 100K; we
actually scaled it to 1.5M columns but it really stress-tests the shuffle
and you need to tune the shuffle parameters)...You can either use DIMSUM
sampling or come up with your own threshold based on your application that
to use DIMSUM. Try to increase the threshold and see
whether it helps. -Xiangrui
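For reference, the thresholded DIMSUM variant is exposed on RowMatrix; a minimal sketch (the rows and threshold value here are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0)
))
val mat = new RowMatrix(rows)

// Exact all-pairs cosine similarities (brute force)
val exact = mat.columnSimilarities()

// DIMSUM sampling: similarities below the threshold may be dropped or estimated,
// which cuts the shuffle at the cost of some accuracy.
val approx = mat.columnSimilarities(threshold = 0.5)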
On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am running brute force similarity from RowMatrix on a job with 5M x
1.5M
sparse matrix with 800M entries. With 200M
with 1.5m columns, because the output can potentially have 2.25 x
10^12 entries, which is a lot (1.5m squared).
Best,
Reza
On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das debasish.da...@gmail.com
wrote:
Is the threshold valid only for tall skinny matrices ? Mine is 6 m x 1.5
m and I made
that the key would be filtered.
And then after, run a flatMap or something to make Option[B] into B.
On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Before I send out the keys for network shuffle, in reduceByKey after map
+
combine are done, I would like
Hi,
Before I send out the keys for network shuffle, in reduceByKey after map +
combine are done, I would like to filter the keys based on some threshold...
Is there a way to get the key, value after map+combine stages so that I can
run a filter on the keys ?
Thanks.
Deb
partitions and apply your
filtering. Then you can finish with a reduceByKey.
On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Before I send out the keys for network shuffle, in reduceByKey after map
+ combine are done, I would like to filter the keys based
by GC pause. Did you check the GC time in the Spark
UI? -Xiangrui
On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am sometimes getting WARN from running Similarity calculation:
15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager
Hi,
I am running brute force similarity from RowMatrix on a job with 5M x 1.5M
sparse matrix with 800M entries. With 200M entries the job run fine but
with 800M I am getting exceptions like too many files open and no space
left on device...
Seems like I need more nodes or use dimsum sampling ?
Hi,
I am sometimes getting WARN from running Similarity calculation:
15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms
exceeds 45000ms
Do I need to increase the default 45 s to larger values for cases
...
Neither play nor spray is being used in Spark right now...so it brings new
dependencies, and we already know about the akka conflicts...the thriftserver on
the other hand is already integrated for JDBC access
On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das debasish.da...@gmail.com
wrote:
Also I wanted
Hi Michael,
I want to cache an RDD and define get() and set() operators on it. Basically
like memcached. Is it possible to build a memcached-like distributed cache
using Spark SQL? If not, what do you suggest we should use for such
operations...
Thanks.
Deb
On Fri, Jul 18, 2014 at 1:00 PM,
-indexedrdd
On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Michael,
I want to cache an RDD and define get() and set() operators on it.
Basically like memcached. Is it possible to build a memcached-like
distributed cache using Spark SQL ? If not what do you
PM, Debasish Das debasish.da...@gmail.com
wrote:
Thanks...this is what I was looking for...
It will be great if Ankur can give brief details about it...Basically how
does it contrast with memcached for example...
On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com
wrote
Hi Dib,
For our usecase I want my spark job1 to read from hdfs/cache and write to
kafka queues. Similarly spark job2 should read from kafka queues and write
to kafka queues.
Is writing to kafka queues from spark job supported in your code ?
Thanks
Deb
On Jan 15, 2015 11:21 PM, Akhil Das
If you have a tall-and-skinny matrix of m users and n products, column
similarity will give you an n x n matrix (product x product matrix)...this
is also called the product correlation matrix...it can be cosine, Pearson or
other kinds of correlations...Note that if the entry is unobserved (user
Joanary did
Hi Bui,
Please use BFGS-based solvers...For BFGS you don't have to specify a step
size since the line search will find sufficient decrease each time...
For regularization you still have to do a grid search...it's not possible to
automate that but on master you will find nice ways to automate grid
Apriori can be thought of as post-processing on a product similarity graph...I
call it product similarity, but for each product you build a node which
keeps the distinct users visiting the product, and two product nodes are
connected by an edge if the intersection is > 0...you are assuming if no one
user
rdd.top collects it on the master...
If you want topk for a key, run map / mapPartitions and use a bounded
priority queue, and reduceByKey the queues.
I experimented with topk from algebird and a bounded priority queue wrapped
over a java PriorityQueue (the spark default)...bpq is faster
Code example is here:
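The link to the original example is not preserved in this archive; as a stand-in, a minimal sketch of bounded top-k per key using aggregateByKey (k and the sample data are made up, and a real version would use a bounded priority queue instead of re-sorting a small list):

val k = 2
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 5.0), ("a", 3.0), ("b", 2.0), ("b", 7.0)))

// Keep at most k values per key; the merge step keeps the combined top k.
val topKPerKey = pairs.aggregateByKey(List.empty[Double])(
  (acc, v) => (v :: acc).sorted(Ordering[Double].reverse).take(k),
  (a, b) => (a ++ b).sorted(Ordering[Double].reverse).take(k)
)
topKPerKey.collect().foreach { case (key, vs) => println(key + " -> " + vs.mkString(", ")) }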
I have used breeze fine with scala shell:
scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.
groupByKey does not run a combiner so be careful about the
performance...groupByKey does shuffle even for local groups...
reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can have a key-to-closure map that you
can broadcast and use it in
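A small sketch of that broadcast key-to-function idea (keys and functions are made up). Since reduceByKey's function never sees the key, the key is carried inside the value; the map-side combiner still runs and applies the per-key closure:

// Hypothetical per-key reduce functions, broadcast so every combiner can see them.
val perKeyOp: Map[String, (Double, Double) => Double] = Map(
  "sum" -> ((a: Double, b: Double) => a + b),
  "max" -> ((a: Double, b: Double) => math.max(a, b))
)
val opsBc = sc.broadcast(perKeyOp)

val data = sc.parallelize(Seq(("sum", 1.0), ("sum", 2.0), ("max", 3.0), ("max", 7.0)))

val reduced = data
  .map { case (key, v) => (key, (key, v)) }                        // carry the key in the value
  .reduceByKey { (x, y) => (x._1, opsBc.value(x._1)(x._2, y._2)) } // per-key closure, combiner still runs
  .mapValues(_._2)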
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...
Normally these operations are used for dictionary building and so I am
hoping you can cache the dictionary of
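A small sketch of the cache-then-zipWithIndex pattern for dictionary building (the path is hypothetical):

// Cache before zipWithIndex so re-evaluation cannot reorder elements and change the IDs.
val terms = sc.textFile("/data/products.txt").distinct().cache()   // hypothetical input

val dictionary = terms.zipWithIndex()   // RDD[(String, Long)]

// Persist the zipped RDD as well if the IDs have to stay stable across later jobs.
dictionary.cache()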
I run my Spark on YARN jobs as:
HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit
--master yarn --jars test-job.jar --executor-cores 4 --num-executors 10
--executor-memory 16g --driver-memory 4g --class TestClass test.jar
It uses HADOOP_CONF_DIR to schedule executors and
only if output RDD is expected to be
partitioned by some key.
RDD[X].flatMap(X => RDD[Y])
If it has to shuffle it should be local.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das
Hi,
If I look inside algebird Monoid implementation it uses
java.io.Serializable...
But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't
see a KryoRegistrator for CMS and HLL monoid...
In these examples will we run with Kryo serialization on CMS and HLL, or
will they be java
Hi,
I am doing a flatMap followed by mapPartitions to do some blocked
operation...flatMap is shuffling data but this shuffle is strictly
shuffling to disk and not over the network right ?
Thanks.
Deb
userFeatures.lookup(user).head to
work ?
On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
Was user presented in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da
if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
but
the code fails on userFeatures.lookup(user).head
In computeRmse
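A minimal sketch of guarding the lookup before calling predict, following the suggestion above to return NaN for users (or products) missing from training:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def safePredict(model: MatrixFactorizationModel, user: Int, product: Int): Double = {
  // lookup returns an empty Seq when the id was not present in training,
  // so calling .head would throw; return NaN instead.
  if (model.userFeatures.lookup(user).isEmpty ||
      model.productFeatures.lookup(product).isEmpty) Double.NaN
  else model.predict(user, product)
}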
Hi,
I just built the master today and I was testing the IR metrics (MAP and
prec@k) on Movielens data to establish a baseline...
I am getting a weird error which I have not seen before:
MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example
mllib.MovieLensALS --kryo --lambda 0.065
:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote:
Debasish Das writes:
If the SVM is not already migrated to BFGS, that's the first thing you should
try...Basically, following the LBFGS Logistic Regression, come up with an
LBFGS-based linear SVM...
About integrating TRON in mllib, David
If the SVM is not already migrated to BFGS, that's the first thing you
should try...Basically, following the LBFGS Logistic Regression, come up with
an LBFGS-based linear SVM...
About integrating TRON in mllib, David already has a version of TRON in
breeze but someone needs to validate it for linear SVM
@dbtsai for the condition number what did you use? Diagonal preconditioning of
the inverse of the B matrix? But then the B matrix keeps on changing...did you
condition it after every few iterations?
Will it be possible to put that code in Breeze since it will be very useful
to condition other solvers as
Hi Martin,
This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;]
and x is what you want to find...B is 0 in this case...For mllib normally
this is the label...basically create a LabeledPoint where the label is always 0...
Use mllib's linear regression and solve the following
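A rough sketch of that formulation: each row of A becomes a feature vector with label 0 (the rows below are toy values standing in for the actual matrix, and the iteration count is arbitrary):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Each row of A becomes a LabeledPoint whose label is always 0.
val rows = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0))
))

val model = LinearRegressionWithSGD.train(rows, 100)   // 100 iterations
println(model.weights)   // the fitted x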
the architecture. It has all the things you are thinking
about:)
Thanks,
Jayant
On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Is someone working on a project on integrating Oryx model serving layer
with Spark ? Models will be built using either Streaming
Hi,
Is someone working on a project on integrating Oryx model serving layer
with Spark? Models will be built using either Streaming data / Batch data
in HDFS and cross-validated with mllib APIs, but the model serving layer
will give API endpoints like Oryx
and read the models may be from
Awesome news Matei !
Congratulations to the databricks team and all the community members...
On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some
pretty cool news for the project, which
I have faced this in the past and I had to put a profile -Phadoop-2.3...
mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install
On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote:
Hi:
I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple
Another rule of thumb: definitely cache the RDD over which you need
to do iterative analysis...
For the rest of them, only cache if you have a lot of free memory!
On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote:
I think you mean that data2 is a function of data1 in the
Hi,
We write the output of models and other information as parquet files and
later we let data APIs run SQL queries on the columnar data...
SparkSQL is used to dump the data in parquet format and now we are
considering whether to use SparkSQL or Impala to read it back...
I came across this
Can't you extend a class in place of an object, which can be generic?
class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] {
  // AccumulatorParam needs a zero value and an in-place merge
  def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
  def addInPlace(r1: Seq[B], r2: Seq[B]): Seq[B] = r1 ++ r2
}
On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com
wrote:
Just realized that, of course, objects can't be generic, but how do I
create a
If the missing values are 0, then you can also look into implicit
formulation...
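For reference, a minimal sketch of the implicit formulation in MLlib (the ratings and parameter values are made up): observed entries are treated as confidence-weighted implicit feedback and missing entries are implicitly zero rather than imputed:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(Rating(1, 10, 3.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0)))
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)   // rank, iterations, lambda, alpha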
On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote:
We don't handle missing value imputation in the current version of
MLlib. In future releases, we can store feature information in the
If the tree is too big, build it on GraphX...but it will need thorough
analysis so that the partitions are well balanced...
On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote:
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the
Only fit the data in memory where you want to run the iterative
algorithm...
For map-reduce operations, it's better not to cache if you have a memory
crunch...
Also schedule the persist and unpersist such that you utilize the RAM
well...
On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei
HBase regionservers need to be balanced...you might have some skewness in
row keys and one regionserver is under pressure...try finding that key and
replicate it using a random salt
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge] to
On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com
wrote:
HBase regionservers need to be balanced...you might have some skewness
in row keys and one regionserver is under pressure...try finding that key
and replicate it using a random salt
On Wed, Sep 24, 2014 at 8:51 AM
affected, but may not happen to have
exhibited in your test.
On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com
wrote:
Some more debugging revealed that, as Sean said, I have to keep the
dictionaries persisted till I am done with the RDD manipulation.
Thanks Sean
Hi,
I am building a dictionary of RDD[(String, Long)] and after the dictionary
is built and cached, I find the key "almonds" at value 5187 using:
rdd.filter { case (product, index) => product == "almonds" }.collect
Output:
Debug product almonds index 5187
Now I take the same dictionary and write it out as:
- zipWithIndex is being used to assign IDs. From a
recent JIRA discussion I understand this is not deterministic within a
partition so the index can be different when the RDD is reevaluated. If you
need it fixed, persist the zipped RDD on disk or in memory.
On Sep 20, 2014 8:10 PM, Debasish Das
clear from docs...
On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com
wrote:
I did not persist / cache it as I assumed zipWithIndex will preserve
order...
There is also zipWithUniqueId...I am trying that...If that also shows the
same issue, we should make it clear
We dump fairly big libsvm files to compare against liblinear/libsvm...the
following code dumps out libsvm format from a SparseVector...
def toLibSvm(features: SparseVector): String = {
  val indices = features.indices.map(_ + 1)
  val values = features.values
  indices.zip(values).mkString(
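The snippet above is cut off by the archive; a possible completion, assuming the standard libsvm "index:value" layout (label handling is left to the caller):

import org.apache.spark.mllib.linalg.SparseVector

def toLibSvm(features: SparseVector): String = {
  val indices = features.indices.map(_ + 1)   // libsvm indices are 1-based
  val values = features.values
  indices.zip(values).map { case (i, v) => s"$i:$v" }.mkString(" ")
}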
, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measures from the dot
product.
Best,
Reza
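Spelled out, the un-normalization above just multiplies back the column norms; a tiny sketch (the norms are assumed to be precomputed):

// cosine(i, j) = dot(i, j) / (norm(i) * norm(j)), so the dot product is recovered as:
def dotFromCosine(cosine: Double, normI: Double, normJ: Double): Double =
  cosine * normI * normJ
// Other similarity measures can then be derived from the dot product and the norms.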
On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Reza,
In similarColumns, it seems with cosine similarity
sc.parallelize(model.weights.toArray, blocks).top(k) will get that right?
For logistic you might want both positive and negative features...so just
pass it through a filter on abs and then pick top(k)
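A small sketch of that filter-on-abs variant (k is arbitrary, and "model" stands in for the trained linear model from this thread):

val k = 20
val weights = model.weights.toArray

// Top-k features by absolute weight, keeping strongly negative ones as well.
val topFeatures = sc.parallelize(weights.zipWithIndex)
  .top(k)(Ordering.by((p: (Double, Int)) => math.abs(p._1)))

topFeatures.foreach { case (w, i) => println(s"feature $i weight $w") }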
On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I am able to
the PR. We can add jaccard and other similarity measures in
later PRs.
In the meantime, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measures from the dot
product.
Best,
Reza
On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das
. The PR will be updated
today.
Best,
Reza
On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Reza,
Have you tested if different runs of the algorithm produce different
similarities (basically if the algorithm is deterministic) ?
This number does not look like
, Debasish Das debasish.da...@gmail.com
wrote:
Cool...can I add loadRowMatrix in your PR ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:
Hi Deb,
Did you mean to message me instead of Xiangrui?
For TS matrices, dimsum with positiveinfinity