Meetup invitation: Consensus based replication in Hadoop

2014-07-08 Thread Konstantin Boudnik
[cross-posted from hdfs-dev@hadoop, common-dev@hadoop]

We'd like to invite you to the
"Consensus based replication in Hadoop: A deep dive"
event that we are happy to hold in our San Ramon office on July 15th at noon.
We'd like to accommodate as many people as possible, but I think we are
physically limited to 30 (+/- a few), so please RSVP to this Eventbrite
invitation:

https://www.eventbrite.co.uk/e/consensus-based-replication-in-hadoop-a-deep-dive-tickets-12158236613

We'll provide pizza and beverages (feel free to express your special dietary
requirements if any).

See you soon!
With regards,
  Cos

On Wed, Jun 18, 2014 at 08:45PM, Konstantin Boudnik wrote:
> Guys,
> 
> In the last couple of weeks, we had a very good and productive initial round
> of discussions on the JIRAs. I think it is worth keeping the momentum going
> and having a more detailed conversation. For that, we'd like to host a Hadoop
> developers meetup to get into the bowels of the consensus-based coordination
> implementation for HDFS. The proposed venue is our office in San Ramon, CA.
> 
> Considering that it is already mid-week and the following one looks short
> because of the holidays, how does the week of July 7th look for y'all?
> Tuesday or Thursday look pretty good on our end.
> 
> Please chime in with your preference either here, or reach out directly to me.
> Once I have a few RSVPs I will set up an event on Eventbrite or similar.
> 
> Looking forward to your input. Regards,
>   Cos
> 
> On Thu, May 29, 2014 at 02:09PM, Konstantin Shvachko wrote:
> > Hello hadoop developers,
> > 
> > I just opened two JIRAs proposing to introduce ConsensusNode into HDFS and
> > a Coordination Engine into Hadoop Common. The latter should benefit HDFS
> > and HBase, as well as potentially other projects. See HDFS-6469 and
> > HADOOP-10641 for details.
> > The effort is based on the system we built at WANdisco with my colleagues,
> > who are glad to contribute it to Apache, as quite a few people in the
> > community expressed interest in these ideas and their potential applications.
> > 
> > We should probably keep technical discussions in the JIRAs. Here on the dev
> > list I wanted to touch base on any logistical issues / questions.
> > - First of all, any ideas and help are very much welcome.
> > - We would like to set up a meetup to discuss this if people are
> > interested. Hadoop Summit next week may be a potential time-place to meet.
> > Not sure in what form. If not, we can organize one in our San Ramon office
> > later on.
> > - The effort may take a few months depending on the contributors' schedules.
> > Would it make sense to open a branch for the ConsensusNode work?
> > - APIs and the implementation of the Coordination Engine should be fairly
> > independent, so it may be reasonable to add it directly to Hadoop Common
> > trunk.
> > 
> > Thanks,
> > --Konstantin


Not starting the Web ui in the driver

2014-07-08 Thread Usman Ghani
Is there a way to run the Spark driver program without starting the
monitoring web UI in-process? I didn't see any config setting around it.
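
(For reference, here is roughly what I'm after; "spark.ui.enabled" below is a
guess at what such a flag might look like, not a setting I've found documented:)

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical: suppress the in-process Jetty monitoring UI via config.
    val conf = new SparkConf()
      .setAppName("headless-driver")
      .set("spark.ui.enabled", "false") // illustrative flag, not confirmed
    val sc = new SparkContext(conf)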


odd test suite failures while adding functions to Catalyst

2014-07-08 Thread Will Benton
Hi all,

I was testing an addition to Catalyst today (reimplementing a Hive UDF) and ran
into some odd failures in the test suite. In particular, it seems that what
most of these have in common is that an array is spuriously reversed somewhere.
For example, the stddev tests in the HiveCompatibilitySuite all failed this
way (note the reversed synonyms list; note also that stddev isn't the function
I reimplemented):

   [info] - udf_std *** FAILED ***
   [info]   Results do not match for udf_std:
   [info]   DESCRIBE FUNCTION EXTENDED std
   [info]   == Logical Plan ==
   [info]   NativeCommand DESCRIBE FUNCTION EXTENDED std
   [info]   
   [info]   == Optimized Logical Plan ==
   [info]   NativeCommand DESCRIBE FUNCTION EXTENDED std
   [info]   
   [info]   == Physical Plan ==
   [info]   NativeCommand DESCRIBE FUNCTION EXTENDED std, [result#38637:0]
   [info]   result
   [info]   !== HIVE - 2 row(s) ==                                         == CATALYST - 2 row(s) ==
   [info]    std(x) - Returns the standard deviation of a set of numbers   std(x) - Returns the standard deviation of a set of numbers
   [info]   !Synonyms: stddev_pop, stddev                                  Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:372)

I also saw a reversed array (relative to the expected array) of Hive settings 
in HiveQuerySuite.

I'll probably be able to track down where things are going wrong after getting 
away from my desk for a bit, but I thought I'd send it out to the dev list in 
case this looks familiar to anyone.  Has anyone seen this kind of failure 
before?



thanks,
wb


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
Yeah, if one were to replace the objective function in the decision tree with
minimizing the variance of the leaf nodes, it would be a hierarchical
clusterer.
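
Roughly, that objective looks like this (illustrative Scala, not MLlib code;
shown in one dimension for simplicity):

    // Variance of the values that land in a leaf; a split is chosen to
    // minimize the size-weighted variance of the two children it creates.
    def variance(xs: Seq[Double]): Double = {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

    def weightedChildVariance(left: Seq[Double], right: Seq[Double]): Double = {
      val n = (left.size + right.size).toDouble
      (left.size / n) * variance(left) + (right.size / n) * variance(right)
    }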


On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks 
wrote:

> If you're thinking along these lines, have a look at the DecisionTree
> implementation in MLlib. It uses the same idea and is optimized to prevent
> multiple passes over the data by computing several splits at each level of
> tree building. The tradeoff is increased model state and computation per
> pass over the data, but fewer total passes and hopefully lower
> communication overheads than, say, shuffling data around that belongs to
> one cluster or another. Something like that could work here as well.
>
> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
> efficient way to implement it, though.
>
>
> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee  wrote:
>
> > No was thinking more top-down:
> >
> > assuming a distributed kmeans system already existing, recursively apply
> > the kmeans algorithm on data already partitioned by the previous level of
> > kmeans.
> >
> > I haven't been much of a fan of bottom up approaches like HAC mainly
> > because they assume there is already a distance metric for items to
> items.
> > This makes it hard to cluster new content. The distances between sibling
> > clusters is also hard to compute (if you have thrown away the similarity
> > matrix), do you count paths to same parent node if you are computing
> > distances between items in two adjacent nodes for example. It is also a
> bit
> > harder to distribute the computation for bottom up approaches as you have
> > to already find the nearest neighbor to an item to begin the process.
> >
> >
> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling  wrote:
> >
> > > The scikit-learn implementation may be of interest:
> > >
> > >
> > >
> >
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
> > >
> > > It's a bottom up approach.  The pair of clusters for merging are
> > > chosen to minimize variance.
> > >
> > > Their code is under a BSD license so it can be used as a template.
> > >
> > > Is something like that you were thinking Hector?
> > >
> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov 
> > > wrote:
> > > > sure. more interesting problem here is choosing k at each level.
> Kernel
> > > > methods seem to be most promising.
> > > >
> > > >
> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee 
> > wrote:
> > > >
> > > >> No idea, never looked it up. Always just implemented it as doing
> > k-means
> > > >> again on each cluster.
> > > >>
> > > >> FWIW standard k-means with euclidean distance has problems too with
> > some
> > > >> dimensionality reduction methods. Swapping out the distance metric
> > with
> > > >> negative dot or cosine may help.
> > > >>
> > > >> Other more useful clustering would be hierarchical SVD. The reason
> > why I
> > > >> like hierarchical clustering is it makes for faster inference
> > especially
> > > >> over billions of users.
> > > >>
> > > >>
> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov  >
> > > >> wrote:
> > > >>
> > > >> > Hector, could you share the references for hierarchical K-means?
> > > thanks.
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee 
> > > wrote:
> > > >> >
> > > >> > > I would say for bigdata applications the most useful would be
> > > >> > hierarchical
> > > >> > > k-means with back tracking and the ability to support k nearest
> > > >> > centroids.
> > > >> > >
> > > >> > >
> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  >
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > MLlib currently has one clustering algorithm implementation,
> > > KMeans.
> > > >> > > > It would benefit from having implementations of other
> clustering
> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
> Hierarchical
> > > >> > > > Clustering, and Affinity Propagation.
> > > >> > > >
> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > > implementation,
> > > >> > > > and I saw an email on this list about interest in implementing
> > > Fuzzy
> > > >> > > > C-Means.
> > > >> > > >
> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
> > became
> > > >> > > > apparent that before I implement more clustering algorithms,
> it
> > > would
> > > >> > > > be useful to hammer out a framework to reduce code duplication
> > and
> > > >> > > > implement a consistent API.
> > > >> > > >
> > > >> > > > I'd like to gauge the interest and goals of the MLlib
> community:
> > > >> > > >
> > > >> > > > 1. Are you interested in having more clustering algorithms
> > > available?
> > > >> > > >
> > > >> > > > 2. Is the community interested in specifying a common
> framework?
> > > >> > > >
> > > >> > > > Thanks!
> > > >> > > > RJ
> > > >> > > >
> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
> > >

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass over the data, but fewer total passes and hopefully lower
communication overheads than, say, shuffling data around that belongs to
one cluster or another. Something like that could work here as well.

I'm not super-familiar with hierarchical K-Means so perhaps there's a more
efficient way to implement it, though.
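
Schematically, the single-pass-per-level idea looks like this (illustrative
types and names, not the actual MLlib DecisionTree code):

    import org.apache.spark.SparkContext._ // pair-RDD operations
    import org.apache.spark.rdd.RDD

    // Sufficient statistics for one node, mergeable across partitions.
    case class Stats(count: Long, sum: Double, sumSq: Double) {
      def merge(o: Stats): Stats =
        Stats(count + o.count, sum + o.sum, sumSq + o.sumSq)
    }

    // One pass over (nodeId, label) pairs computes statistics for every node
    // at the current tree level at once, instead of one pass per node.
    def levelPass(data: RDD[(Int, Double)]): Map[Int, Stats] =
      data.map { case (node, y) => (node, Stats(1L, y, y * y)) }
          .reduceByKey(_ merge _)
          .collectAsMap()
          .toMap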


On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee  wrote:

> No was thinking more top-down:
>
> assuming a distributed kmeans system already existing, recursively apply
> the kmeans algorithm on data already partitioned by the previous level of
> kmeans.
>
> I haven't been much of a fan of bottom up approaches like HAC mainly
> because they assume there is already a distance metric for items to items.
> This makes it hard to cluster new content. The distances between sibling
> clusters is also hard to compute (if you have thrown away the similarity
> matrix), do you count paths to same parent node if you are computing
> distances between items in two adjacent nodes for example. It is also a bit
> harder to distribute the computation for bottom up approaches as you have
> to already find the nearest neighbor to an item to begin the process.
>
>
> On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling  wrote:
>
> > The scikit-learn implementation may be of interest:
> >
> >
> >
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
> >
> > It's a bottom up approach.  The pair of clusters for merging are
> > chosen to minimize variance.
> >
> > Their code is under a BSD license so it can be used as a template.
> >
> > Is something like that you were thinking Hector?
> >
> > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov 
> > wrote:
> > > sure. more interesting problem here is choosing k at each level. Kernel
> > > methods seem to be most promising.
> > >
> > >
> > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee 
> wrote:
> > >
> > >> No idea, never looked it up. Always just implemented it as doing
> k-means
> > >> again on each cluster.
> > >>
> > >> FWIW standard k-means with euclidean distance has problems too with
> some
> > >> dimensionality reduction methods. Swapping out the distance metric
> with
> > >> negative dot or cosine may help.
> > >>
> > >> Other more useful clustering would be hierarchical SVD. The reason
> why I
> > >> like hierarchical clustering is it makes for faster inference
> especially
> > >> over billions of users.
> > >>
> > >>
> > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
> > >> wrote:
> > >>
> > >> > Hector, could you share the references for hierarchical K-means?
> > thanks.
> > >> >
> > >> >
> > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee 
> > wrote:
> > >> >
> > >> > > I would say for bigdata applications the most useful would be
> > >> > hierarchical
> > >> > > k-means with back tracking and the ability to support k nearest
> > >> > centroids.
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
> > >> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > MLlib currently has one clustering algorithm implementation,
> > KMeans.
> > >> > > > It would benefit from having implementations of other clustering
> > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > >> > > > Clustering, and Affinity Propagation.
> > >> > > >
> > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > implementation,
> > >> > > > and I saw an email on this list about interest in implementing
> > Fuzzy
> > >> > > > C-Means.
> > >> > > >
> > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
> became
> > >> > > > apparent that before I implement more clustering algorithms, it
> > would
> > >> > > > be useful to hammer out a framework to reduce code duplication
> and
> > >> > > > implement a consistent API.
> > >> > > >
> > >> > > > I'd like to gauge the interest and goals of the MLlib community:
> > >> > > >
> > >> > > > 1. Are you interested in having more clustering algorithms
> > available?
> > >> > > >
> > >> > > > 2. Is the community interested in specifying a common framework?
> > >> > > >
> > >> > > > Thanks!
> > >> > > > RJ
> > >> > > >
> > >> > > > [1] - https://github.com/apache/spark/pull/1248
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > em rnowl...@gmail.com
> > >> > > > c 954.496.2314
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Yee Yang Li Hector 
> > >> > > *google.com/+HectorYee *
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Yee Yang Li Hector 
> > >> *google.com/+HectorYee *
> > >>
> >
> >
> >

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No, I was thinking more top-down:

assuming a distributed k-means system already exists, recursively apply
the k-means algorithm on data already partitioned by the previous level of
k-means.

I haven't been much of a fan of bottom-up approaches like HAC, mainly
because they assume there is already a distance metric for items to items.
This makes it hard to cluster new content. The distances between sibling
clusters are also hard to compute (if you have thrown away the similarity
matrix): do you count paths to the same parent node if you are computing
distances between items in two adjacent nodes, for example? It is also a bit
harder to distribute the computation for bottom-up approaches, as you have
to already find the nearest neighbor of an item to begin the process.
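
Something like this, schematically (an illustrative sketch against MLlib's
KMeans; k, depth, and the iteration count are whatever you pick per level):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Top-down: cluster, then recurse on each partition of the data. In
    // practice you would also record model.clusterCenters at each node.
    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Unit = {
      if (depth == 0 || data.count() < k) return
      val model = KMeans.train(data, k, 20) // 20 = max iterations
      (0 until k).foreach { c =>
        val subset = data.filter(v => model.predict(v) == c)
        hierarchicalKMeans(subset, k, depth - 1)
      }
    }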


On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling  wrote:

> The scikit-learn implementation may be of interest:
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>
> It's a bottom up approach.  The pair of clusters for merging are
> chosen to minimize variance.
>
> Their code is under a BSD license so it can be used as a template.
>
> Is something like that you were thinking Hector?
>
> On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov 
> wrote:
> > sure. more interesting problem here is choosing k at each level. Kernel
> > methods seem to be most promising.
> >
> >
> > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee  wrote:
> >
> >> No idea, never looked it up. Always just implemented it as doing k-means
> >> again on each cluster.
> >>
> >> FWIW standard k-means with euclidean distance has problems too with some
> >> dimensionality reduction methods. Swapping out the distance metric with
> >> negative dot or cosine may help.
> >>
> >> Other more useful clustering would be hierarchical SVD. The reason why I
> >> like hierarchical clustering is it makes for faster inference especially
> >> over billions of users.
> >>
> >>
> >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
> >> wrote:
> >>
> >> > Hector, could you share the references for hierarchical K-means?
> thanks.
> >> >
> >> >
> >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee 
> wrote:
> >> >
> >> > > I would say for bigdata applications the most useful would be
> >> > hierarchical
> >> > > k-means with back tracking and the ability to support k nearest
> >> > centroids.
> >> > >
> >> > >
> >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
> >> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > MLlib currently has one clustering algorithm implementation,
> KMeans.
> >> > > > It would benefit from having implementations of other clustering
> >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> >> > > > Clustering, and Affinity Propagation.
> >> > > >
> >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> implementation,
> >> > > > and I saw an email on this list about interest in implementing
> Fuzzy
> >> > > > C-Means.
> >> > > >
> >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> >> > > > apparent that before I implement more clustering algorithms, it
> would
> >> > > > be useful to hammer out a framework to reduce code duplication and
> >> > > > implement a consistent API.
> >> > > >
> >> > > > I'd like to gauge the interest and goals of the MLlib community:
> >> > > >
> >> > > > 1. Are you interested in having more clustering algorithms
> available?
> >> > > >
> >> > > > 2. Is the community interested in specifying a common framework?
> >> > > >
> >> > > > Thanks!
> >> > > > RJ
> >> > > >
> >> > > > [1] - https://github.com/apache/spark/pull/1248
> >> > > >
> >> > > >
> >> > > > --
> >> > > > em rnowl...@gmail.com
> >> > > > c 954.496.2314
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Yee Yang Li Hector 
> >> > > *google.com/+HectorYee *
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Yee Yang Li Hector 
> >> *google.com/+HectorYee *
> >>
>
>
>
> --
> em rnowl...@gmail.com
> c 954.496.2314
>



-- 
Yee Yang Li Hector 
*google.com/+HectorYee *


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
K doesn't matter much; I've tried anything from 2^10 to 10^3, and the
performance doesn't change much as measured by precision @ K (see Table 1 in
http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3
k-means did outperform 2^10 hierarchical SVD slightly in terms of the
metrics, 2^10 SVD was much faster in terms of inference time.

I found that the thing that affected performance most was adding in
backtracking to fix mistakes made at higher levels, rather than how the K is
picked per level.



On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov  wrote:

> sure. more interesting problem here is choosing k at each level. Kernel
> methods seem to be most promising.
>
>
> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee  wrote:
>
> > No idea, never looked it up. Always just implemented it as doing k-means
> > again on each cluster.
> >
> > FWIW standard k-means with euclidean distance has problems too with some
> > dimensionality reduction methods. Swapping out the distance metric with
> > negative dot or cosine may help.
> >
> > Other more useful clustering would be hierarchical SVD. The reason why I
> > like hierarchical clustering is it makes for faster inference especially
> > over billions of users.
> >
> >
> > On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
> > wrote:
> >
> > > Hector, could you share the references for hierarchical K-means?
> thanks.
> > >
> > >
> > > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee 
> wrote:
> > >
> > > > I would say for bigdata applications the most useful would be
> > > hierarchical
> > > > k-means with back tracking and the ability to support k nearest
> > > centroids.
> > > >
> > > >
> > > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > MLlib currently has one clustering algorithm implementation,
> KMeans.
> > > > > It would benefit from having implementations of other clustering
> > > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > > > Clustering, and Affinity Propagation.
> > > > >
> > > > > I recently submitted a PR [1] for a MiniBatch KMeans
> implementation,
> > > > > and I saw an email on this list about interest in implementing
> Fuzzy
> > > > > C-Means.
> > > > >
> > > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > > > apparent that before I implement more clustering algorithms, it
> would
> > > > > be useful to hammer out a framework to reduce code duplication and
> > > > > implement a consistent API.
> > > > >
> > > > > I'd like to gauge the interest and goals of the MLlib community:
> > > > >
> > > > > 1. Are you interested in having more clustering algorithms
> available?
> > > > >
> > > > > 2. Is the community interested in specifying a common framework?
> > > > >
> > > > > Thanks!
> > > > > RJ
> > > > >
> > > > > [1] - https://github.com/apache/spark/pull/1248
> > > > >
> > > > >
> > > > > --
> > > > > em rnowl...@gmail.com
> > > > > c 954.496.2314
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Yee Yang Li Hector 
> > > > *google.com/+HectorYee *
> > > >
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector 
> > *google.com/+HectorYee *
> >
>



-- 
Yee Yang Li Hector 
*google.com/+HectorYee *


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
The scikit-learn implementation may be of interest:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom up approach.  The pair of clusters for merging are
chosen to minimize variance.

Their code is under a BSD license so it can be used as a template.

Is something like that you were thinking Hector?

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov  wrote:
> sure. more interesting problem here is choosing k at each level. Kernel
> methods seem to be most promising.
>
>
> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee  wrote:
>
>> No idea, never looked it up. Always just implemented it as doing k-means
>> again on each cluster.
>>
>> FWIW standard k-means with euclidean distance has problems too with some
>> dimensionality reduction methods. Swapping out the distance metric with
>> negative dot or cosine may help.
>>
>> Other more useful clustering would be hierarchical SVD. The reason why I
>> like hierarchical clustering is it makes for faster inference especially
>> over billions of users.
>>
>>
>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
>> wrote:
>>
>> > Hector, could you share the references for hierarchical K-means? thanks.
>> >
>> >
>> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee  wrote:
>> >
>> > > I would say for bigdata applications the most useful would be
>> > hierarchical
>> > > k-means with back tracking and the ability to support k nearest
>> > centroids.
>> > >
>> > >
>> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
>> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > MLlib currently has one clustering algorithm implementation, KMeans.
>> > > > It would benefit from having implementations of other clustering
>> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
>> > > > Clustering, and Affinity Propagation.
>> > > >
>> > > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
>> > > > and I saw an email on this list about interest in implementing Fuzzy
>> > > > C-Means.
>> > > >
>> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
>> > > > apparent that before I implement more clustering algorithms, it would
>> > > > be useful to hammer out a framework to reduce code duplication and
>> > > > implement a consistent API.
>> > > >
>> > > > I'd like to gauge the interest and goals of the MLlib community:
>> > > >
>> > > > 1. Are you interested in having more clustering algorithms available?
>> > > >
>> > > > 2. Is the community interested in specifying a common framework?
>> > > >
>> > > > Thanks!
>> > > > RJ
>> > > >
>> > > > [1] - https://github.com/apache/spark/pull/1248
>> > > >
>> > > >
>> > > > --
>> > > > em rnowl...@gmail.com
>> > > > c 954.496.2314
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Yee Yang Li Hector 
>> > > *google.com/+HectorYee *
>> > >
>> >
>>
>>
>>
>> --
>> Yee Yang Li Hector 
>> *google.com/+HectorYee *
>>



-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Sure. The more interesting problem here is choosing k at each level. Kernel
methods seem to be the most promising.


On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee  wrote:

> No idea, never looked it up. Always just implemented it as doing k-means
> again on each cluster.
>
> FWIW standard k-means with euclidean distance has problems too with some
> dimensionality reduction methods. Swapping out the distance metric with
> negative dot or cosine may help.
>
> Other more useful clustering would be hierarchical SVD. The reason why I
> like hierarchical clustering is it makes for faster inference especially
> over billions of users.
>
>
> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
> wrote:
>
> > Hector, could you share the references for hierarchical K-means? thanks.
> >
> >
> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee  wrote:
> >
> > > I would say for bigdata applications the most useful would be
> > hierarchical
> > > k-means with back tracking and the ability to support k nearest
> > centroids.
> > >
> > >
> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > > It would benefit from having implementations of other clustering
> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > > Clustering, and Affinity Propagation.
> > > >
> > > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > > and I saw an email on this list about interest in implementing Fuzzy
> > > > C-Means.
> > > >
> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > > apparent that before I implement more clustering algorithms, it would
> > > > be useful to hammer out a framework to reduce code duplication and
> > > > implement a consistent API.
> > > >
> > > > I'd like to gauge the interest and goals of the MLlib community:
> > > >
> > > > 1. Are you interested in having more clustering algorithms available?
> > > >
> > > > 2. Is the community interested in specifying a common framework?
> > > >
> > > > Thanks!
> > > > RJ
> > > >
> > > > [1] - https://github.com/apache/spark/pull/1248
> > > >
> > > >
> > > > --
> > > > em rnowl...@gmail.com
> > > > c 954.496.2314
> > > >
> > >
> > >
> > >
> > > --
> > > Yee Yang Li Hector 
> > > *google.com/+HectorYee *
> > >
> >
>
>
>
> --
> Yee Yang Li Hector 
> *google.com/+HectorYee *
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No idea, never looked it up. I've always just implemented it as running
k-means again on each cluster.

FWIW, standard k-means with Euclidean distance has problems too with some
dimensionality-reduction methods. Swapping out the distance metric for
negative dot product or cosine may help.

Another, more useful clustering would be hierarchical SVD. The reason I
like hierarchical clustering is that it makes for faster inference, especially
over billions of users.
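
The swap is just the distance function, e.g. (illustrative, over plain arrays
rather than MLlib's vector types):

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    // Smaller value = "closer", matching a distance function's convention.
    def negativeDot(a: Array[Double], b: Array[Double]): Double = -dot(a, b)

    def cosineDistance(a: Array[Double], b: Array[Double]): Double =
      1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)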


On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov  wrote:

> Hector, could you share the references for hierarchical K-means? thanks.
>
>
> On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee  wrote:
>
> > I would say for bigdata applications the most useful would be
> hierarchical
> > k-means with back tracking and the ability to support k nearest
> centroids.
> >
> >
> > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  wrote:
> >
> > > Hi all,
> > >
> > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > It would benefit from having implementations of other clustering
> > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > Clustering, and Affinity Propagation.
> > >
> > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > and I saw an email on this list about interest in implementing Fuzzy
> > > C-Means.
> > >
> > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > apparent that before I implement more clustering algorithms, it would
> > > be useful to hammer out a framework to reduce code duplication and
> > > implement a consistent API.
> > >
> > > I'd like to gauge the interest and goals of the MLlib community:
> > >
> > > 1. Are you interested in having more clustering algorithms available?
> > >
> > > 2. Is the community interested in specifying a common framework?
> > >
> > > Thanks!
> > > RJ
> > >
> > > [1] - https://github.com/apache/spark/pull/1248
> > >
> > >
> > > --
> > > em rnowl...@gmail.com
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector 
> > *google.com/+HectorYee *
> >
>



-- 
Yee Yang Li Hector 
*google.com/+HectorYee *


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Hector, could you share the references for hierarchical K-means? thanks.


On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee  wrote:

> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  wrote:
>
> > Hi all,
> >
> > MLlib currently has one clustering algorithm implementation, KMeans.
> > It would benefit from having implementations of other clustering
> > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > Clustering, and Affinity Propagation.
> >
> > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > and I saw an email on this list about interest in implementing Fuzzy
> > C-Means.
> >
> > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > apparent that before I implement more clustering algorithms, it would
> > be useful to hammer out a framework to reduce code duplication and
> > implement a consistent API.
> >
> > I'd like to gauge the interest and goals of the MLlib community:
> >
> > 1. Are you interested in having more clustering algorithms available?
> >
> > 2. Is the community interested in specifying a common framework?
> >
> > Thanks!
> > RJ
> >
> > [1] - https://github.com/apache/spark/pull/1248
> >
> >
> > --
> > em rnowl...@gmail.com
> > c 954.496.2314
> >
>
>
>
> --
> Yee Yang Li Hector 
> *google.com/+HectorYee *
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me.  While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much code and APIs as
reasonable.


On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling  wrote:

> Thanks, Hector! Your feedback is useful.
>
> On Tuesday, July 8, 2014, Hector Yee  wrote:
>
> > I would say for bigdata applications the most useful would be
> hierarchical
> > k-means with back tracking and the ability to support k nearest
> centroids.
> >
> >
> > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  > > wrote:
> >
> > > Hi all,
> > >
> > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > It would benefit from having implementations of other clustering
> > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > Clustering, and Affinity Propagation.
> > >
> > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > and I saw an email on this list about interest in implementing Fuzzy
> > > C-Means.
> > >
> > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > apparent that before I implement more clustering algorithms, it would
> > > be useful to hammer out a framework to reduce code duplication and
> > > implement a consistent API.
> > >
> > > I'd like to gauge the interest and goals of the MLlib community:
> > >
> > > 1. Are you interested in having more clustering algorithms available?
> > >
> > > 2. Is the community interested in specifying a common framework?
> > >
> > > Thanks!
> > > RJ
> > >
> > > [1] - https://github.com/apache/spark/pull/1248
> > >
> > >
> > > --
> > > em rnowl...@gmail.com 
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector 
> > *google.com/+HectorYee *
> >
>
>
> --
> em rnowl...@gmail.com
> c 954.496.2314
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Thanks, Hector! Your feedback is useful.

On Tuesday, July 8, 2014, Hector Yee  wrote:

> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  > wrote:
>
> > Hi all,
> >
> > MLlib currently has one clustering algorithm implementation, KMeans.
> > It would benefit from having implementations of other clustering
> > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > Clustering, and Affinity Propagation.
> >
> > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > and I saw an email on this list about interest in implementing Fuzzy
> > C-Means.
> >
> > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > apparent that before I implement more clustering algorithms, it would
> > be useful to hammer out a framework to reduce code duplication and
> > implement a consistent API.
> >
> > I'd like to gauge the interest and goals of the MLlib community:
> >
> > 1. Are you interested in having more clustering algorithms available?
> >
> > 2. Is the community interested in specifying a common framework?
> >
> > Thanks!
> > RJ
> >
> > [1] - https://github.com/apache/spark/pull/1248
> >
> >
> > --
> > em rnowl...@gmail.com 
> > c 954.496.2314
> >
>
>
>
> --
> Yee Yang Li Hector 
> *google.com/+HectorYee *
>


-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Data Locality In Spark

2014-07-08 Thread Sandy Ryza
Hi Anish,

Spark, like MapReduce, makes an effort to schedule tasks on the same nodes
and racks that the input blocks reside on.
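
For example, you can inspect the locality the scheduler will consider
(assuming sc is your SparkContext; the HDFS path is just an illustration):

    val rdd = sc.textFile("hdfs://namenode:8020/data/events.log")
    // Each partition's preferred locations are the hosts holding the
    // corresponding HDFS block; the scheduler tries to run the task there
    // (node-local), falling back to rack-local and then any node.
    rdd.partitions.foreach { p =>
      println(p.index + " -> " + rdd.preferredLocations(p).mkString(", "))
    }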

-Sandy


On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:

> Hi All
>
> My apologies for very basic question, do we have full support of data
> locality in Spark MapReduce.
>
> Please suggest.
>
> --
> Anish Sneh
> "Experience is the best teacher."
> http://in.linkedin.com/in/anishsneh
>
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
I would say for big-data applications the most useful would be hierarchical
k-means with backtracking and the ability to support k nearest centroids.


On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  wrote:

> Hi all,
>
> MLlib currently has one clustering algorithm implementation, KMeans.
> It would benefit from having implementations of other clustering
> algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> Clustering, and Affinity Propagation.
>
> I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> and I saw an email on this list about interest in implementing Fuzzy
> C-Means.
>
> Based on Sean Owen's review of my MiniBatch KMeans code, it became
> apparent that before I implement more clustering algorithms, it would
> be useful to hammer out a framework to reduce code duplication and
> implement a consistent API.
>
> I'd like to gauge the interest and goals of the MLlib community:
>
> 1. Are you interested in having more clustering algorithms available?
>
> 2. Is the community interested in specifying a common framework?
>
> Thanks!
> RJ
>
> [1] - https://github.com/apache/spark/pull/1248
>
>
> --
> em rnowl...@gmail.com
> c 954.496.2314
>



-- 
Yee Yang Li Hector 
*google.com/+HectorYee *


Re: Cloudera's Hive on Spark vs AmpLab's Shark

2014-07-08 Thread Reynold Xin
This blog post probably clarifies a lot of things:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html




On Tue, Jul 8, 2014 at 12:24 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:

> Hi All
>
> I read somewhere that Cloudera announced Hive on Spark, since AmpLab
> already have Shark. I was trying to understand is it rebranding of Shark or
> they are planning something new altogether.
>
> Please suggest.
>
> --
> Anish Sneh
> "Experience is the best teacher."
> http://in.linkedin.com/in/anishsneh
>
>


Data Locality In Spark

2014-07-08 Thread anishs...@yahoo.co.in
Hi All

My apologies for a very basic question: do we have full support for data
locality in Spark MapReduce?

Please suggest.

-- 
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh



Cloudera's Hive on Spark vs AmpLab's Shark

2014-07-08 Thread anishs...@yahoo.co.in
Hi All

I read somewhere that Cloudera announced Hive on Spark, even though AMPLab
already has Shark. I was trying to understand: is it a rebranding of Shark, or
are they planning something new altogether?

Please suggest.

-- 
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh



Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Hi all,

MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.

I recently submitted a PR [1] for a MiniBatch KMeans implementation,
and I saw an email on this list about interest in implementing Fuzzy
C-Means.

Based on Sean Owen's review of my MiniBatch KMeans code, it became
apparent that before I implement more clustering algorithms, it would
be useful to hammer out a framework to reduce code duplication and
implement a consistent API.
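
As a strawman for what a consistent API could mean (purely illustrative,
nothing agreed on):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical common traits the clustering algorithms could share.
    trait ClusteringAlgorithm {
      def run(data: RDD[Vector]): ClusteringModel
    }

    trait ClusteringModel {
      def predict(point: Vector): Int            // hard cluster assignment
      def predict(points: RDD[Vector]): RDD[Int]
    }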

I'd like to gauge the interest and goals of the MLlib community:

1. Are you interested in having more clustering algorithms available?

2. Is the community interested in specifying a common framework?

Thanks!
RJ

[1] - https://github.com/apache/spark/pull/1248


-- 
em rnowl...@gmail.com
c 954.496.2314


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Haoyuan Li
Yes. For Shark, the two modes "shark.cache=tachyon" and "shark.cache=memory"
have the same ser/de overhead. Shark loads data from outside the process
in Tachyon mode, with the following benefits:


   - In-memory data sharing across multiple Shark instances (i.e. stronger
   isolation)
   - Instant recovery of in-memory tables
   - Reduced heap size => faster GC in Shark
   - If the table is larger than the memory size, only the hot columns will
   be cached in memory

from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and
https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
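
For example (method names as in the Shark README of the time, and cache-mode
values per the docs above; please verify against your Shark version):

    // Cache the same data under the two strategies discussed in this thread.
    val shark = SharkEnv.initWithSharkContext("tachyon-vs-memory")
    shark.sql("CREATE TABLE logs_mem TBLPROPERTIES ('shark.cache' = 'memory') " +
      "AS SELECT * FROM logs")
    shark.sql("CREATE TABLE logs_tachyon TBLPROPERTIES ('shark.cache' = 'tachyon') " +
      "AS SELECT * FROM logs")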

Haoyuan


On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson  wrote:

> Shark's in-memory format is already serialized (it's compressed and
> column-based).
>
>
> On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan 
> wrote:
>
> > You are ignoring serde costs :-)
> >
> > - Mridul
> >
> > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson 
> wrote:
> > > Tachyon should only be marginally less performant than memory_only,
> > because
> > > we mmap the data from Tachyon's ramdisk. We do not have to, say,
> transfer
> > > the data over a pipe from Tachyon; we can directly read from the
> buffers
> > in
> > > the same way that Shark reads from its in-memory columnar format.
> > >
> > >
> > >
> > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
> > > wrote:
> > >
> > >> hi, when i create a table, i can point the cache strategy using
> > >> shark.cache,
> > >> i think "shark.cache=memory_only"  means data are managed by spark,
> and
> > >> data are in the same jvm with excutor;   while  "shark.cache=tachyon"
> > >>  means  data are managed by tachyon which is off heap, and data are
> not
> > in
> > >> the same jvm with excutor,  so spark will load data from tachyon for
> > each
> > >> query sql , so,  is  tachyon less efficient than memory_only cache
> > strategy
> > >>  ?
> > >> if yes, can we let spark load all data once from tachyon  for all sql
> > query
> > >>  if i want to use tachyon cache strategy since tachyon is more HA than
> > >> memory_only ?
> > >>
> >
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Aaron Davidson
Shark's in-memory format is already serialized (it's compressed and
column-based).


On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan 
wrote:

> You are ignoring serde costs :-)
>
> - Mridul
>
> On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson  wrote:
> > Tachyon should only be marginally less performant than memory_only,
> because
> > we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
> > the data over a pipe from Tachyon; we can directly read from the buffers
> in
> > the same way that Shark reads from its in-memory columnar format.
> >
> >
> >
> > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
> > wrote:
> >
> >> hi, when i create a table, i can point the cache strategy using
> >> shark.cache,
> >> i think "shark.cache=memory_only"  means data are managed by spark, and
> >> data are in the same jvm with excutor;   while  "shark.cache=tachyon"
> >>  means  data are managed by tachyon which is off heap, and data are not
> in
> >> the same jvm with excutor,  so spark will load data from tachyon for
> each
> >> query sql , so,  is  tachyon less efficient than memory_only cache
> strategy
> >>  ?
> >> if yes, can we let spark load all data once from tachyon  for all sql
> query
> >>  if i want to use tachyon cache strategy since tachyon is more HA than
> >> memory_only ?
> >>
>


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Mridul Muralidharan
You are ignoring serde costs :-)

- Mridul

On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson  wrote:
> Tachyon should only be marginally less performant than memory_only, because
> we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
> the data over a pipe from Tachyon; we can directly read from the buffers in
> the same way that Shark reads from its in-memory columnar format.
>
>
>
> On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
> wrote:
>
>> hi, when i create a table, i can point the cache strategy using
>> shark.cache,
>> i think "shark.cache=memory_only"  means data are managed by spark, and
>> data are in the same jvm with excutor;   while  "shark.cache=tachyon"
>>  means  data are managed by tachyon which is off heap, and data are not in
>> the same jvm with excutor,  so spark will load data from tachyon for each
>> query sql , so,  is  tachyon less efficient than memory_only cache strategy
>>  ?
>> if yes, can we let spark load all data once from tachyon  for all sql query
>>  if i want to use tachyon cache strategy since tachyon is more HA than
>> memory_only ?
>>


Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Evan R. Sparks
As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write your
own version of loadLibSVMFile: a loader function very similar to the existing
one, with a few characters removed:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L81

You will also likely need to change the logic where it determines the
number of features (currently line 95).
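
A sketch of what that could look like (modeled on the linked code with the
index shift removed; untested, so treat it as a starting point):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def loadZeroBasedLibSVMFile(sc: SparkContext, path: String): RDD[LabeledPoint] = {
      val parsed = sc.textFile(path).map(_.trim).filter(!_.isEmpty).map { line =>
        val items = line.split(' ')
        val label = items.head.toDouble
        // Keep each "index:value" pair's 0-based index as-is (no "- 1").
        val pairs = items.tail.filter(_.nonEmpty).map { item =>
          val parts = item.split(':')
          (parts(0).toInt, parts(1).toDouble)
        }
        (label, pairs.map(_._1), pairs.map(_._2))
      }
      // The "line 95" logic: infer dimensionality from the max 0-based index.
      val numFeatures = parsed.map { case (_, indices, _) =>
        if (indices.isEmpty) 0 else indices.max + 1
      }.reduce(math.max)
      parsed.map { case (label, indices, values) =>
        // Assumes indices within a line are ascending, as in LIBSVM files.
        LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
      }
    }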


On Tue, Jul 8, 2014 at 12:22 AM, Sean Owen  wrote:

> On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
> zhengbing...@huawei.com> wrote:
>
> >
> > 1)  I download the imdb data from
> > http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> > LBFGS
> > 2)  I find the imdb data are zero-based-index data
> >
>
> Since the method is for parsing the LIBSVM format, and its labels are
> always 1-indexed IIUC, I don't think it would make sense to read 0-indexed
> labels. It sounds like that input is not properly formatted, unless anyone
> knows to the contrary?
>


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Aaron Davidson
Tachyon should only be marginally less performant than memory_only, because
we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
the data over a pipe from Tachyon; we can directly read from the buffers in
the same way that Shark reads from its in-memory columnar format.
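
In other words, something like this (the ramdisk path is hypothetical; this is
plain NIO, just to illustrate the zero-copy read):

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    // Map a block file from Tachyon's ramdisk directly into the JVM's
    // address space; bytes are then read without copying through a pipe.
    val file = new RandomAccessFile("/mnt/ramdisk/tachyonworker/block-0", "r")
    val buf = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length)
    val firstByte = buf.get(0)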



On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
wrote:

> hi, when i create a table, i can point the cache strategy using
> shark.cache,
> i think "shark.cache=memory_only"  means data are managed by spark, and
> data are in the same jvm with excutor;   while  "shark.cache=tachyon"
>  means  data are managed by tachyon which is off heap, and data are not in
> the same jvm with excutor,  so spark will load data from tachyon for each
> query sql , so,  is  tachyon less efficient than memory_only cache strategy
>  ?
> if yes, can we let spark load all data once from tachyon  for all sql query
>  if i want to use tachyon cache strategy since tachyon is more HA than
> memory_only ?
>


Re: (send this email to subscribe)

2014-07-08 Thread Ted Yu
This is the correct page: http://spark.apache.org/community.html

Cheers

On Jul 8, 2014, at 4:43 AM, Ted Yu  wrote:

> See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html
> 
> Cheers
> 
> On Jul 8, 2014, at 4:17 AM, Leon Zhang  wrote:
> 
>> 


Re: (send this email to subscribe)

2014-07-08 Thread Ted Yu
See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html

Cheers

On Jul 8, 2014, at 4:17 AM, Leon Zhang  wrote:

> 


(send this email to subscribe)

2014-07-08 Thread Leon Zhang



on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread qingyang li
Hi, when I create a table, I can specify the cache strategy using shark.cache.
I think "shark.cache=memory_only" means data are managed by Spark and live in
the same JVM as the executor, while "shark.cache=tachyon" means data are
managed by Tachyon, which is off-heap, and are not in the same JVM as the
executor, so Spark will load data from Tachyon for each SQL query. So, is
Tachyon less efficient than the memory_only cache strategy?
If yes, can we have Spark load all data from Tachyon once for all SQL queries,
if I want to use the Tachyon cache strategy, since Tachyon is more HA than
memory_only?


Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
zhengbing...@huawei.com> wrote:

>
> 1)  I download the imdb data from
> http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> LBFGS
> 2)  I find the imdb data are zero-based-index data
>

Since the method is for parsing the LIBSVM format, and its labels are
always 1-indexed IIUC, I don't think it would make sense to read 0-indexed
labels. It sounds like that input is not properly formatted, unless anyone
knows to the contrary?