Re: matrix computation in spark

2014-11-17 Thread Zongheng Yang
There's been some work at the AMPLab on a distributed matrix library on top
of Spark; see [1]. In particular, the repo contains a couple of
factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi  wrote:

> Hi,
> Matrix computation is critical to the efficiency of algorithms such as
> least squares and the Kalman filter.
> At the moment, the MLlib module offers only limited linear algebra on
> matrices, especially distributed matrices.
>
> We have been working on distributed matrix computation APIs built on
> MLlib's data structures.
> The main idea is to partition the matrix into sub-blocks, following the
> strategy in this paper:
> http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
> In our experiments, this strategy is communication-optimal.
> However, operations such as factorization may not be well suited to
> block-wise execution.
>
> Any suggestions and guidance are welcome.
>
> Thanks,
> Yuxi
>
>
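
For readers unfamiliar with the block-partitioned layout described above, here
is a minimal sketch of the representation and a naive block multiply. This is
illustrative only: Breeze's DenseMatrix is assumed for the local tiles, the
names are made up, and the simple join-based shuffle below does not implement
the communication-optimal scheduling from the paper.

import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseMatrix

object BlockMatrixSketch {
  // A distributed matrix as (blockRow, blockCol)-indexed local tiles.
  type Block = ((Int, Int), DenseMatrix[Double])

  // C = A * B: join tiles sharing an inner index k, multiply locally,
  // then sum the partial products for each output tile (i, j).
  def blockMultiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
    val aByInner = a.map { case ((i, k), m) => (k, (i, m)) }
    val bByInner = b.map { case ((k, j), m) => (k, (j, m)) }
    aByInner.join(bByInner)
      .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }
      .reduceByKey(_ + _)
  }
}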


Re: preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Zongheng Yang
Hi Will,

These three environment variables are needed [1].

I have had success with Hive 0.12 and Hadoop 1.0.4. For Hive, using the
source distribution seems to be required. A docs contribution would be
much appreciated!

[1] https://github.com/apache/spark/tree/master/sql#other-dependencies-for-developers

Zongheng

On Thu, Jul 17, 2014 at 7:51 PM, Will Benton  wrote:
> Hi all,
>
> What's the preferred environment for generating golden test outputs for new 
> Hive tests?  In particular:
>
> * what Hadoop version and Hive version should I be using,
> * are there particular distributions people have run successfully, and
> * are there any system properties or environment variables (beyond 
> HADOOP_HOME, HIVE_HOME, and HIVE_DEV_HOME) I need to set before running the 
> suite?
>
> I ask because I'm getting some errors while trying to add new tests and would 
> like to eliminate any possible problems caused by differences between what my 
> environment offers and what Spark expects.  (I'm currently running with the 
> Fedora packages for Hadoop 2.2.0 and a locally-built Hive 0.12.0.)  Since 
> I'll only be using this for generating test outputs, something as simple to 
> set up as possible would be great.
>
> (Once I get something working, I'll be happy to write it up and contribute it 
> as developer docs.)
>
>
> thanks,
> wb
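
For anyone setting this up, a quick pre-flight check could look like the
sketch below. This is not part of the test harness; it merely verifies that
the three variables named above are exported before the suite is run.

// Sanity check before running the Hive comparison tests (illustrative only).
val required = Seq("HADOOP_HOME", "HIVE_HOME", "HIVE_DEV_HOME")
val missing = required.filterNot(sys.env.contains)
require(missing.isEmpty, s"Missing environment variables: ${missing.mkString(", ")}")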


Re: compilation error in Catalyst module

2014-08-06 Thread Zongheng Yang
Hi Ted,

By refreshing, do you mean that you have run 'mvn clean'?

On Wed, Aug 6, 2014 at 1:17 PM, Ted Yu  wrote:
> I refreshed my workspace.
> I got the following error with this command:
>
> mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install
>
> [ERROR] bad symbolic reference. A signature in package.class refers to term
> scalalogging
> in package com.typesafe which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling
> package.class.
> [ERROR]
> /homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
> bad symbolic reference. A signature in package.class refers to term slf4j
> in value com.typesafe.scalalogging which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling
> package.class.
> [ERROR] package object trees extends Logging {
> [ERROR]  ^
> [ERROR] two errors found
>
> Has anyone else seen the above?
>
> Thanks




Re: Getting the execution times of spark job

2014-09-02 Thread Zongheng Yang
For your second question: hql() (like sql()) does not launch a Spark job
immediately; it first runs the Spark SQL parser/optimizer/planner pipeline,
and a Spark job is launched only after a physical execution plan has been
selected. Your hand-rolled end-to-end measurement therefore includes the
time spent in the Spark SQL code path, whereas the times reported in the UI
cover only the execution of the Spark job(s).
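
On the first question: per-job timings are available programmatically through
the listener API. Below is a minimal sketch, assuming the Spark 1.x
SparkListener events (which carry a jobId); wall-clock times are taken inside
the callbacks because these events do not expose timestamps in this release.
The listener name is made up.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.mutable

class JobTimingListener extends SparkListener {
  private val startTimes = mutable.Map[Int, Long]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    startTimes(jobStart.jobId) = System.nanoTime()

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    startTimes.remove(jobEnd.jobId).foreach { start =>
      val millis = (System.nanoTime() - start) / 1e6
      println(s"Job ${jobEnd.jobId} finished in $millis ms")
    }
}

// Register on the underlying SparkContext before issuing queries:
// sc.addSparkListener(new JobTimingListener)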

On Mon, Sep 1, 2014 at 11:45 PM, Niranda Perera  wrote:
> Hi,
>
> I have been playing around with Spark for a couple of days. I am
> using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the
> implementation is to run Hive queries on Spark. I used JavaHiveContext to
> achieve this (as per the examples).
>
> I have 2 questions.
> 1. I am wondering how I could get the execution times of a spark job? Does
> Spark provide monitoring facilities in the form of an API?
>
> 2. I used a naive approach to get the execution times, wrapping a
> JavaHiveContext.hql call with System.nanoTime() as follows:
>
> long start, end;
> JavaHiveContext hiveCtx;
> JavaSchemaRDD hiveResult;
>
> start = System.nanoTime();
> hiveResult = hiveCtx.hql(query);
> end = System.nanoTime();
> // elapsed wall-clock time in nanoseconds (end - start, not start - end)
> System.out.println(end - start);
>
> But the result I got is drastically different from the execution times
> recorded in the Spark UI. Can you please explain this disparity?
>
> Look forward to hearing from you.
>
> rgds
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 




Re: Matrix operations in Scala / Spark

2014-10-25 Thread Zongheng Yang
We recently released a research prototype of a lightweight matrix library
for Spark: https://github.com/amplab/ml-matrix. It supports norm and
subtraction; feel free to base your implementation on it.

Zongheng
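
If a full distributed library is more than you need, note that MLlib already
depends on Breeze, so local dense matrices cover subtraction directly; the
Frobenius norm is computed by hand below to avoid version-specific norm
helpers. A small illustrative sketch:

import breeze.linalg.DenseMatrix

val a = DenseMatrix((1.0, 2.0), (3.0, 4.0))
val b = DenseMatrix((0.5, 1.5), (2.5, 3.5))

val diff = a - b                                         // element-wise subtraction
val frobenius = math.sqrt(diff.data.map(x => x * x).sum) // Frobenius norm

println(s"||a - b||_F = $frobenius")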
On Sat, Oct 25, 2014 at 07:12 Xuefeng Wu  wrote:

> How about non/spire or twitter/scalding?
>
>
> Yours respectfully, Xuefeng Wu (吴雪峰)
>
> > On October 25, 2014, at 9:03 PM, salexln  wrote:
> >
> > Hi guys,
> >
> > I'm working on an implementation of the FuzzyCMeans algorithm (JIRA:
> > https://issues.apache.org/jira/browse/SPARK-2344)
> > and I need some operations on matrices (norm & subtraction).
> >
> > I could not find any Scala / Spark matrix class that supports these
> > operations.
> >
> > Should I implement the matrix as a two-dimensional array and write my own
> > code for the norm and subtraction?
> >
> > --
> > View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Matix-operations-in-Scala-Spark-tp8959.html
> > Sent from the Apache Spark Developers List mailing list archive at Nabble.com.