Re: HA support for Spark

2014-12-10 Thread Jun Feng Liu
Right, perhaps we also need to preserve some DAG information? I am wondering if there is any work around this. Sandy Ryza

Re: SparkSQL not honoring schema

2014-12-10 Thread Alessandro Baretta
Hey Michael, Thanks for the clarification. I was actually assuming the query would fail. Ok, so this means I will have to do the validation in an RDD transformation feeding into the SchemaRDD. On Wed, Dec 10, 2014 at 6:27 PM, Michael Armbrust wrote: > As the scala doc for applySchema says, "It
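The validation step mentioned above could look roughly like the following. This is a minimal plain-Python sketch, not Spark code: the `SCHEMA` list, the `conforms` helper, and the sample rows are hypothetical names made up for illustration, and the Python type checks only stand in for whatever the real StructType would require.

```python
# Hypothetical schema for illustration: (field name, expected Python type, nullable).
SCHEMA = [
    ("id", int, False),
    ("name", str, True),
]


def conforms(row, schema=SCHEMA):
    """Return True if the row has one value per field and every value fits its field."""
    if len(row) != len(schema):
        return False
    for value, (_name, expected_type, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                return False
        # bool is a subclass of int in Python, so reject it explicitly for non-bool fields.
        elif isinstance(value, bool) and expected_type is not bool:
            return False
        elif not isinstance(value, expected_type):
            return False
    return True


# Sample data (made up): only the first and third rows match the schema.
rows = [(1, "a"), ("2", "b"), (3, None), (4,), (None, "e")]
valid_rows = [r for r in rows if conforms(r)]
```

In a Spark job the same predicate would presumably be passed to a filter on the RDD before the schema is applied, so that malformed rows never reach the SchemaRDD.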

Re: SparkSQL not honoring schema

2014-12-10 Thread Michael Armbrust
As the scala doc for applySchema says, "It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exceptions." We don't check as doing runtime reflection on all of the data would be very expensive. You will o

SparkSQL not honoring schema

2014-12-10 Thread Alessandro Baretta
Hello, I defined a SchemaRDD by applying a hand-crafted StructType to an RDD. Some of the Rows in the RDD are malformed--that is, they do not conform to the schema defined by the StructType. When running a select statement on this SchemaRDD I would expect SparkSQL to either reject the malformed ro

Re: Row Similarity

2014-12-10 Thread Reza Zadeh
Here we go: https://issues.apache.org/jira/browse/SPARK-4823 On Wed, Dec 10, 2014 at 9:01 PM, Debasish Das wrote: > I added code to compute topK products for each user and topK user for each > product in SPARK-3066.. > > That is different than row similarity calculation as we need both user and

Re: Row Similarity

2014-12-10 Thread Debasish Das
I added code to compute topK products for each user and topK users for each product in SPARK-3066. That is different from row similarity calculation, as we need both user and product factors to calculate the topK recommendations. For (1) and (2) we are trying to answer similarUsers to given a use
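To illustrate why topK recommendation needs both sets of factors, here is a small plain-Python sketch. It is not the SPARK-3066 code: the function name `top_k_products`, the dictionary layout, and the factor values are assumptions made up for this example. Each product is scored by the dot product of the user's factor vector with the product's factor vector.

```python
import heapq


def top_k_products(user_factors, product_factors, k):
    """For each user, return the k products with the highest dot-product score."""
    result = {}
    for user, u in user_factors.items():
        scored = (
            (sum(a * b for a, b in zip(u, p)), product)
            for product, p in product_factors.items()
        )
        # nlargest keeps only k candidates at a time instead of sorting all products.
        result[user] = [product for _score, product in heapq.nlargest(k, scored)]
    return result


# Made-up factors with two latent dimensions.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
product_factors = {"p1": [0.9, 0.1], "p2": [0.2, 0.8], "p3": [0.5, 0.5]}
recommendations = top_k_products(user_factors, product_factors, 2)
```

Swapping the two arguments gives the topK users for each product. Row similarity, by contrast, compares rows of a single matrix with each other.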

Re: Is Apache JIRA down?

2014-12-10 Thread Patrick Wendell
I believe many apache services are/were down due to an outage. On Wed, Dec 10, 2014 at 5:24 PM, Nicholas Chammas wrote: > Nevermind, seems to be back up now. > > On Wed Dec 10 2014 at 7:46:30 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> For example: https://issues.apache.org/ji

Is Apache JIRA down?

2014-12-10 Thread Nicholas Chammas
For example: https://issues.apache.org/jira/browse/SPARK-3431 Where do we report/track issues with JIRA itself being down? Nick

Re: Is Apache JIRA down?

2014-12-10 Thread Nicholas Chammas
Nevermind, seems to be back up now. On Wed Dec 10 2014 at 7:46:30 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > For example: https://issues.apache.org/jira/browse/SPARK-3431 > > Where do we report/track issues with JIRA itself being down? > > Nick >

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-10 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei > On Dec 10, 2014, at 1:08 PM, Patrick Wendell wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.2.0! > > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commi

Re: Row Similarity

2014-12-10 Thread Reza Zadeh
It's not so cheap to compute row similarities when there are many rows, as it amounts to computing the outer product of a matrix A (i.e. computing AA^T, which is expensive). There is a JIRA to track handling (1) and (2) more efficiently than computing all pairs: https://issues.apache.org/jira/brow
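The cost described above can be seen in a brute-force sketch. This is plain Python for illustration only, not the RowMatrix API: the function name `row_similarities` and the sample matrix are hypothetical. Cosine similarity over all pairs of rows touches every off-diagonal entry of AA^T, so the work grows quadratically with the number of rows.

```python
import math


def row_similarities(matrix):
    """Cosine similarity for every pair of rows (i < j): O(n^2 * d) work for n rows."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in matrix]
    sims = {}
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if norms[i] == 0.0 or norms[j] == 0.0:
                # Define similarity with an all-zero row as 0 to avoid dividing by zero.
                sims[(i, j)] = 0.0
                continue
            dot = sum(a * b for a, b in zip(matrix[i], matrix[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


# Made-up 3 x 2 matrix: rows 0 and 2 point in the same direction.
A = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
sims = row_similarities(A)
```

With n rows there are n(n-1)/2 pairs, which is why the thread asks for something cheaper than computing all of them.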

[VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-10 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc2 (commit a428c446e2): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e The release files, including signatures, digests, etc.

[RESULT] [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-10 Thread Patrick Wendell
This vote is closed in favor of RC2. On Fri, Dec 5, 2014 at 2:02 PM, Patrick Wendell wrote: > Hey All, > > Thanks all for the continued testing! > > The issue I mentioned earlier SPARK-4498 was fixed earlier this week > (hat tip to Mark Hamstra who contributed to fix). > > In the interim a few sm

Re: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy

2014-12-10 Thread Patrick Wendell
Hi Andrew, It looks like somehow you are including jars from the upstream Apache Hive 0.13 project on your classpath. For Spark 1.2 Hive 0.13 support, we had to modify Hive to use a different version of Kryo that was compatible with Spark's Kryo version. https://github.com/pwendell/hive/commit/5b

Row Similarity

2014-12-10 Thread Debasish Das
Hi, It seems there are multiple places where we would like to compute row similarity (accurate or approximate similarities). Basically, through RowMatrix columnSimilarities we can compute column similarities of a tall skinny matrix. Similarly, we should have an API in RowMatrix called rowSimilaritie

Re: jenkins downtime: 730-930am, 12/12/14

2014-12-10 Thread shane knapp
reminder -- this is happening friday morning @ 730am! On Mon, Dec 1, 2014 at 5:10 PM, shane knapp wrote: > i'll send out a reminder next week, but i wanted to give a heads up: i'll > be bringing down the entire jenkins infrastructure for reboots and system > updates. > > please let me know if t

RE: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy

2014-12-10 Thread Andrew Lee
Apologies for the format; somehow it got messed up and the linefeeds were removed. Here's a reformatted version. Hi All, I tried to include necessary libraries in SPARK_CLASSPATH in spark-env.sh to include auxiliary JARs and datanucleus*.jars from Hive, however, when I run HiveContext, it gives me

Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy

2014-12-10 Thread Andrew Lee
Hi All, I tried to include necessary libraries in SPARK_CLASSPATH in spark-env.sh to include auxiliary JARs and datanucleus*.jars from Hive, however, when I run HiveContext, it gives me the following error: Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.

Re: HA support for Spark

2014-12-10 Thread Sandy Ryza
I think that if we were able to maintain the full set of created RDDs as well as some scheduler and block manager state, it would be enough for most apps to recover. On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu wrote: > Well, it should not be mission impossible thinking there are so many HA > s

Tachyon in Spark

2014-12-10 Thread Jun Feng Liu
Does Spark today really leverage Tachyon lineage to process data? It seems like the application should call the createDependency function in TachyonFS to create a new lineage node. But I did not find any place that calls it in the Spark code. Did I miss anything? Best Regards Jun Feng Liu IBM China System

Re: HA support for Spark

2014-12-10 Thread Jun Feng Liu
Well, it should not be mission impossible, considering there are so many HA solutions existing today. I would be interested to know if there is any specific difficulty. Best Regards Jun Feng Liu IBM China Systems & Technology Laboratory in Beijing Phone: 86-10-82452683 E-mail: liuj...@cn.ibm.com B

Maven profile in MLLib netlib-lgpl not working (1.1.1)

2014-12-10 Thread Guillaume Pitel
Hi, Issue created: https://issues.apache.org/jira/browse/SPARK-4816 Probably a Maven-related question about profiles in child modules. I couldn't find a clean solution, just a workaround: modify pom.xml in the mllib module to force activation of the netlib-lgpl module. Hope a Maven expert will help. Gu

Re: HA support for Spark

2014-12-10 Thread Reynold Xin
This would be plausible for specific purposes such as Spark Streaming or Spark SQL, but I don't think it is doable for a general Spark driver, since it is just a normal JVM process with arbitrary program state. On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu wrote: > Do we have any high availability

HA support for Spark

2014-12-10 Thread Jun Feng Liu
Do we have any high availability support at the Spark driver level? For example, can the Spark driver move to another node and continue execution when a failure happens? I can see that the RDD checkpoint can help to serialize the status of an RDD. I can imagine loading the checkpoint from another node wh