Re: Compile failure with SBT on master

2014-06-16 Thread Ted Yu
I used the same command on Linux and it passed:

Linux k.net 2.6.32-220.23.1.el6.YAHOO.20120713.x86_64 #1 SMP Fri Jul 13
11:40:51 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Cheers


On Mon, Jun 16, 2014 at 9:29 PM, Andrew Ash and...@andrewash.com wrote:

 I can't run sbt/sbt gen-idea on a clean checkout of Spark master.

 I get resolution errors on junit#junit;4.10!junit.zip(source)

 As shown below:

 aash@aash-mbp /tmp/git/spark$ sbt/sbt gen-idea
 Using /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home as
 default JAVA_HOME.
 Note, this will be overridden by -java-home if it is set.
 [info] Loading project definition from
 /private/tmp/git/spark/project/project
 [info] Loading project definition from /private/tmp/git/spark/project
 [info] Set current project to root (in build file:/private/tmp/git/spark/)
 [info] Creating IDEA module for project 'assembly' ...
 [info] Updating {file:/private/tmp/git/spark/}core...
 [info] Resolving org.fusesource.jansi#jansi;1.4 ...
 [warn] [FAILED ] junit#junit;4.10!junit.zip(source):  (0ms)
 [warn]  local: tried
 [warn]   /Users/aash/.ivy2/local/junit/junit/4.10/sources/junit.zip
 [warn]  public: tried
 [warn]   http://repo1.maven.org/maven2/junit/junit/4.10/junit-4.10.zip
 [warn]  Maven Repository: tried
 [warn]
 http://repo.maven.apache.org/maven2/junit/junit/4.10/junit-4.10.zip
 [warn]  Apache Repository: tried
 [warn]

 https://repository.apache.org/content/repositories/releases/junit/junit/4.10/junit-4.10.zip
 [warn]  JBoss Repository: tried
 [warn]

 https://repository.jboss.org/nexus/content/repositories/releases/junit/junit/4.10/junit-4.10.zip
 [warn]  MQTT Repository: tried
 [warn]

 https://repo.eclipse.org/content/repositories/paho-releases/junit/junit/4.10/junit-4.10.zip
 [warn]  Cloudera Repository: tried
 [warn]

 http://repository.cloudera.com/artifactory/cloudera-repos/junit/junit/4.10/junit-4.10.zip
 [warn]  Pivotal Repository: tried
 [warn]
 http://repo.spring.io/libs-release/junit/junit/4.10/junit-4.10.zip
 [warn]  Maven2 Local: tried
 [warn]   file:/Users/aash/.m2/repository/junit/junit/4.10/junit-4.10.zip
 [warn] ::
 [warn] ::  FAILED DOWNLOADS::
 [warn] :: ^ see resolution messages for details  ^ ::
 [warn] ::
 [warn] :: junit#junit;4.10!junit.zip(source)
 [warn] ::
 sbt.ResolveException: download failed: junit#junit;4.10!junit.zip(source)

 By bumping the junit dependency to 4.11 I'm able to generate the IDE files.
  Are other people having this problem or does everyone use the maven
 configuration?

 Andrew
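
 For reference, the bump Andrew describes would look roughly like the following
 in an sbt build definition. This is only a sketch with illustrative
 coordinates; Spark's actual build declares the junit dependency in its own
 build files:

   libraryDependencies += "junit" % "junit" % "4.11" % "test"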



Re: Compile failure with SBT on master

2014-06-17 Thread Ted Yu
I didn't get that error on Mac either:

java version 1.7.0_55
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Darwin TYus-MacBook-Pro.local 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep
29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64


On Mon, Jun 16, 2014 at 10:04 PM, Andrew Ash and...@andrewash.com wrote:

 Maybe it's a Mac OS X thing?


 On Mon, Jun 16, 2014 at 9:57 PM, Ted Yu yuzhih...@gmail.com wrote:

  I used the same command on Linux and it passed:
 
  Linux k.net 2.6.32-220.23.1.el6.YAHOO.20120713.x86_64 #1 SMP Fri Jul 13
  11:40:51 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux
 
  Cheers
 
 
  On Mon, Jun 16, 2014 at 9:29 PM, Andrew Ash and...@andrewash.com
 wrote:
 
   I can't run sbt/sbt gen-idea on a clean checkout of Spark master.
  
   I get resolution errors on junit#junit;4.10!junit.zip(source)
  
   As shown below:
  
   aash@aash-mbp /tmp/git/spark$ sbt/sbt gen-idea
   Using /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home
 as
   default JAVA_HOME.
   Note, this will be overridden by -java-home if it is set.
   [info] Loading project definition from
   /private/tmp/git/spark/project/project
   [info] Loading project definition from /private/tmp/git/spark/project
   [info] Set current project to root (in build
  file:/private/tmp/git/spark/)
   [info] Creating IDEA module for project 'assembly' ...
   [info] Updating {file:/private/tmp/git/spark/}core...
   [info] Resolving org.fusesource.jansi#jansi;1.4 ...
   [warn] [FAILED ] junit#junit;4.10!junit.zip(source):  (0ms)
   [warn]  local: tried
   [warn]   /Users/aash/.ivy2/local/junit/junit/4.10/sources/junit.zip
   [warn]  public: tried
   [warn]   http://repo1.maven.org/maven2/junit/junit/4.10/junit-4.10.zip
   [warn]  Maven Repository: tried
   [warn]
   http://repo.maven.apache.org/maven2/junit/junit/4.10/junit-4.10.zip
   [warn]  Apache Repository: tried
   [warn]
  
  
 
 https://repository.apache.org/content/repositories/releases/junit/junit/4.10/junit-4.10.zip
   [warn]  JBoss Repository: tried
   [warn]
  
  
 
 https://repository.jboss.org/nexus/content/repositories/releases/junit/junit/4.10/junit-4.10.zip
   [warn]  MQTT Repository: tried
   [warn]
  
  
 
 https://repo.eclipse.org/content/repositories/paho-releases/junit/junit/4.10/junit-4.10.zip
   [warn]  Cloudera Repository: tried
   [warn]
  
  
 
 http://repository.cloudera.com/artifactory/cloudera-repos/junit/junit/4.10/junit-4.10.zip
   [warn]  Pivotal Repository: tried
   [warn]
   http://repo.spring.io/libs-release/junit/junit/4.10/junit-4.10.zip
   [warn]  Maven2 Local: tried
   [warn]
 file:/Users/aash/.m2/repository/junit/junit/4.10/junit-4.10.zip
   [warn] ::
   [warn] ::  FAILED DOWNLOADS::
   [warn] :: ^ see resolution messages for details  ^ ::
   [warn] ::
   [warn] :: junit#junit;4.10!junit.zip(source)
   [warn] ::
   sbt.ResolveException: download failed:
 junit#junit;4.10!junit.zip(source)
  
   By bumping the junit dependency to 4.11 I'm able to generate the IDE
  files.
Are other people having this problem or does everyone use the maven
   configuration?
  
   Andrew
  
 



Re: (send this email to subscribe)

2014-07-08 Thread Ted Yu
See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html

Cheers

On Jul 8, 2014, at 4:17 AM, Leon Zhang leonca...@gmail.com wrote:

 


Re: (send this email to subscribe)

2014-07-08 Thread Ted Yu
This is the correct page: http://spark.apache.org/community.html

Cheers

On Jul 8, 2014, at 4:43 AM, Ted Yu yuzhih...@gmail.com wrote:

 See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html
 
 Cheers
 
 On Jul 8, 2014, at 4:17 AM, Leon Zhang leonca...@gmail.com wrote:
 
 


Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Ted Yu
HADOOP-10456 is fixed in Hadoop 2.4.1.

Does this mean that the synchronization
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for Hadoop
2.4.1?

Cheers
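
A rough sketch of what such version gating could look like (this is not
Spark's actual HadoopRDD code; the boolean flag standing in for a real Hadoop
version check is purely illustrative):

  import org.apache.hadoop.conf.Configuration

  object ConfSketch {
    // stand-in for HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK
    private val lock = new Object

    // confRaceFixed: whether the running Hadoop includes the HADOOP-10456 fix (2.4.1+)
    def newConfiguration(confRaceFixed: Boolean): Configuration =
      if (confRaceFixed) {
        new Configuration()  // constructor is thread-safe once HADOOP-10456 is in
      } else {
        lock.synchronized { new Configuration() }  // guard the racy constructor on older Hadoop
      }
  }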


On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote:

 The most important issue in this release is actually an ammendment to
 an earlier fix. The original fix caused a deadlock which was a
 regression from 1.0.0-1.0.1:

 Issue:
 https://issues.apache.org/jira/browse/SPARK-1097

 1.0.1 Fix:
 https://github.com/apache/spark/pull/1273/files (had a deadlock)

 1.0.2 Fix:
 https://github.com/apache/spark/pull/1409/files

 I failed to correctly label this on JIRA, but I've updated it!

 On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
 mich...@databricks.com wrote:
  That query is looking at Fix Version not Target Version.  The fact
 that
  the first one is still open is only because the bug is not resolved in
  master.  It is fixed in 1.0.2.  The second one is partially fixed in
 1.0.2,
  but is not worth blocking the release for.
 
 
  On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  TD, there are a couple of unresolved issues slated for 1.0.2
  
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
  .
  Should they be edited somehow?
 
 
  On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das 
  tathagata.das1...@gmail.com
  wrote:
 
   Please vote on releasing the following candidate as Apache Spark
 version
   1.0.2.
  
   This release fixes a number of bugs in Spark 1.0.1.
   Some of the notable ones are
    - SPARK-2452: Known issue in Spark 1.0.1, caused by the attempted fix for
    SPARK-1199. The fix was reverted for 1.0.2.
   - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
   HDFS CSV file.
   The full list is at http://s.apache.org/9NJ
  
   The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
  
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
  
   The release files, including signatures, digests, etc can be found at:
   http://people.apache.org/~tdas/spark-1.0.2-rc1/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/tdas.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1024/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
  
   Please vote on releasing this package as Apache Spark 1.0.2!
  
   The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
   a majority of at least 3 +1 PMC votes are cast.
   [ ] +1 Release this package as Apache Spark 1.0.2
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
 



Re: Working Formula for Hive 0.13?

2014-07-28 Thread Ted Yu
I found the 0.13.1 artifacts in Maven Central:
http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-metastore%7C0.13.1%7Cjar

However, Spark uses the groupId org.spark-project.hive, not org.apache.hive.

Can someone tell me how this is supposed to work?

Cheers


On Mon, Jul 28, 2014 at 7:44 AM, Steve Nunez snu...@hortonworks.com wrote:

 I saw a note earlier, perhaps on the user list, that at least one person is
 using Hive 0.13. Anyone got a working build configuration for this version
 of Hive?

 Regards,
 - Steve



 --
 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity to
 which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.



Re: Working Formula for Hive 0.13?

2014-07-28 Thread Ted Yu
Talked with Owen offline. He confirmed that as of 0.13, hive-exec is still an
uber jar.

Right now I am facing the following error building against Hive 0.13.1:

[ERROR] Failed to execute goal on project spark-hive_2.10: Could not
resolve dependencies for project
org.apache.spark:spark-hive_2.10:jar:1.1.0-SNAPSHOT: The following
artifacts could not be resolved:
org.spark-project.hive:hive-metastore:jar:0.13.1,
org.spark-project.hive:hive-exec:jar:0.13.1,
org.spark-project.hive:hive-serde:jar:0.13.1: Failure to find
org.spark-project.hive:hive-metastore:jar:0.13.1 in
http://repo.maven.apache.org/maven2 was cached in the local repository,
resolution will not be reattempted until the update interval of maven-repo
has elapsed or updates are forced - [Help 1]

Some hint would be appreciated.

Cheers
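
For anyone trying to reproduce: one way to place such jars into the local
repository by hand is mvn install:install-file, along these lines (the file
name and coordinates below are illustrative):

  mvn install:install-file -Dfile=hive-metastore-0.13.1.jar \
    -DgroupId=org.spark-project.hive -DartifactId=hive-metastore \
    -Dversion=0.13.1 -Dpackaging=jar

Of course that only sidesteps the resolution failure if you actually have such
a jar to install.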


On Mon, Jul 28, 2014 at 9:15 AM, Sean Owen so...@cloudera.com wrote:

 Yes, it is published. As of previous versions, at least, hive-exec
 included all of its dependencies *in its artifact*, making it unusable
 as-is because it contained copies of dependencies that clash with
 versions present in other artifacts, and can't be managed with Maven
 mechanisms.

 I am not sure why hive-exec was not published normally, with just its
 own classes. That's why it was copied, into an artifact with just
 hive-exec code.

 You could do the same thing for hive-exec 0.13.1.
 Or maybe someone knows that it's published more 'normally' now.
 I don't think hive-metastore is related to this question?

 I am no expert on the Hive artifacts, just remembering what the issue
 was initially in case it helps you get to a similar solution.

 On Mon, Jul 28, 2014 at 4:47 PM, Ted Yu yuzhih...@gmail.com wrote:
  hive-exec (as of 0.13.1) is published here:
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-exec%7C0.13.1%7Cjar
 
  Should a JIRA be opened so that dependency on hive-metastore can be
  replaced by dependency on hive-exec ?
 
  Cheers
 
 
  On Mon, Jul 28, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote:
 
  The reason for org.spark-project.hive is that Spark relies on
  hive-exec, but the Hive project does not publish this artifact by
  itself, only with all its dependencies as an uber jar. Maybe that's
  been improved. If so, you need to point at the new hive-exec and
  perhaps sort out its dependencies manually in your build.
 
  On Mon, Jul 28, 2014 at 4:01 PM, Ted Yu yuzhih...@gmail.com wrote:
   I found 0.13.1 artifacts in maven:
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-metastore%7C0.13.1%7Cjar
  
   However, Spark uses groupId of org.spark-project.hive, not
  org.apache.hive
  
   Can someone tell me how it is supposed to work ?
  
   Cheers
  
  
   On Mon, Jul 28, 2014 at 7:44 AM, Steve Nunez snu...@hortonworks.com
  wrote:
  
   I saw a note earlier, perhaps on the user list, that at least one
  person is
   using Hive 0.13. Anyone got a working build configuration for this
  version
   of Hive?
  
   Regards,
   - Steve
  
  
  
   --
   CONFIDENTIALITY NOTICE
   NOTICE: This message is intended for the use of the individual or
  entity to
   which it is addressed and may contain information that is
 confidential,
   privileged and exempt from disclosure under applicable law. If the
  reader
   of this message is not the intended recipient, you are hereby
 notified
  that
   any printing, copying, dissemination, distribution, disclosure or
   forwarding of this communication is strictly prohibited. If you have
   received this communication in error, please contact the sender
  immediately
   and delete it from your system. Thank You.
  
 



Re: Working Formula for Hive 0.13?

2014-07-28 Thread Ted Yu
Owen helped me find this:
https://issues.apache.org/jira/browse/HIVE-7423

I guess this means that for Hive 0.14, Spark should be able to directly
pull in hive-exec-core.jar

Cheers
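
If that happens, the dependency could presumably be expressed along these
lines in sbt; both the version and the classifier below are assumptions based
on HIVE-7423, not something that has been published yet:

  libraryDependencies += "org.apache.hive" % "hive-exec" % "0.14.0" classifier "core"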


On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend...@gmail.com wrote:

 It would be great if the hive team can fix that issue. If not, we'll
 have to continue forking our own version of Hive to change the way it
 publishes artifacts.

 - Patrick

 On Mon, Jul 28, 2014 at 9:34 AM, Ted Yu yuzhih...@gmail.com wrote:
  Talked with Owen offline. He confirmed that as of 0.13, hive-exec is
 still
  uber jar.
 
  Right now I am facing the following error building against Hive 0.13.1 :
 
  [ERROR] Failed to execute goal on project spark-hive_2.10: Could not
  resolve dependencies for project
  org.apache.spark:spark-hive_2.10:jar:1.1.0-SNAPSHOT: The following
  artifacts could not be resolved:
  org.spark-project.hive:hive-metastore:jar:0.13.1,
  org.spark-project.hive:hive-exec:jar:0.13.1,
  org.spark-project.hive:hive-serde:jar:0.13.1: Failure to find
  org.spark-project.hive:hive-metastore:jar:0.13.1 in
  http://repo.maven.apache.org/maven2 was cached in the local repository,
  resolution will not be reattempted until the update interval of
 maven-repo
  has elapsed or updates are forced - [Help 1]
 
  Some hint would be appreciated.
 
  Cheers
 
 
  On Mon, Jul 28, 2014 at 9:15 AM, Sean Owen so...@cloudera.com wrote:
 
  Yes, it is published. As of previous versions, at least, hive-exec
  included all of its dependencies *in its artifact*, making it unusable
  as-is because it contained copies of dependencies that clash with
  versions present in other artifacts, and can't be managed with Maven
  mechanisms.
 
  I am not sure why hive-exec was not published normally, with just its
  own classes. That's why it was copied, into an artifact with just
  hive-exec code.
 
  You could do the same thing for hive-exec 0.13.1.
  Or maybe someone knows that it's published more 'normally' now.
  I don't think hive-metastore is related to this question?
 
  I am no expert on the Hive artifacts, just remembering what the issue
  was initially in case it helps you get to a similar solution.
 
  On Mon, Jul 28, 2014 at 4:47 PM, Ted Yu yuzhih...@gmail.com wrote:
   hive-exec (as of 0.13.1) is published here:
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-exec%7C0.13.1%7Cjar
  
   Should a JIRA be opened so that dependency on hive-metastore can be
   replaced by dependency on hive-exec ?
  
   Cheers
  
  
   On Mon, Jul 28, 2014 at 8:26 AM, Sean Owen so...@cloudera.com
 wrote:
  
   The reason for org.spark-project.hive is that Spark relies on
   hive-exec, but the Hive project does not publish this artifact by
   itself, only with all its dependencies as an uber jar. Maybe that's
   been improved. If so, you need to point at the new hive-exec and
   perhaps sort out its dependencies manually in your build.
  
   On Mon, Jul 28, 2014 at 4:01 PM, Ted Yu yuzhih...@gmail.com wrote:
I found 0.13.1 artifacts in maven:
   
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-metastore%7C0.13.1%7Cjar
   
However, Spark uses groupId of org.spark-project.hive, not
   org.apache.hive
   
Can someone tell me how it is supposed to work ?
   
Cheers
   
   
On Mon, Jul 28, 2014 at 7:44 AM, Steve Nunez 
 snu...@hortonworks.com
   wrote:
   
I saw a note earlier, perhaps on the user list, that at least one
   person is
using Hive 0.13. Anyone got a working build configuration for this
   version
of Hive?
   
Regards,
- Steve
   
   
   
--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or
   entity to
which it is addressed and may contain information that is
  confidential,
privileged and exempt from disclosure under applicable law. If the
   reader
of this message is not the intended recipient, you are hereby
  notified
   that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you
 have
received this communication in error, please contact the sender
   immediately
and delete it from your system. Thank You.
   
  
 



Re: Working Formula for Hive 0.13?

2014-07-28 Thread Ted Yu
After manually copying the Hive 0.13.1 jars to my local Maven repo, I got the
following errors when building the spark-hive_2.10 module:

[ERROR]
/homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182:
type mismatch;
 found   : String
 required: Array[String]
[ERROR]   val proc: CommandProcessor =
CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR]
 ^
[ERROR]
/homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60:
value getAllPartitionsForPruner is not a member of org.apache.
 hadoop.hive.ql.metadata.Hive
[ERROR] client.getAllPartitionsForPruner(table).toSeq
[ERROR]^
[ERROR]
/homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267:
overloaded method constructor TableDesc with alternatives:
  (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2:
Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc
and
  ()org.apache.hadoop.hive.ql.plan.TableDesc
 cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer],
Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in
value tableDesc)(in   value tableDesc)], java.util.Properties)
[ERROR]   val tableDesc = new TableDesc(
[ERROR]   ^
[WARNING] Class org.antlr.runtime.tree.CommonTree not found - continuing
with a stub.
[WARNING] Class org.antlr.runtime.Token not found - continuing with a stub.
[WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with a
stub.
[ERROR]
 while compiling:
/homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
during phase: typer
 library version: version 2.10.4
compiler version: version 2.10.4

The above shows incompatible changes between 0.12 and 0.13.1;
e.g., the first error corresponds to the following method
in CommandProcessorFactory:
  public static CommandProcessor get(String[] cmd, HiveConf conf)

Cheers
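
For the first error, the call site would just need to adapt to the new
signature. A minimal sketch only (not the actual HiveContext code, which
carries more context around this call):

  import org.apache.hadoop.hive.conf.HiveConf
  import org.apache.hadoop.hive.ql.processors.{CommandProcessor, CommandProcessorFactory}

  // Hive 0.13's CommandProcessorFactory.get takes String[] where 0.12 accepted a single String
  def processorFor(cmd: String, hiveconf: HiveConf): CommandProcessor = {
    val tokens = cmd.trim.split("""\s+""")
    CommandProcessorFactory.get(Array(tokens(0)), hiveconf)
  }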


On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com wrote:

 So, do we have a short-term fix until Hive 0.14 comes out? Perhaps adding
 the hive-exec jar to the spark-project repo? It doesn't look like there's
 a release date schedule for 0.14.



 On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote:

 Exactly, forgot to mention Hulu team also made changes to cope with those
 incompatibility issues, but they said that's relatively easy once the
 re-packaging work is done.
 
 
 On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  I've heard from Cloudera that there were hive internal changes between
  0.12 and 0.13 that required code re-writing. Over time it might be
  possible for us to integrate with hive using API's that are more
  stable (this is the domain of Michael/Cheng/Yin more than me!). It
  would be interesting to see what the Hulu folks did.
 
  - Patrick
 
  On Mon, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com
  wrote:
    AFAIK, according to a recent talk, the Hulu team in China has built Spark SQL
    against Hive 0.13 (or 0.13.1?) successfully. Basically they also
    re-packaged Hive 0.13 the way the Spark team did. The slides of the talk
    haven't been released yet though.
  
  
   On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu yuzhih...@gmail.com wrote:
  
   Owen helped me find this:
   https://issues.apache.org/jira/browse/HIVE-7423
  
   I guess this means that for Hive 0.14, Spark should be able to
 directly
   pull in hive-exec-core.jar
  
   Cheers
  
  
   On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend...@gmail.com
 
   wrote:
  
It would be great if the hive team can fix that issue. If not,
 we'll
have to continue forking our own version of Hive to change the way
 it
publishes artifacts.
   
- Patrick
   
On Mon, Jul 28, 2014 at 9:34 AM, Ted Yu yuzhih...@gmail.com
 wrote:
 Talked with Owen offline. He confirmed that as of 0.13,
 hive-exec is
still
 uber jar.

 Right now I am facing the following error building against Hive
  0.13.1
   :

 [ERROR] Failed to execute goal on project spark-hive_2.10: Could
 not
 resolve dependencies for project
 org.apache.spark:spark-hive_2.10:jar:1.1.0-SNAPSHOT: The
 following
 artifacts could not be resolved:
 org.spark-project.hive:hive-metastore:jar:0.13.1,
 org.spark-project.hive:hive-exec:jar:0.13.1,
 org.spark-project.hive:hive-serde:jar:0.13.1: Failure to find
 org.spark-project.hive:hive-metastore:jar:0.13.1 in
 http://repo.maven.apache.org/maven2 was cached in the local
   repository,
 resolution will not be reattempted until the update interval of
maven-repo
 has elapsed or updates are forced - [Help 1]

 Some hint would be appreciated.

 Cheers


 On Mon, Jul 28, 2014 at 9:15 AM, Sean Owen so...@cloudera.com
  wrote:

 Yes, it is published. As of previous versions, at least,
 hive-exec
 included all of its

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Ted Yu
I was looking for a class where reflection-related code should reside.

I found this, but I don't think it is the proper class for bridging
differences between Hive 0.12 and 0.13.1:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

Cheers
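
Something along these lines is what such bridging could look like; this is
only a sketch, and the 0.13 method name used in the fallback is an assumption:

  import scala.collection.JavaConverters._
  import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

  // pick whichever partition-listing method the Hive version on the classpath provides
  def allPartitions(client: Hive, table: Table): Seq[Partition] = {
    val m =
      try client.getClass.getMethod("getAllPartitionsForPruner", classOf[Table])  // Hive 0.12
      catch {
        case _: NoSuchMethodException =>
          client.getClass.getMethod("getAllPartitionsOf", classOf[Table])  // assumed 0.13 name
      }
    m.invoke(client, table).asInstanceOf[java.util.Collection[Partition]].asScala.toSeq
  }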


On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote:

 After manually copying hive 0.13.1 jars to local maven repo, I got the
 following errors when building spark-hive_2.10 module :

 [ERROR]
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182:
 type mismatch;
  found   : String
  required: Array[String]
 [ERROR]   val proc: CommandProcessor =
 CommandProcessorFactory.get(tokens(0), hiveconf)
 [ERROR]
^
 [ERROR]
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60:
 value getAllPartitionsForPruner is not a member of org.apache.
  hadoop.hive.ql.metadata.Hive
 [ERROR] client.getAllPartitionsForPruner(table).toSeq
 [ERROR]^
 [ERROR]
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267:
 overloaded method constructor TableDesc with alternatives:
   (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2:
 Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc
 and
   ()org.apache.hadoop.hive.ql.plan.TableDesc
  cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer],
 Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in
 value tableDesc)(in   value tableDesc)], java.util.Properties)
 [ERROR]   val tableDesc = new TableDesc(
 [ERROR]   ^
 [WARNING] Class org.antlr.runtime.tree.CommonTree not found - continuing
 with a stub.
 [WARNING] Class org.antlr.runtime.Token not found - continuing with a stub.
 [WARNING] Class org.antlr.runtime.tree.Tree not found - continuing with a
 stub.
 [ERROR]
  while compiling:
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
 during phase: typer
  library version: version 2.10.4
 compiler version: version 2.10.4

 The above shows incompatible changes between 0.12 and 0.13.1
 e.g. the first error corresponds to the following method
 in CommandProcessorFactory :
   public static CommandProcessor get(String[] cmd, HiveConf conf)

 Cheers


 On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com
 wrote:

 So, do we have a short-term fix until Hive 0.14 comes out? Perhaps adding
 the hive-exec jar to the spark-project repo? It doesn't look like there's
 a release date schedule for 0.14.



 On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote:

 Exactly, forgot to mention Hulu team also made changes to cope with those
 incompatibility issues, but they said that's relatively easy once the
 re-packaging work is done.
 
 
 On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com

 wrote:
 
  I've heard from Cloudera that there were hive internal changes between
  0.12 and 0.13 that required code re-writing. Over time it might be
  possible for us to integrate with hive using API's that are more
  stable (this is the domain of Michael/Cheng/Yin more than me!). It
  would be interesting to see what the Hulu folks did.
 
  - Patrick
 
  On Mon, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com
  wrote:
   AFAIK, according a recent talk, Hulu team in China has built Spark
 SQL
   against Hive 0.13 (or 0.13.1?) successfully. Basically they also
   re-packaged Hive 0.13 as what the Spark team did. The slides of the
 talk
   hasn't been released yet though.
  
  
   On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu yuzhih...@gmail.com wrote:
  
   Owen helped me find this:
   https://issues.apache.org/jira/browse/HIVE-7423
  
   I guess this means that for Hive 0.14, Spark should be able to
 directly
   pull in hive-exec-core.jar
  
   Cheers
  
  
   On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell 
 pwend...@gmail.com
   wrote:
  
It would be great if the hive team can fix that issue. If not,
 we'll
have to continue forking our own version of Hive to change the way
 it
publishes artifacts.
   
- Patrick
   
On Mon, Jul 28, 2014 at 9:34 AM, Ted Yu yuzhih...@gmail.com
 wrote:
 Talked with Owen offline. He confirmed that as of 0.13,
 hive-exec is
still
 uber jar.

 Right now I am facing the following error building against Hive
  0.13.1
   :

 [ERROR] Failed to execute goal on project spark-hive_2.10: Could
 not
 resolve dependencies for project
 org.apache.spark:spark-hive_2.10:jar:1.1.0-SNAPSHOT: The
 following
 artifacts could not be resolved:
 org.spark-project.hive:hive-metastore:jar:0.13.1,
 org.spark-project.hive:hive-exec:jar:0.13.1,
 org.spark-project.hive:hive-serde:jar:0.13.1: Failure to find
 org.spark-project.hive:hive-metastore:jar:0.13.1 in
 http://repo.maven.apache.org/maven2 was cached in the local

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Ted Yu
bq. Either way it's unclear if there is any reason to use reflection to
support multiple versions, instead of just upgrading to Hive 0.13.0

In which Spark release would this Hive upgrade take place?
I agree it is cleaner to upgrade the Hive dependency vs. introducing reflection.

Cheers


On Mon, Jul 28, 2014 at 5:22 PM, Michael Armbrust mich...@databricks.com
wrote:

 A few things:
  - When we upgrade to Hive 0.13.0, Patrick will likely republish the
 hive-exec jar just as we did for 0.12.0
  - Since we have to tie into some pretty low level APIs it is unsurprising
 that the code doesn't just compile out of the box against 0.13.0
  - ScalaReflection is for determining Schema from Scala classes, not
 reflection-based bridge code.  Either way it's unclear if there is any
 reason to use reflection to support multiple versions, instead of just
 upgrading to Hive 0.13.0

 One question I have is, What is the goal of upgrading to hive 0.13.0?  Is
 it purely because you are having problems connecting to newer metastores?
  Are there some features you are hoping for?  This will help me prioritize
 this effort.

 Michael


 On Mon, Jul 28, 2014 at 4:05 PM, Ted Yu yuzhih...@gmail.com wrote:

  I was looking for a class where reflection-related code should reside.
 
  I found this but don't think it is the proper class for bridging
  differences between hive 0.12 and 0.13.1:
 
 
 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
 
  Cheers
 
 
  On Mon, Jul 28, 2014 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   After manually copying hive 0.13.1 jars to local maven repo, I got the
   following errors when building spark-hive_2.10 module :
  
   [ERROR]
  
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:182:
   type mismatch;
found   : String
required: Array[String]
   [ERROR]   val proc: CommandProcessor =
   CommandProcessorFactory.get(tokens(0), hiveconf)
   [ERROR]
  ^
   [ERROR]
  
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:60:
   value getAllPartitionsForPruner is not a member of org.apache.
hadoop.hive.ql.metadata.Hive
   [ERROR] client.getAllPartitionsForPruner(table).toSeq
   [ERROR]^
   [ERROR]
  
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:267:
   overloaded method constructor TableDesc with alternatives:
 (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2:
   Class[_],x$3:
  java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc
   and
 ()org.apache.hadoop.hive.ql.plan.TableDesc
cannot be applied to
 (Class[org.apache.hadoop.hive.serde2.Deserializer],
   Class[(some other)?0(in value tableDesc)(in value tableDesc)],
  Class[?0(in
   value tableDesc)(in   value tableDesc)], java.util.Properties)
   [ERROR]   val tableDesc = new TableDesc(
   [ERROR]   ^
   [WARNING] Class org.antlr.runtime.tree.CommonTree not found -
 continuing
   with a stub.
   [WARNING] Class org.antlr.runtime.Token not found - continuing with a
  stub.
   [WARNING] Class org.antlr.runtime.tree.Tree not found - continuing
 with a
   stub.
   [ERROR]
while compiling:
  
 
 /homes/xx/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
   during phase: typer
library version: version 2.10.4
   compiler version: version 2.10.4
  
   The above shows incompatible changes between 0.12 and 0.13.1
   e.g. the first error corresponds to the following method
   in CommandProcessorFactory :
 public static CommandProcessor get(String[] cmd, HiveConf conf)
  
   Cheers
  
  
   On Mon, Jul 28, 2014 at 1:32 PM, Steve Nunez snu...@hortonworks.com
   wrote:
  
   So, do we have a short-term fix until Hive 0.14 comes out? Perhaps
  adding
    the hive-exec jar to the spark-project repo? It doesn't look like
   there's
   a release date schedule for 0.14.
  
  
  
   On 7/28/14, 10:50, Cheng Lian lian.cs@gmail.com wrote:
  
   Exactly, forgot to mention Hulu team also made changes to cope with
  those
    incompatibility issues, but they said that's relatively easy once the
   re-packaging work is done.
   
   
   On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com
 
  
   wrote:
   
I've heard from Cloudera that there were hive internal changes
  between
0.12 and 0.13 that required code re-writing. Over time it might be
possible for us to integrate with hive using API's that are more
stable (this is the domain of Michael/Cheng/Yin more than me!). It
would be interesting to see what the Hulu folks did.
   
- Patrick
   
On Mon, Jul 28, 2014 at 10:16 AM, Cheng Lian 
 lian.cs@gmail.com
wrote:
 AFAIK, according a recent talk, Hulu team in China has built
 Spark
   SQL
 against Hive 0.13 (or 0.13.1?) successfully. Basically they also
 re-packaged Hive 0.13 as what the Spark team did

Re: subscribe dev list for spark

2014-07-30 Thread Ted Yu
See Mailing list section of:
https://spark.apache.org/community.html


On Wed, Jul 30, 2014 at 6:53 PM, Grace syso...@gmail.com wrote:





Re: failed to build spark with maven for both 1.0.1 and latest master branch

2014-07-31 Thread Ted Yu
The following command succeeded (on Linux) on Spark master checked out this
morning:

mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install

FYI


On Thu, Jul 31, 2014 at 1:36 PM, yao yaosheng...@gmail.com wrote:

 Hi TD,

 I've asked my colleagues to do the same thing but compile still fails.
 However, maven build succeeded once I built it on my personal macbook (with
 the latest MacOS Yosemite). So I guess there might be something wrong in my
 build environment. If anyone has tried to compile Spark using Maven
 under Mavericks, please let me know your result.

 Thanks
 Shengzhe


 On Thu, Jul 31, 2014 at 1:25 AM, Tathagata Das 
 tathagata.das1...@gmail.com
 wrote:

  Does a mvn clean or sbt/sbt clean help?
 
  TD
 
  On Wed, Jul 30, 2014 at 9:25 PM, yao yaosheng...@gmail.com wrote:
   Hi Folks,
  
   Today I am trying to build spark using maven; however, the following
   command failed consistently for both 1.0.1 and the latest master.
  (BTW,
  it
   seems sbt works fine: *sbt/sbt -Dhadoop.version=2.4.0 -Pyarn clean
   assembly)*
  
   Environment: Mac OS Mavericks
   Maven: 3.2.2 (installed by homebrew)
  
  
  
  
    export M2_HOME=/usr/local/Cellar/maven/3.2.2/libexec/
    export PATH=$M2_HOME/bin:$PATH
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  
   Build outputs:
  
   [INFO] Scanning for projects...
   [INFO]
  
 
   [INFO] Reactor Build Order:
   [INFO]
   [INFO] Spark Project Parent POM
   [INFO] Spark Project Core
   [INFO] Spark Project Bagel
   [INFO] Spark Project GraphX
   [INFO] Spark Project ML Library
   [INFO] Spark Project Streaming
   [INFO] Spark Project Tools
   [INFO] Spark Project Catalyst
   [INFO] Spark Project SQL
   [INFO] Spark Project Hive
   [INFO] Spark Project REPL
   [INFO] Spark Project YARN Parent POM
   [INFO] Spark Project YARN Stable API
   [INFO] Spark Project Assembly
   [INFO] Spark Project External Twitter
   [INFO] Spark Project External Kafka
   [INFO] Spark Project External Flume
   [INFO] Spark Project External ZeroMQ
   [INFO] Spark Project External MQTT
   [INFO] Spark Project Examples
   [INFO]
   [INFO]
  
 
   [INFO] Building Spark Project Parent POM 1.0.1
   [INFO]
  
 
   [INFO]
   [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ spark-parent
  ---
   [INFO]
   [INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @
   spark-parent ---
   [INFO]
   [INFO] --- build-helper-maven-plugin:1.8:add-source
 (add-scala-sources) @
   spark-parent ---
   [INFO] Source directory:
   /Users/syao/git/grid/thirdparty/spark/src/main/scala added.
   [INFO]
   [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
   spark-parent ---
   [INFO]
   [INFO] --- scala-maven-plugin:3.1.6:add-source (scala-compile-first) @
   spark-parent ---
   [INFO] Add Test Source directory:
   /Users/syao/git/grid/thirdparty/spark/src/test/scala
   [INFO]
   [INFO] --- scala-maven-plugin:3.1.6:compile (scala-compile-first) @
   spark-parent ---
   [INFO] No sources to compile
   [INFO]
   [INFO] --- build-helper-maven-plugin:1.8:add-test-source
   (add-scala-test-sources) @ spark-parent ---
   [INFO] Test Source directory:
   /Users/syao/git/grid/thirdparty/spark/src/test/scala added.
   [INFO]
   [INFO] --- scala-maven-plugin:3.1.6:testCompile
  (scala-test-compile-first)
   @ spark-parent ---
   [INFO] No sources to compile
   [INFO]
   [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor)
 @
   spark-parent ---
   [INFO]
   [INFO] --- maven-source-plugin:2.2.1:jar-no-fork (create-source-jar) @
   spark-parent ---
   [INFO]
   [INFO]
  
 
   [INFO] Building Spark Project Core 1.0.1
   [INFO]
  
 
   [INFO]
   [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @
 spark-core_2.10
   ---
   [INFO]
   [INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @
   spark-core_2.10 ---
   [INFO]
   [INFO] --- build-helper-maven-plugin:1.8:add-source
 (add-scala-sources) @
   spark-core_2.10 ---
   [INFO] Source directory:
   /Users/syao/git/grid/thirdparty/spark/core/src/main/scala added.
   [INFO]
   [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
   spark-core_2.10 ---
   [INFO]
   [INFO] --- exec-maven-plugin:1.2.1:exec (default) @ spark-core_2.10 ---
   Archive:  lib/py4j-0.8.1-src.zip
 inflating: build/py4j/tests/java_map_test.py
extracting: build/py4j/tests/__init__.py
 inflating: build/py4j/tests/java_gateway_test.py
 inflating: build/py4j/tests/java_callback_test.py
 inflating: build/py4j/tests/java_list_test.py
 

compilation error in Catalyst module

2014-08-06 Thread Ted Yu
I refreshed my workspace.
I got the following error with this command:

mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install

[ERROR] bad symbolic reference. A signature in package.class refers to term
scalalogging
in package com.typesafe which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
package.class.
[ERROR]
/homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
bad symbolic reference. A signature in package.class refers to term slf4j
in value com.typesafe.scalalogging which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
package.class.
[ERROR] package object trees extends Logging {
[ERROR]  ^
[ERROR] two errors found

Has anyone else seen the above ?

Thanks


Re: compilation error in Catalyst module

2014-08-06 Thread Ted Yu
Forgot to do that step.

Now compilation passes.
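
For reference, the sequence that works now is simply the earlier command with
a clean added:

  mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests clean install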


On Wed, Aug 6, 2014 at 1:36 PM, Zongheng Yang zonghen...@gmail.com wrote:

 Hi Ted,

 By refreshing do you mean you have done 'mvn clean'?

 On Wed, Aug 6, 2014 at 1:17 PM, Ted Yu yuzhih...@gmail.com wrote:
  I refreshed my workspace.
  I got the following error with this command:
 
  mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install
 
  [ERROR] bad symbolic reference. A signature in package.class refers to
 term
  scalalogging
  in package com.typesafe which is not available.
  It may be completely missing from the current classpath, or the version
 on
  the classpath might be incompatible with the version used when compiling
  package.class.
  [ERROR]
 
 /homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
  bad symbolic reference. A signature in package.class refers to term slf4j
  in value com.typesafe.scalalogging which is not available.
  It may be completely missing from the current classpath, or the version
 on
  the classpath might be incompatible with the version used when compiling
  package.class.
  [ERROR] package object trees extends Logging {
  [ERROR]  ^
  [ERROR] two errors found
 
  Has anyone else seen the above ?
 
  Thanks



Re: Unit tests in 5 minutes

2014-08-08 Thread Ted Yu
How about using the parallel execution feature of maven-surefire-plugin
(assuming all the tests were made parallel-friendly)?

http://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html

Cheers
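
On the sbt side, the rough equivalent is sbt's built-in parallel suite
execution; the line below is a sketch, not Spark's actual build settings, and
whether the suites tolerate running in parallel is exactly the open question:

  parallelExecution in Test := true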


On Fri, Aug 8, 2014 at 9:14 AM, Sean Owen so...@cloudera.com wrote:

 A common approach is to separate unit tests from integration tests.
 Maven has support for this distinction. I'm not sure it helps a lot
 though, since it only helps you to not run integration tests all the
 time. But lots of Spark tests are integration-test-like and are
 important to run to know a change works.

 I haven't heard of a plugin to run different test suites remotely on
 many machines, but I would not be surprised if it exists.

 The Jenkins servers aren't CPU-bound as far as I can tell. It's that
 the tests spend a lot of time waiting for bits to start up or
 complete. That implies the existing tests could be sped up by just
 running in parallel locally. I recall someone recently proposed this?

 And I think the problem with that is simply that some of the tests
 collide with each other, by opening up the same port at the same time
 for example. I know that kind of problem is being attacked even right
 now. But if all the tests were made parallel friendly, I imagine
 parallelism could be enabled and speed up builds greatly without any
 remote machines.


 On Fri, Aug 8, 2014 at 5:01 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Howdy,
 
  Do we think it's both feasible and worthwhile to invest in getting our
 unit
  tests to finish in under 5 minutes (or something similarly brief) when
 run
  by Jenkins?
 
  Unit tests currently seem to take anywhere from 30 min to 2 hours. As
  people add more tests, I imagine this time will only grow. I think it
 would
  be better for both contributors and reviewers if they didn't have to wait
  so long for test results; PR reviews would be shorter, if nothing else.
 
  I don't know how how this is normally done, but maybe it wouldn't be too
  much work to get a test cycle to feel lighter.
 
  Most unit tests are independent and can be run concurrently, right? Would
  it make sense to build a given patch on many servers at once and send
  disjoint sets of unit tests to each?
 
  I'd be interested in working on something like that if possible (and
  sensible).
 
  Nick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Ted Yu
Hi,
Using the following command on (refreshed) master branch:
mvn clean package -DskipTests

I got:

constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
---
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference.
A signature in TestSuiteBase.class refers to term dstream
in package org.apache.spark.streaming which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
TestSuiteBase.class.
at
scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
at
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
at scala.collection.immutable.List.forall(List.scala:84)
at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(Types.scala:4280)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.scala:296)
at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.Message$$anon$1.apply(Message.scala:8)
at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at scala.collection.immutable.List.foreach(List.scala:318)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:138)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:139)
at xsbt.API$ApiPhase.processScalaUnit(API.scala:54)
at xsbt.API$ApiPhase.processUnit(API.scala:38)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at xsbt.API$ApiPhase.run(API.scala:34)
at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 

Re: Dependency hell in Spark applications

2014-09-05 Thread Ted Yu
From output of dependency:tree:

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
spark-streaming_2.10 ---
[INFO] org.apache.spark:spark-streaming_2.10:jar:1.1.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile
...
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.9.0:compile
[INFO] |  |  +- commons-codec:commons-codec:jar:1.5:compile
[INFO] |  |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  |  +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile

bq. excluding httpclient from spark-streaming dependency in your sbt/maven
project

This should work.
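
In sbt the exclusion would look roughly like this (the Spark version shown is
illustrative):

  libraryDependencies += ("org.apache.spark" %% "spark-streaming" % "1.1.0")
    .exclude("org.apache.httpcomponents", "httpclient")

  // then pin the version the AWS SDK expects
  libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.2"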


On Fri, Sep 5, 2014 at 3:14 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 If httpClient dependency is coming from Hive, you could build Spark without
 Hive. Alternatively, have you tried excluding httpclient from
 spark-streaming dependency in your sbt/maven project?

 TD



 On Thu, Sep 4, 2014 at 6:42 AM, Koert Kuipers ko...@tresata.com wrote:

  custom spark builds should not be the answer. at least not if spark ever
  wants to have a vibrant community for spark apps.
 
  spark does support a user-classpath-first option, which would deal with
  some of these issues, but I don't think it works.
  On Sep 4, 2014 9:01 AM, Felix Garcia Borrego fborr...@gilt.com
 wrote:
 
   Hi,
   I run into the same issue and apart from the ideas Aniket said, I only
   could find a nasty workaround. Add my custom
  PoolingClientConnectionManager
   to my classpath.
  
  
  
 
 http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi/25488955#25488955
  
  
  
   On Thu, Sep 4, 2014 at 11:43 AM, Sean Owen so...@cloudera.com wrote:
  
Dumb question -- are you using a Spark build that includes the
 Kinesis
dependency? that build would have resolved conflicts like this for
you. Your app would need to use the same version of the Kinesis
 client
SDK, ideally.
   
All of these ideas are well-known, yes. In cases of super-common
dependencies like Guava, they are already shaded. This is a
less-common source of conflicts so I don't think http-client is
shaded, especially since it is not used directly by Spark. I think
this is a case of your app conflicting with a third-party dependency?
   
I think OSGi is deemed too over the top for things like this.
   
On Thu, Sep 4, 2014 at 11:35 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
 I am trying to use Kinesis as source to Spark Streaming and have
 run
into a
 dependency issue that can't be resolved without making my own
 custom
Spark
 build. The issue is that Spark is transitively dependent
 on org.apache.httpcomponents:httpclient:jar:4.1.2 (I think because
 of
 libfb303 coming from hbase and hive-serde) whereas AWS SDK is
  dependent
 on org.apache.httpcomponents:httpclient:jar:4.2. When I package and
  run
 Spark Streaming application, I get the following:

 Caused by: java.lang.NoSuchMethodError:

   
  
 
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
 at

   
  
 
 org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
 at

   
  
 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114)
 at

   
  
 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99)
 at

   
  
 
 com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
 at

   
  
 
 com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
 at

 com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181)
 at

   
  
 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119)
 at

   
  
 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103)
 at

   
  
 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136)
 at

   
  
 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117)
 at

   
  
 
 com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132)

 I can create a custom Spark build with
 org.apache.httpcomponents:httpclient:jar:4.2 included in the
 assembly
but I
 was wondering if this is something Spark devs have noticed and are
looking
 to resolve in near releases. Here are my thoughts on this issue:

 Containers that allow running custom user code have to often
 resolve
 dependency 

BasicOperationsSuite failing ?

2014-09-29 Thread Ted Yu
Hi,
Running the test suite on trunk, I got:

BasicOperationsSuite:
- map
- flatMap
- filter
- glom
- mapPartitions
- repartition (more partitions)
- repartition (fewer partitions)
- groupByKey
- reduceByKey
- reduce
- count
- countByValue
- mapValues
- flatMapValues
- union
- StreamingContext.union
- transform
- transformWith
- StreamingContext.transform
- cogroup
- join
- leftOuterJoin
- rightOuterJoin
- fullOuterJoin
- updateStateByKey
- updateStateByKey - object lifecycle
- slice
- slice - has not been initialized
- rdd cleanup - map and window
- rdd cleanup - updateStateByKey
- rdd cleanup - input blocks and persisted RDDs *** FAILED ***
  org.scalatest.exceptions.TestFailedException was thrown.
  (BasicOperationsSuite.scala:528)

However, using sbt for this test suite, it seemed to pass:

[info] - slice - has not been initialized
[info] - rdd cleanup - map and window
[info] - rdd cleanup - updateStateByKey
Exception in thread Thread-561 org.apache.spark.SparkException: Job
cancelled because SparkContext was shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:700)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:700)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1406)
at
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
at akka.actor.ActorCell.terminate(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[info] - rdd cleanup - input blocks and persisted RDDs
[info] ScalaTest
[info] Run completed in 1 minute, 1 second.
[info] Total number of tests run: 31
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 31, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 31, Failed 0, Errors 0, Passed 31
java.lang.AssertionError: assertion failed: List(object package$DebugNode,
object package$DebugNode)
at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
at
scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
at
scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:2991)
at
scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1371)
at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:120)
at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
at
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:99)
at
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:99)
at
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:99)
at
sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:166)
at

Re: Extending Scala style checks

2014-10-01 Thread Ted Yu
Please take a look at WhitespaceEndOfLineChecker under:
http://www.scalastyle.org/rules-0.1.0.html
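
For reference, turning that rule on in scalastyle-config.xml should just be one
more check entry along these lines (a sketch; the checker class name is the one
listed on that page):

  <check level="error" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="true"/>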

Cheers

On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 As discussed here https://github.com/apache/spark/pull/2619, it would be
 good to extend our Scala style checks to programmatically enforce as many
 of our style rules as possible.

 Does anyone know if it's relatively straightforward to enforce additional
 rules like the no trailing spaces rule mentioned in the linked PR?

 Nick



Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread Ted Yu
I performed build on latest master branch but didn't get compilation error.

FYI

On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi,

 I just submitted a patch https://github.com/apache/spark/pull/2864/files
 with one line change

 but the Jenkins told me it's failed to compile on the unrelated files?


 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console


 Best,

 Nan



Re: scalastyle annoys me a little bit

2014-10-23 Thread Ted Yu
Koert:
Have you tried adding the following on your commandline ?

-Dscalastyle.failOnViolation=false

Cheers

On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was because of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to come
 back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: scalastyle annoys me a little bit

2014-10-23 Thread Ted Yu
Koert:
If you have time, you can try this diff - with which you would be able to
specify the following on the command line:
-Dscalastyle.failonviolation=false

diff --git a/pom.xml b/pom.xml
index 687cc63..108585e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -123,6 +123,7 @@
     <log4j.version>1.2.17</log4j.version>
     <hadoop.version>1.0.4</hadoop.version>
     <protobuf.version>2.4.1</protobuf.version>
+    <scalastyle.failonviolation>true</scalastyle.failonviolation>
     <yarn.version>${hadoop.version}</yarn.version>
     <hbase.version>0.94.6</hbase.version>
     <flume.version>1.4.0</flume.version>
@@ -1071,7 +1072,7 @@
         <version>0.4.0</version>
         <configuration>
           <verbose>false</verbose>
-          <failOnViolation>true</failOnViolation>
+          <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
           <includeTestSourceDirectory>false</includeTestSourceDirectory>
           <failOnWarning>false</failOnWarning>
           <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
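
Note that the new property is all lower-case, so with this diff applied the
command line would be, for example:

  mvn clean package -DskipTests -Dscalastyle.failonviolation=false

(the earlier -Dscalastyle.failOnViolation=false spelling would not match the
property name defined above).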



On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote:

 Hey Ted,
 i tried:
 mvn clean package -DskipTests -Dscalastyle.failOnViolation=false

 no luck, still get
 [ERROR] Failed to execute goal
 org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project
 spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle
 violation(s). - [Help 1]


 On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 Have you tried adding the following on your commandline ?

 -Dscalastyle.failOnViolation=false

 Cheers

 On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com
 wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was because
 of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors
 somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code
 that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to come
 back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: scalastyle annoys me a little bit

2014-10-23 Thread Ted Yu
Created SPARK-4066 and attached patch there.

On Thu, Oct 23, 2014 at 1:07 PM, Koert Kuipers ko...@tresata.com wrote:

 great thanks i will do that

 On Thu, Oct 23, 2014 at 3:55 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 If you have time, you can try this diff - with which you would be able to
 specify the following on the command line:
 -Dscalastyle.failonviolation=false

 diff --git a/pom.xml b/pom.xml
 index 687cc63..108585e 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -123,6 +123,7 @@
      <log4j.version>1.2.17</log4j.version>
      <hadoop.version>1.0.4</hadoop.version>
      <protobuf.version>2.4.1</protobuf.version>
 +    <scalastyle.failonviolation>true</scalastyle.failonviolation>
      <yarn.version>${hadoop.version}</yarn.version>
      <hbase.version>0.94.6</hbase.version>
      <flume.version>1.4.0</flume.version>
 @@ -1071,7 +1072,7 @@
          <version>0.4.0</version>
          <configuration>
            <verbose>false</verbose>
 -          <failOnViolation>true</failOnViolation>
 +          <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
            <includeTestSourceDirectory>false</includeTestSourceDirectory>
            <failOnWarning>false</failOnWarning>
            <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>



 On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com
 wrote:

 Hey Ted,
 i tried:
 mvn clean package -DskipTests -Dscalastyle.failOnViolation=false

 no luck, still get
 [ERROR] Failed to execute goal
 org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project
 spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle
 violation(s). - [Help 1]


 On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 Have you tried adding the following on your commandline ?

 -Dscalastyle.failOnViolation=false

 Cheers

 On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com
 wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh
 with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was
 because of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors
 somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code
 that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to
 come back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org








Re: create_image.sh contains broken hadoop web link

2014-11-05 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/LgpTk2Pnw6O/andrew+apache+mirrorsubj=Re+All+mirrored+download+links+from+the+Apache+Hadoop+site+are+broken

Cheers

On Wed, Nov 5, 2014 at 7:36 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 As part of my work for SPARK-3821
 https://issues.apache.org/jira/browse/SPARK-3821, I tried building an
 AMI
 today using create_image.sh.

 This line
 
 https://github.com/mesos/spark-ec2/blob/f6773584dd71afc49f1225be48439653313c0341/create_image.sh#L68
 
 appears to be broken now (it wasn’t a week or so ago).

 This link appears to be broken:

 http://apache.mirrors.tds.net/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

 Is this temporary? Should we update this to something else?

 Nick
 ​



Re: create_image.sh contains broken hadoop web link

2014-11-05 Thread Ted Yu
The artifacts are in archive:
http://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/
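
One possible fix is to point that line of create_image.sh at the archive
instead of a mirror, keeping the same file name as the broken link, e.g.
(assuming the script fetches the tarball with wget):

  wget http://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz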

Cheers

On Nov 5, 2014, at 8:07 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

 Nope, thanks for pointing me to it.
 
 Doesn't look like there is a resolution to the issue. Also, the link you
 pointed to appears to be broken now:
 http://apache.mesi.com.ar/hadoop/common/
 
 Nick
 
 On Wed, Nov 5, 2014 at 10:43 PM, Ted Yu yuzhih...@gmail.com wrote:
 Have you seen this thread ?
 
 http://search-hadoop.com/m/LgpTk2Pnw6O/andrew+apache+mirrorsubj=Re+All+mirrored+download+links+from+the+Apache+Hadoop+site+are+broken
 
 Cheers
 
 On Wed, Nov 5, 2014 at 7:36 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 As part of my work for SPARK-3821
 https://issues.apache.org/jira/browse/SPARK-3821, I tried building an AMI
 today using create_image.sh.
 
 This line
 https://github.com/mesos/spark-ec2/blob/f6773584dd71afc49f1225be48439653313c0341/create_image.sh#L68
 appears to be broken now (it wasn’t a week or so ago).
 
 This link appears to be broken:
 http://apache.mirrors.tds.net/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
 
 Is this temporary? Should we update this to something else?
 
 Nick
 


Re: Has anyone else observed this build break?

2014-11-15 Thread Ted Yu
Sorry for the late reply.

I tested my patch on Mac with the following JDK:

java version 1.7.0_60
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

Let me see if the problem can be solved upstream in HBase hbase-annotations
module.

Cheers

On Fri, Nov 14, 2014 at 12:32 PM, Patrick Wendell pwend...@gmail.com
wrote:

 I think in this case we can probably just drop that dependency, so
 there is a simpler fix. But mostly I'm curious whether anyone else has
 observed this.

 On Fri, Nov 14, 2014 at 12:24 PM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:
  Seems like a comment on that page mentions a fix, which would add yet
  another profile though -- specifically telling mvn that if it is an apple
  jdk, use the classes.jar as the tools.jar as well, since Apple-packaged
 JDK
  6 bundled them together.
 
  Link:
  http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320
 
  I didn't test it, but maybe this can fix it?
 
  Thanks,
  Hari
 
 
  On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  A work around for this fix is identified here:
 
 
 http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
 
  However, if this affects more users I'd prefer to just fix it properly
  in our build.
 
  On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   A recent patch broke clean builds for me, I am trying to see how
   widespread this issue is and whether we need to revert the patch.
  
   The error I've seen is this when building the examples project:
  
   spark-examples_2.10: Could not resolve dependencies for project
   org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
   find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
  
  
 /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
  
   The reason for this error is that hbase-annotations is using a
   system scoped dependency in their hbase-annotations pom, and this
   doesn't work with certain JDK layouts such as that provided on Mac OS:
  
  
  
 http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
  
   Has anyone else seen this or is it just me?
  
   - Patrick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Has anyone else observed this build break?

2014-11-15 Thread Ted Yu
I couldn't reproduce the problem using:

java version 1.6.0_65
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

Since hbase-annotations is a transitive dependency, I created the following
pull request to exclude it from various hbase modules:
https://github.com/apache/spark/pull/3286

Cheers

https://github.com/apache/spark/pull/3286

On Sat, Nov 15, 2014 at 6:56 AM, Ted Yu yuzhih...@gmail.com wrote:

 Sorry for the late reply.

 I tested my patch on Mac with the following JDK:

 java version 1.7.0_60
 Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
 Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

 Let me see if the problem can be solved upstream in HBase hbase-annotations
 module.

 Cheers

 On Fri, Nov 14, 2014 at 12:32 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 I think in this case we can probably just drop that dependency, so
 there is a simpler fix. But mostly I'm curious whether anyone else has
 observed this.

 On Fri, Nov 14, 2014 at 12:24 PM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:
  Seems like a comment on that page mentions a fix, which would add yet
  another profile though -- specifically telling mvn that if it is an
 apple
  jdk, use the classes.jar as the tools.jar as well, since Apple-packaged
 JDK
  6 bundled them together.
 
  Link:
  http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320
 
  I didn't test it, but maybe this can fix it?
 
  Thanks,
  Hari
 
 
  On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  A work around for this fix is identified here:
 
 
 http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
 
  However, if this affects more users I'd prefer to just fix it properly
  in our build.
 
  On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   A recent patch broke clean builds for me, I am trying to see how
   widespread this issue is and whether we need to revert the patch.
  
   The error I've seen is this when building the examples project:
  
   spark-examples_2.10: Could not resolve dependencies for project
   org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
   find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
  
  
 /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
  
   The reason for this error is that hbase-annotations is using a
   system scoped dependency in their hbase-annotations pom, and this
   doesn't work with certain JDK layouts such as that provided on Mac
 OS:
  
  
  
 http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
  
   Has anyone else seen this or is it just me?
  
   - Patrick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 





Re: How spark and hive integrate in long term?

2014-11-21 Thread Ted Yu
bq. spark-0.12 also has some nice feature added

Minor correction: you meant Spark 1.2.0 I guess

Cheers

On Fri, Nov 21, 2014 at 3:45 PM, Zhan Zhang zzh...@hortonworks.com wrote:

 Thanks Dean, for the information.

 Hive-on-spark is nice. Spark sql has the advantage to take the full
 advantage of spark and allows user to manipulate the table as RDD through
 native spark support.

 When I tried to upgrade the current hive-0.13.1 support to hive-0.14.0. I
 found the hive parser is not compatible any more. In the meantime, those
 new feature introduced in hive-0.14.1, e.g, ACID, etc, is not there yet. In
 the meantime, spark-0.12 also
 has some nice feature added which is supported by thrift-server too, e.g.,
 hive-0.13, table cache, etc.

 Given that both have more and more features added, it would be great if
 user can take advantage of both. Current, spark sql give us such benefits
 partially, but I am wondering how to keep such integration in long term.

 Thanks.

 Zhan Zhang

 On Nov 21, 2014, at 3:12 PM, Dean Wampler deanwamp...@gmail.com wrote:

  I can't comment on plans for Spark SQL's support for Hive, but several
  companies are porting Hive itself onto Spark:
 
 
 http://blog.cloudera.com/blog/2014/11/apache-hive-on-apache-spark-the-first-demo/
 
  I'm not sure if they are leveraging the old Shark code base or not, but
 it
  appears to be a fresh effort.
 
  dean
 
  Dean Wampler, Ph.D.
  Author: Programming Scala, 2nd Edition
  http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
  Typesafe http://typesafe.com
  @deanwampler http://twitter.com/deanwampler
  http://polyglotprogramming.com
 
  On Fri, Nov 21, 2014 at 2:51 PM, Zhan Zhang zhaz...@gmail.com wrote:
 
  Now Spark and hive integration is a very nice feature. But I am
 wondering
  what the long term roadmap is for spark integration with hive. Both of
  these
  two projects are undergoing fast improvement and changes. Currently, my
  understanding is that spark hive sql part relies on hive meta store and
  basic parser to operate, and the thrift-server intercept hive query and
  replace it with its own engine.
 
  With every release of hive, there need a significant effort on spark
 part
  to
  support it.
 
  For the metastore part, we may possibly replace it with hcatalog. But
 given
  the dependency of other parts on hive, e.g., metastore, thriftserver,
  hcatlog may not be able to help much.
 
  Does anyone have any insight or idea in mind?
 
  Thanks.
 
  Zhan Zhang
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Required file not found in building

2014-12-01 Thread Ted Yu
I tried the same command on MacBook and didn't experience the same error.

Which OS are you using ?

Cheers

On Mon, Dec 1, 2014 at 6:42 PM, Stephen Boesch java...@gmail.com wrote:

 It seems there are some additional settings required to build Spark now.
 This should be a snap for most of you out there to spot what I am missing.
 Here is the command line I have traditionally used:

mvn -Pyarn -Phadoop-2.3 -Phive install compile package -DskipTests

 That command line is, however, failing with the latest from HEAD:

 INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-network-common_2.10 ---
 [INFO] Using zinc server for incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

 *[error] Required file not found: scala-compiler-2.10.4.jar*

 *[error] See zinc -help for information about locating necessary files*

 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. SUCCESS [4.077s]
 [INFO] Spark Project Networking .. FAILURE [0.445s]


 OK let's try zinc -help:

 18:38:00/spark2 $*zinc -help*
 Nailgun server running with 1 cached compiler

 Version = 0.3.5.1

 Zinc compiler cache limit = 5
 Resident scalac cache limit = 0
 Analysis cache limit = 5

 Compiler(Scala 2.10.4) [74ff364f]
 Setup = {
 *   scala compiler =

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar*
scala library =

 /Users/steve/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
scala extra = {


 /Users/steve/.m2/repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
   /shared/zinc-0.3.5.1/lib/scala-reflect.jar
}
sbt interface = /shared/zinc-0.3.5.1/lib/sbt-interface.jar
compiler interface sources =
 /shared/zinc-0.3.5.1/lib/compiler-interface-sources.jar
java home =
fork java = false
cache directory = /Users/steve/.zinc/0.3.5.1
 }

 Does that compiler jar exist?  Yes!

 18:39:34/spark2 $ll

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
 -rw-r--r--  1 steve  staff  14445780 Apr  9  2014

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar



Re: Required file not found in building

2014-12-01 Thread Ted Yu
I used the following for brew:
http://repo.typesafe.com/typesafe/zinc/com/typesafe/zinc/dist/0.3.0/zinc-0.3.0.tgz

After starting zinc, I issued the same mvn command but didn't encounter the
error you saw.

FYI

On Mon, Dec 1, 2014 at 8:18 PM, Stephen Boesch java...@gmail.com wrote:

 The zinc src zip for  0.3.5.3 was  downloaded  and exploded. Then I  ran
 sbt dist/create .  zinc is being launched from
 dist/target/zinc-0.3.5.3/bin/zinc

 2014-12-01 20:12 GMT-08:00 Ted Yu yuzhih...@gmail.com:

 I use zinc 0.2.0 and started zinc with the same command shown below.

 I don't observe such error.

 How did you install zinc-0.3.5.3 ?

 Cheers

 On Mon, Dec 1, 2014 at 8:00 PM, Stephen Boesch java...@gmail.com wrote:


 Anyone maybe can assist on how to run zinc with the latest maven build?

 I am starting zinc as follows:

 /shared/zinc-0.3.5.3/dist/target/zinc-0.3.5.3/bin/zinc -scala-home
 $SCALA_HOME -nailed -start

 The pertinent env vars are:


 19:58:11/lib $echo $SCALA_HOME
 /shared/scala
 19:58:14/lib $which scala
 /shared/scala/bin/scala
 19:58:16/lib $scala -version
 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL


 When I do *not *start zinc then the maven build works .. but v slowly
 since no incremental compiler available.

 When zinc is started as shown above then the error occurs on all of the
 modules except parent:


 [INFO] Using zinc server for incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
 [error] Required file not found: scala-compiler-2.10.4.jar
 [error] See zinc -help for information about locating necessary files

 2014-12-01 19:02 GMT-08:00 Stephen Boesch java...@gmail.com:

 Mac as well.  Just found the problem:  I had created an alias to zinc a
 couple of months back. Apparently that is not happy with the build anymore.
 No problem now that the issue has been isolated - just need to fix my zinc
 alias.

 2014-12-01 18:55 GMT-08:00 Ted Yu yuzhih...@gmail.com:

 I tried the same command on MacBook and didn't experience the same
 error.

 Which OS are you using ?

 Cheers

 On Mon, Dec 1, 2014 at 6:42 PM, Stephen Boesch java...@gmail.com
 wrote:

  It seems there are some additional settings required to build Spark now.
  This should be a snap for most of you out there to spot what I am missing.
 Here is the command line I have traditionally used:

mvn -Pyarn -Phadoop-2.3 -Phive install compile package -DskipTests

  That command line is, however, failing with the latest from HEAD:

 INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-network-common_2.10 ---
 [INFO] Using zinc server for incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

 *[error] Required file not found: scala-compiler-2.10.4.jar*

 *[error] See zinc -help for information about locating necessary
 files*

 [INFO]

 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. SUCCESS
 [4.077s]
 [INFO] Spark Project Networking .. FAILURE
 [0.445s]


 OK let's try zinc -help:

 18:38:00/spark2 $*zinc -help*
 Nailgun server running with 1 cached compiler

 Version = 0.3.5.1

 Zinc compiler cache limit = 5
 Resident scalac cache limit = 0
 Analysis cache limit = 5

 Compiler(Scala 2.10.4) [74ff364f]
 Setup = {
 *   scala compiler =

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar*
scala library =

 /Users/steve/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
scala extra = {


 /Users/steve/.m2/repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
   /shared/zinc-0.3.5.1/lib/scala-reflect.jar
}
sbt interface = /shared/zinc-0.3.5.1/lib/sbt-interface.jar
compiler interface sources =
 /shared/zinc-0.3.5.1/lib/compiler-interface-sources.jar
java home =
fork java = false
cache directory = /Users/steve/.zinc/0.3.5.1
 }

 Does that compiler jar exist?  Yes!

 18:39:34/spark2 $ll

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
 -rw-r--r--  1 steve  staff  14445780 Apr  9  2014

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar









Re: Unit tests in 5 minutes

2014-12-04 Thread Ted Yu
Have you seen this thread http://search-hadoop.com/m/JW1q5xxSAa2 ?

Test categorization in HBase is done through maven-surefire-plugin
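
As a rough sketch (the category name here is made up), surefire can fork
several JVMs and restrict a run to one JUnit category along these lines:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
      <forkCount>4</forkCount>
      <reuseForks>true</reuseForks>
      <groups>org.example.SmallTests</groups>
    </configuration>
  </plugin>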

Cheers

On Thu, Dec 4, 2014 at 4:05 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 fwiw, when we did this work in HBase, we categorized the tests. Then some
 tests can share a single jvm, while some others need to be isolated in
 their own jvm. Nevertheless surefire can still run them in parallel by
 starting/stopping several jvm.

 I think we need to do this as well. Perhaps the test naming hierarchy can
 be used to group non-parallelizable tests in the same JVM.

 For example, here are some Hive tests from our project:

 org.apache.spark.sql.hive.StatisticsSuite
 org.apache.spark.sql.hive.execution.HiveQuerySuite
 org.apache.spark.sql.QueryTest
 org.apache.spark.sql.parquet.HiveParquetSuite

 If we group tests by the first 5 parts of their name (e.g.
 org.apache.spark.sql.hive), then we’d have the first 2 tests run in the
 same JVM, and the next 2 tests each run in their own JVM.

 I’m new to this stuff so I’m not sure if I’m going about this in the right
 way, but you can see my attempt with this approach on GitHub
 https://github.com/nchammas/spark/blob/ab127b798dbfa9399833d546e627f9651b060918/project/SparkBuild.scala#L388-L397,
 as well as the related discussion on JIRA
 https://issues.apache.org/jira/browse/SPARK-3431.

 If anyone has more feedback on this, I’d love to hear it (either on this
 thread or in the JIRA issue).

 Nick
 ​

 On Sun Sep 07 2014 at 8:28:51 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 On Fri, Aug 8, 2014 at 1:12 PM, Reynold Xin r...@databricks.com wrote:

 Nick,

 Would you like to file a ticket to track this?


 SPARK-3431 https://issues.apache.org/jira/browse/SPARK-3431:
 Parallelize execution of tests
  Sub-task: SPARK-3432 https://issues.apache.org/jira/browse/SPARK-3432:
 Fix logging of unit test execution time

 Nick




Re: Unit tests in 5 minutes

2014-12-06 Thread Ted Yu
bq. I may move on to trying Maven.

Maven is my favorite :-)

On Sat, Dec 6, 2014 at 10:54 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Ted,

 I posted some updates
 https://issues.apache.org/jira/browse/SPARK-3431?focusedCommentId=14236540page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14236540
  on
 JIRA on my progress (or lack thereof) getting SBT to parallelize test
 suites properly. I'm currently stuck with SBT / ScalaTest, so I may move on
 to trying Maven.

 Andrew,

 Once we have a basic grasp of how to parallelize some of the tests, the
 next step will probably be to use containers (i.e. Docker) to allow more
 parallelization, especially for those tests that, for example, contend for
 ports.

 Nick

 On Fri Dec 05 2014 at 2:05:29 PM Andrew Or and...@databricks.com wrote:

 @Patrick and Josh actually we went even further than that. We simply
 disable the UI for most tests and these used to be the single largest
 source of port conflict.




Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-19 Thread Ted Yu
Andy:
I saw two emails from you from yesterday.

See this thread: http://search-hadoop.com/m/JW1q5opRsY1

Cheers

On Fri, Dec 19, 2014 at 12:51 PM, Andy Konwinski andykonwin...@gmail.com
wrote:

 Yesterday, I changed the domain name in the mailing list archive settings
 to remove .incubator so maybe it'll work now.

 However, I also sent two emails about this through the nabble interface
 (in this same thread) yesterday and they don't appear to have made it
 through so not sure if it actually worked after all.

 Andy

 On Wed, Dec 17, 2014 at 1:09 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Yeah, it looks like messages that are successfully posted via Nabble end
 up on the Apache mailing list, but messages posted directly to Apache
 aren't mirrored to Nabble anymore because it's based off the incubator
 mailing list.  We should fix this so that Nabble posts to / archives the
 non-incubator list.

 On Sat, Dec 13, 2014 at 6:27 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Since you mentioned this, I had a related quandry recently -- it also
 says that the forum archives *u...@spark.incubator.apache.org
 u...@spark.incubator.apache.org/* *d...@spark.incubator.apache.org
 d...@spark.incubator.apache.org *respectively, yet the Community
 page clearly says to email the @spark.apache.org list (but the nabble
 archive is linked right there too). IMO even putting a clear explanation at
 the top

 Posting here requires that you create an account via the UI. Your
 message will be sent to both spark.incubator.apache.org and
 spark.apache.org (if that is the case, i'm not sure which alias nabble
 posts get sent to) would make things a lot more clear.

 On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com
 wrote:

 I've noticed that several users are attempting to post messages to
 Spark's user / dev mailing lists using the Nabble web UI (
 http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
 are many posts in Nabble that are not posted to the Apache lists and are
 flagged with This post has NOT been accepted by the mailing list yet.
 errors.

 I suspect that the issue is that users are not completing the sign-up
 confirmation process (
 http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
 which is preventing their emails from being accepted by the mailing list.

 I wanted to mention this issue to the Spark community to see whether
 there are any good solutions to address this.  I have spoken to users who
 think that our mailing list is unresponsive / inactive because their
 un-posted messages haven't received any replies.

 - Josh




Re: Assembly jar file name does not match profile selection

2014-12-26 Thread Ted Yu
Can you try this command ?

sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive assembly
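
The -hadoopX.Y.Z suffix in the assembly file name is taken from the
hadoop.version property (1.0.4 by default, which matches the name you are
seeing), so passing -Dhadoop.version explicitly as above should also make the
file name agree with the libraries that end up inside. A quick check after the
build:

  ls assembly/target/scala-2.10/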

On Fri, Dec 26, 2014 at 6:15 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 I am building spark with sbt off of branch 1.2. I'm using the following
 command:

 sbt/sbt -Pyarn -Phadoop-2.3 assembly

 (http://spark.apache.org/docs/latest/building-spark.html#building-with-sbt
 )

 Although the jar file I obtain does contain the proper version of the
 hadoop libraries (v. 2.4), the assembly jar file name refers to hadoop
 v.1.0.4:

 ./assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar

 Any idea why?


 Alex



Re: Why the major.minor version of the new hive-exec is 51.0?

2014-12-30 Thread Ted Yu
I extracted org/apache/hadoop/hive/common/CompressionUtils.class from the
jar and used hexdump to view the class file.
Bytes 6 and 7 are 00 and 33, respectively.

According to http://en.wikipedia.org/wiki/Java_class_file, the jar was
produced using Java 7.
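
For reference, that check can be reproduced from the command line (the jar path
is assumed to be the artifact from the local Maven repository):

  # bytes 4-5 of a class file are the minor version, bytes 6-7 the major (0x33 = 51)
  unzip -p hive-exec-0.12.0-protobuf-2.5.jar \
    org/apache/hadoop/hive/common/CompressionUtils.class | hexdump -C | head -n 1

  # or let javap report it directly
  javap -verbose -classpath hive-exec-0.12.0-protobuf-2.5.jar \
    org.apache.hadoop.hive.common.CompressionUtils | grep 'major version'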

FYI

On Tue, Dec 30, 2014 at 8:09 PM, Shixiong Zhu zsxw...@gmail.com wrote:

 The major.minor version of the new org.spark-project.hive.hive-exec is
 51.0, so it will require people use JDK7. Is it intentional?

 <dependency>
   <groupId>org.spark-project.hive</groupId>
   <artifactId>hive-exec</artifactId>
   <version>0.12.0-protobuf-2.5</version>
 </dependency>

 You can use the following steps to reproduce it (Need to use JDK6):

 1. Create a Test.java file with the following content:

 public class Test {

 public static void main(String[] args) throws Exception{
Class.forName(org.apache.hadoop.hive.conf.HiveConf);
 }

 }

 2. javac Test.java
 3. java -classpath

 ~/.m2/repository/org/spark-project/hive/hive-exec/0.12.0-protobuf-2.5/hive-exec-0.12.0-protobuf-2.5.jar:.
 Test

 Exception in thread main java.lang.UnsupportedClassVersionError:
 org/apache/hadoop/hive/conf/HiveConf : Unsupported major.minor version 51.0
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
 at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:169)
 at Test.main(Test.java:5)


 Best Regards,
 Shixiong Zhu



Re: Welcoming three new committers

2015-02-03 Thread Ted Yu
Congratulations, Cheng, Joseph and Sean.

On Tue, Feb 3, 2015 at 2:53 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Congratulations guys!

 On Tue Feb 03 2015 at 2:36:12 PM Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Hi all,
 
  The PMC recently voted to add three new committers: Cheng Lian, Joseph
  Bradley and Sean Owen. All three have been major contributors to Spark in
  the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and
 many
  pieces throughout Spark Core. Join me in welcoming them as committers!
 
  Matei
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Standardized Spark dev environment

2015-01-20 Thread Ted Yu
How many profiles (hadoop / hive /scala) would this development environment
support ?

Cheers

On Tue, Jan 20, 2015 at 4:13 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 What do y'all think of creating a standardized Spark development
 environment, perhaps encoded as a Vagrantfile, and publishing it under
 `dev/`?

 The goal would be to make it easier for new developers to get started with
 all the right configs and tools pre-installed.

 If we use something like Vagrant, we may even be able to make it so that a
 single Vagrantfile creates equivalent development environments across OS X,
 Linux, and Windows, without having to do much (or any) OS-specific work.

 I imagine for committers and regular contributors, this exercise may seem
 pointless, since y'all are probably already very comfortable with your
 workflow.

 I wonder, though, if any of you think this would be worthwhile as a
 improvement to the new Spark developer experience.

 Nick



Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver

2015-01-18 Thread Ted Yu
Please take a look at SPARK-4048 and SPARK-5108
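
To see which module pulls in the older jackson-mapper-asl and from where, a
dependency tree scoped to that artifact can help (run from the Spark source
root with whatever profiles you build with), e.g.:

  mvn -Dhadoop.version=2.6.0 dependency:tree -Dincludes=org.codehaus.jackson:jackson-mapper-asl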

Cheers

On Sat, Jan 17, 2015 at 10:26 PM, Gil Vernik g...@il.ibm.com wrote:

 Hi,

 I took a source code of Spark 1.2.0 and tried to build it together with
 hadoop-openstack.jar ( To allow Spark an access to OpenStack Swift )
 I used Hadoop 2.6.0.

 The build was fine without problems, however in run time, while trying to
 access swift:// name space i got an exception:
 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
  at

 org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
  at

 org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
 ...and the long stack trace goes here

 Digging into the problem i saw the following:
 Jackson versions 1.9.X are not backward compatible, in particular they
 removed JsonClass annotation.
 Hadoop 2.6.0 uses jackson-asl version 1.9.13, while Spark has reference to
 older version of jackson.

 This is the main  pom.xml of Spark 1.2.0 :

   <dependency>
     <!-- Matches the version of jackson-core-asl pulled in by avro -->
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>1.8.8</version>
   </dependency>

 Referencing 1.8.8 version, which is not compatible with Hadoop 2.6.0 .
 If we change version to 1.9.13, than all will work fine and there will be
 no run time exceptions while accessing Swift. The following change will
 solve the problem:

   <dependency>
     <!-- Matches the version of jackson-core-asl pulled in by avro -->
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>1.9.13</version>
   </dependency>

 I am trying to resolve this somehow so people will not get into this
 issue.
 Is there any particular need in Spark for jackson 1.8.8 and not 1.9.13?
 Can we remove 1.8.8 and put 1.9.13 for Avro?
 It looks to me that all works fine when Spark build with jackson 1.9.13,
 but i am not an expert and not sure what should be tested.

 Thanks,
 Gil Vernik.



Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
After some googling / trial and error, I got the following working (against
a directory with a space in its name):

#!/usr/bin/env bash
OLDIFS=$IFS  # save it
IFS= # don't split on any white space
dir=$1/*
for f in $dir; do
  cat $f
done
IFS=$OLDIFS # restore IFS
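
A simpler variant that relies on quoting instead of changing IFS would be:

#!/usr/bin/env bash
for f in "$1"/*; do   # quoting "$1" keeps paths with spaces intact
  cat "$f"
done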

Cheers

On Wed, Feb 11, 2015 at 2:47 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 The tragic thing here is that I was asked to review the patch that
 introduced this
 https://github.com/apache/spark/pull/3377#issuecomment-68077315, and
 totally missed it... :(

 On Wed Feb 11 2015 at 2:46:35 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 lol yeah, I changed the path for the email... turned out to be the issue
 itself.


 On Wed Feb 11 2015 at 2:43:09 PM Ted Yu yuzhih...@gmail.com wrote:

 I see.
 '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)

 On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Found it:

 https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-
 73058f8e51951ec0b4cb3d48ade91a1fR73

 GRRR BASH WORD SPLITTING

 My path has a space in it...

 Nick

 On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 This is what get:

 spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
 datanucleus-api-jdo-3.2.6.jar
 datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 spark-1.2.1-yarn-shuffle.jar
 spark-assembly-1.2.1-hadoop2.4.0.jar
 spark-examples-1.2.1-hadoop2.4.0.jar

 So that looks correct… Hmm.

 Nick
 ​

 On Wed Feb 11 2015 at 2:34:51 PM Ted Yu yuzhih...@gmail.com wrote:

 I downloaded 1.2.1 tar ball for hadoop 2.4
 I got:

 ls lib/
 datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
 spark-assembly-1.2.1-hadoop2.4.0.jar
 datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
  spark-examples-1.2.1-hadoop2.4.0.jar

 FYI

 On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
 sbin/start-all.sh
 on my OS X.

 Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoo
 p2.4/lib
 You need to build Spark before running this program.

 Did the same for 1.2.0 and it worked fine.

 Nick
 ​






Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
I downloaded 1.2.1 tar ball for hadoop 2.4
I got:

ls lib/
datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
spark-assembly-1.2.1-hadoop2.4.0.jar
datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
 spark-examples-1.2.1-hadoop2.4.0.jar

FYI

On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran sbin/start-all.sh
 on my OS X.

 Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
 You need to build Spark before running this program.

 Did the same for 1.2.0 and it worked fine.

 Nick
 ​



Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
I see.
'/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)

On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Found it:


 https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-73058f8e51951ec0b4cb3d48ade91a1fR73

 GRRR BASH WORD SPLITTING

 My path has a space in it...

 Nick

 On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 This is what get:

 spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
 datanucleus-api-jdo-3.2.6.jar
 datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 spark-1.2.1-yarn-shuffle.jar
 spark-assembly-1.2.1-hadoop2.4.0.jar
 spark-examples-1.2.1-hadoop2.4.0.jar

 So that looks correct… Hmm.

 Nick
 ​

 On Wed Feb 11 2015 at 2:34:51 PM Ted Yu yuzhih...@gmail.com wrote:

 I downloaded 1.2.1 tar ball for hadoop 2.4
 I got:

 ls lib/
 datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
 spark-assembly-1.2.1-hadoop2.4.0.jar
 datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
  spark-examples-1.2.1-hadoop2.4.0.jar

 FYI

 On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
 sbin/start-all.sh
 on my OS X.

 Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
 You need to build Spark before running this program.

 Did the same for 1.2.0 and it worked fine.

 Nick
 ​





Re: Intellij IDEA 14 env setup; NoClassDefFoundError when run examples

2015-01-31 Thread Ted Yu
Have you read / followed this ?

https://cwiki.apache.org/confluence/display/SPARK
/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA

Cheers

On Sat, Jan 31, 2015 at 8:01 PM, Yafeng Guo daniel.yafeng@gmail.com
wrote:

 Hi,

 I'm setting up a dev environment with Intellij IDEA 14. I selected profile
 scala-2.10, maven-3, hadoop 2.4, hive, hive 0.13.1. The compilation passed.
 But when I try to run LogQuery in examples, I met below issue:

 Connected to the target VM, address: '127.0.0.1:37182', transport:
 'socket'
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/SparkConf
 at org.apache.spark.examples.LogQuery$.main(LogQuery.scala:46)
 at org.apache.spark.examples.LogQuery.main(LogQuery.scala)
 Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 2 more
 Disconnected from the target VM, address: '127.0.0.1:37182', transport:
 'socket'

 anyone met similar issue before? Thanks a lot

 Regards,
 Ya-Feng



Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Ted Yu
HBaseConverter is in Spark source tree. Therefore I think it makes sense
for this improvement to be accepted so that the example is more useful.

Cheers

On Mon, Jan 5, 2015 at 7:54 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Hey

 These converters are actually just intended to be examples of how to set
 up a custom converter for a specific input format. The converter interface
 is there to provide flexibility where needed, although with the new
 SparkSQL data store interface the intention is that most common use cases
 can be handled using that approach rather than custom converters.

 The intention is not to have specific converters living in Spark core,
 which is why these are in the examples project.

 Having said that, if you wish to expand the example converter for others
 reference do feel free to submit a PR.

 Ideally though, I would think that various custom converters would be part
 of external projects that can be listed with http://spark-packages.org/ I
 see your project is already listed there.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Mon, Jan 5, 2015 at 5:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 In my opinion this would be useful - there was another thread where
 returning
 only the value of first column in the result was mentioned.

 Please create a SPARK JIRA and a pull request.

 Cheers

 On Mon, Jan 5, 2015 at 6:42 AM, tgbaggio gen.tan...@gmail.com wrote:

  Hi,
 
  In HBaseConverter.scala
  
 
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
  
  , the python converter HBaseResultToStringConverter return only the
 value
  of
  first column in the result. In my opinion, it limits the utility of
 this
  converter, because it returns only one value per row and moreover it
 loses
  the other information of record, such as column:cell, timestamp.
 
  Therefore, I would like to propose some modifications about
  HBaseResultToStringConverter which will be able to return all records
 in
  the
  hbase with more complete information: I have already written some code
 in
  pythonConverters.scala
  
 
 https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
  
  and it works
 
  Is it OK to modify the code in HBaseConverters.scala, please?
  Thanks a lot in advance.
 
  Cheers
  Gen
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-tp10001.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 





Re: Results of tests

2015-01-09 Thread Ted Yu
For a build which uses JUnit, we would see a summary such as the following (
https://builds.apache.org/job/HBase-TRUNK/6007/console):

Tests run: 2199, Failures: 0, Errors: 0, Skipped: 25


In 
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
, I don't see such statistics.


Looks like scalatest-maven-plugin can be enhanced :-)


On Fri, Jan 9, 2015 at 3:52 AM, Sean Owen so...@cloudera.com wrote:

 Hey Tony, the number of tests run could vary depending on how the
 build is configured. For example, YARN-related tests would only run
 when the yarn profile is turned on. Java 8 tests would only run under
 Java 8.

 Although I don't know that there's any reason to believe the IBM JVM
 has a problem with Spark, I see this issue that is potentially related
 to endian-ness : https://issues.apache.org/jira/browse/SPARK-2018 I
 don't know if that was a Spark issue. Certainly, would be good for you
 to investigate if you are interested in resolving it.

 The Jenkins output shows you exactly what tests were run and how --
 have a look at the logs.


 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull

 On Fri, Jan 9, 2015 at 9:15 AM, Tony Reix tony.r...@bull.net wrote:
  Hi Ted
 
  Thanks for the info.
  However, I'm still unable to understand how the page:
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
  has been built.
  This page contains details I do not find in the page you indicated to me:
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
 
  As an example, I'm still unable to find these details:
  Package                          Duration  Fail  Skip  Pass  Total
  org.apache.spark                 12 mn     0     1     247   248
  org.apache.spark.api.python      20 ms     0     0     2     2
  org.apache.spark.bagel           7.7 s     0     0     4     4
  org.apache.spark.broadcast       43 s      0     0     17    17
  org.apache.spark.deploy          16 s      0     0     29    29
  org.apache.spark.deploy.worker   0.55 s    0     0     12    12
  (each package name above links to .../testReport/<package>/ under the same job URL)
 
  
 
 
  Moreover, in my Ubuntu/x86_64 environment, I do not find 3745 tests and
 0 failures, but 3485 tests and 4 failures (when using Oracle JVM 1.7 ).
 When using IBM JVM, there are only 2566 tests and 5 failures (in same
 component: Streaming).
 
  On my PPC64BE (BE = Big-Endian) environment, the tests block after a couple
  hundred tests.
  Is Spark independent of Little/Big-Endian stuff ?
 
  On my PPC64LE (LE = Little-Endian) environment, I have 3485 tests only
 (like on Ubuntu/x86_64 with IBM JVM), with 6 or 285 failures...
 
  So, I need to learn more about how your Jenkins environment extracts
 details about the results.
  Moreover, which JVM is used ?
 
  Do you plan to use IBM JVM in order to check that Spark and IBM JVM are
 compatible ? (they already do not look to be compatible 100% ...).
 
  Thanks
 
  Tony
 
  IBM Coop Architect  Technical Leader
  Office : +33 (0) 4 76 29 72 67
  1 rue de Provence - 38432 Échirolles - France
  www.atos.nethttp://www.atos.net/
  
  De : Ted Yu [yuzhih...@gmail.com]
  Envoyé : jeudi 8 janvier 2015 17:43
  À : Tony Reix
  Cc : dev@spark.apache.org
  Objet : Re: Results of tests
 
  Here it is:
 
  [centos] $
 /home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5/bin/mvn
 -DHADOOP_PROFILE=hadoop-2.4 -Dlabel=centos -DskipTests -Phadoop-2.4 -Pyarn
 -Phive clean package
 
 
  You can find the above in
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE

Re: Results of tests

2015-01-09 Thread Ted Yu
I noticed that org.apache.spark.sql.hive.execution has a lot of tests
skipped.

Is there plan to enable these tests on Jenkins (so that there is no
regression across releases) ?

Cheers

On Fri, Jan 9, 2015 at 11:46 AM, Josh Rosen rosenvi...@gmail.com wrote:

 The Test Result pages for Jenkins builds shows some nice statistics for
 the test run, including individual test times:


 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/

 Currently this only covers the Java / Scala tests, but we might be able to
 integrate the PySpark tests here, too (I think it's just a matter of
 getting the Python test runner to generate the correct test result XML
 output).

 On Fri, Jan 9, 2015 at 10:47 AM, Ted Yu yuzhih...@gmail.com wrote:

 For a build which uses JUnit, we would see a summary such as the
 following (
 https://builds.apache.org/job/HBase-TRUNK/6007/console):

 Tests run: 2199, Failures: 0, Errors: 0, Skipped: 25


 In
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
 , I don't see such statistics.


 Looks like scalatest-maven-plugin can be enhanced :-)


 On Fri, Jan 9, 2015 at 3:52 AM, Sean Owen so...@cloudera.com wrote:

  Hey Tony, the number of tests run could vary depending on how the
  build is configured. For example, YARN-related tests would only run
  when the yarn profile is turned on. Java 8 tests would only run under
  Java 8.
 
  Although I don't know that there's any reason to believe the IBM JVM
  has a problem with Spark, I see this issue that is potentially related
  to endian-ness : https://issues.apache.org/jira/browse/SPARK-2018 I
  don't know if that was a Spark issue. Certainly, would be good for you
  to investigate if you are interested in resolving it.
 
  The Jenkins output shows you exactly what tests were run and how --
  have a look at the logs.
 
 
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
 
  On Fri, Jan 9, 2015 at 9:15 AM, Tony Reix tony.r...@bull.net wrote:
   Hi Ted
  
   Thanks for the info.
   However, I'm still unable to understand how the page:
  
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
   has been built.
   This page contains details I do not find in the page you indicated to
 me:
  
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
  
    As an example, I'm still unable to find these details (per-package rows from
    the test report page; columns are duration, failed, skipped, passed, total --
    the per-package links sit under the testReport URL above):
   
    org.apache.spark                 12 mn    0   1   247   248
    org.apache.spark.api.python      20 ms    0   0     2     2
    org.apache.spark.bagel           7.7 s    0   0     4     4
    org.apache.spark.broadcast       43 s     0   0    17    17
    org.apache.spark.deploy          16 s     0   0    29    29
    org.apache.spark.deploy.worker   0.55 s   0   0    12    12
    ...
   
  
   Moreover, in my Ubuntu/x86_64 environment, I do not find 3745 tests
 and
  0 failures, but 3485 tests and 4 failures (when using Oracle JVM 1.7 ).
  When using IBM JVM, there are only 2566 tests and 5 failures (in same
  component: Streaming).
  
   On my PPC64BE (BE = Big-Endian) environment, the tests hang after about two
  hundred tests.
   Is Spark independent of Little/Big-Endian issues ?
  
   On my PPC64LE (LE = Little-Endian) environment, I have 3485 tests only
  (like on Ubuntu/x86_64 with IBM JVM), with 6 or 285 failures...
  
   So, I need to learn more about how

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Ted Yu
In my opinion this would be useful - there was another thread where returning
only the value of the first column in the result was mentioned.

Please create a SPARK JIRA and a pull request.

Cheers

On Mon, Jan 5, 2015 at 6:42 AM, tgbaggio gen.tan...@gmail.com wrote:

 Hi,

 In  HBaseConverter.scala
 
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 
  , the python converter HBaseResultToStringConverter returns only the value of
  the first column in the result. In my opinion, this limits the utility of the
  converter, because it returns only one value per row and moreover it loses
  the other information of the record, such as column:cell and timestamp.

  Therefore, I would like to propose some modifications to
  HBaseResultToStringConverter so that it will be able to return all records in
  HBase with more complete information: I have already written some code in
 pythonConverters.scala
 
 https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
 
 and it works

 Is it OK to modify the code in HBaseConverters.scala, please?
 Thanks a lot in advance.

 Cheers
 Gen




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-tp10001.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Results of tests

2015-01-08 Thread Ted Yu
Here it is:

[centos] $ 
/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5/bin/mvn
-DHADOOP_PROFILE=hadoop-2.4 -Dlabel=centos -DskipTests -Phadoop-2.4
-Pyarn -Phive clean package


You can find the above in
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull


Cheers


On Thu, Jan 8, 2015 at 8:05 AM, Tony Reix tony.r...@bull.net wrote:

  Thanks !

  I've been able to see that there are 3745 tests for version 1.2.0 with
  profile Hadoop 2.4.
  However, on my side, the maximum number of tests I've seen is 3485... About 300
  tests are missing on my side.
 Which Maven option has been used for producing the report file used for
 building the page:

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
   ? (I'm not authorized to look at the configuration part)

 Thx !

 Tony

  --
 *De :* Ted Yu [yuzhih...@gmail.com]
 *Envoyé :* jeudi 8 janvier 2015 16:11
 *À :* Tony Reix
 *Cc :* dev@spark.apache.org
 *Objet :* Re: Results of tests

   Please take a look at https://amplab.cs.berkeley.edu/jenkins/view/Spark/

 On Thu, Jan 8, 2015 at 5:40 AM, Tony Reix tony.r...@bull.net wrote:

 Hi,
 I'm checking that Spark works fine on a new environment (PPC64 hardware).
 I've found some issues, with versions 1.1.0, 1.1.1, and 1.2.0, even when
 running on Ubuntu on x86_64 with Oracle JVM. I'd like to know where I can
 find the results of the tests of Spark, for each version and for the
 different versions, in order to have a reference to compare my results
 with. I cannot find them on Spark web-site.
 Thx
 Tony





Re: Wrong version on the Spark documentation page

2015-03-15 Thread Ted Yu
When I enter  http://spark.apache.org/docs/latest/ into Chrome address bar,
I saw 1.3.0

Cheers

On Sun, Mar 15, 2015 at 11:12 AM, Patrick Wendell pwend...@gmail.com
wrote:

 Cheng - what if you hold shift+refresh? For me the /latest link
 correctly points to 1.3.0

 On Sun, Mar 15, 2015 at 10:40 AM, Cheng Lian lian.cs@gmail.com
 wrote:
  It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/
 
  But this page is updated (1.3.0)
  http://spark.apache.org/docs/latest/index.html
 
  Cheng
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Error: 'SparkContext' object has no attribute 'getActiveStageIds'

2015-03-20 Thread Ted Yu
Please take a look
at core/src/main/scala/org/apache/spark/SparkStatusTracker.scala, around
line 58:
  def getActiveStageIds(): Array[Int] = {

Cheers
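A minimal Scala sketch of the intended usage - the method lives on the status
tracker obtained from the context, not on SparkContext itself (the app name and
master below are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("status-tracker-demo").setMaster("local[2]"))
    val tracker = sc.statusTracker
    for (stageId <- tracker.getActiveStageIds(); info <- tracker.getStageInfo(stageId)) {
      println(s"stage $stageId: ${info.numCompletedTasks()}/${info.numTasks()} tasks done")
    }
    sc.stop()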

On Fri, Mar 20, 2015 at 3:59 PM, xing ehomec...@gmail.com wrote:

 getStageInfo in self._jtracker.getStageInfo below seems not to be
 implemented/included in the current python library.

def getStageInfo(self, stageId):
    """
    Returns a :class:`SparkStageInfo` object, or None if the stage
    info could not be found or was garbage collected.
    """
    stage = self._jtracker.getStageInfo(stageId)
    if stage is not None:
        # TODO: fetch them in batch for better performance
        attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
        return SparkStageInfo(stageId, *attrs)



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Error-SparkContext-object-has-no-attribute-getActiveStageIds-tp11136p11140.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: GitHub Syncing Down

2015-03-11 Thread Ted Yu
Looks like github is functioning again (I no longer encounter this problem
when pushing to hbase repo).

Do you want to give it a try ?

Cheers

On Tue, Mar 10, 2015 at 6:54 PM, Michael Armbrust mich...@databricks.com
wrote:

 FYI: https://issues.apache.org/jira/browse/INFRA-9259



Re: Jira Issues

2015-03-25 Thread Ted Yu
Issues are tracked on Apache JIRA:
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

Cheers

On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote:

 Hi there Guys.

 I want to be more collaborative to Spark, but I have two questions.


 Issues are used in Github or jira Issues?

 If so on Jira, Is there a way I can get in to see the issues?

 I've tried to login but no success.


 I'm PMC from another Apache project, flex.apache.org


 Best Regards
 Igor



Re: should we add a start-masters.sh script in sbin?

2015-03-31 Thread Ted Yu
Sounds good to me.

On Tue, Mar 31, 2015 at 6:12 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:

 Hey,

 The start-slaves.sh script is able to read from the slaves file and start slave
 nodes on multiple boxes.
 However, in standalone mode, if I want to use multiple masters, I’ll have to
 start masters on each individual box, and also need to provide the list of
 masters’ hostname+port to each worker. (start-slaves.sh only takes 1 master
 ip+port for now.)
 I wonder whether we should create a new script called start-masters.sh to read
 the conf/masters file? Also, the start-slaves.sh script may need to change a little
 bit so that the master list can be passed to worker nodes.

 Thanks

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Ted Yu
bq. writing the output (to Amazon S3) failed

What's the value of fs.s3.maxRetries ?
Increasing the value should help.

Cheers
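For illustration, a sketch of raising that setting from the driver; the property
names are Hadoop's S3 retry knobs and the values below are just examples:

    // Hadoop's S3 filesystem retries transient failures up to fs.s3.maxRetries
    // times, sleeping fs.s3.sleepTimeSeconds between attempts.
    sc.hadoopConfiguration.setInt("fs.s3.maxRetries", 20)
    sc.hadoopConfiguration.setInt("fs.s3.sleepTimeSeconds", 10)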

On Wed, Apr 1, 2015 at 8:34 AM, Romi Kuntsman r...@totango.com wrote:

 What about communication errors and not corrupted files?
 Both when reading input and when writing output.
 We currently experience a failure of the entire process, if the last stage
 of writing the output (to Amazon S3) failed because of a very temporary DNS
 resolution issue (easily resolved by retrying).

 *Romi Kuntsman*, *Big Data Engineer*
  http://www.totango.com

 On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik g...@il.ibm.com wrote:

   I actually saw the same issue, where we analyzed a container with a few
   hundred GBs of zip files - one was corrupted and Spark exited with an
   Exception on the entire job.
   I like SPARK-6593, since it can also cover additional cases, not just the
   case of corrupted zip files.
 
 
 
  From:   Dale Richardson dale...@hotmail.com
  To: dev@spark.apache.org dev@spark.apache.org
  Date:   29/03/2015 11:48 PM
  Subject:One corrupt gzip in a directory of 100s
 
 
 
  Recently had an incident reported to me where somebody was analysing a
  directory of gzipped log files, and was struggling to load them into
 spark
  because one of the files was corrupted - calling
  sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
  executor that was reading that file, which caused the entire job to be
  cancelled after the retry count was exceeded, without any way of catching
  and recovering from the error.  While normally I think it is entirely
  appropriate to stop execution if something is wrong with your input,
  sometimes it is useful to analyse what you can get (as long as you are
  aware that input has been skipped), and treat corrupt files as acceptable
  losses.
  To cater for this particular case I've added SPARK-6593 (PR at
  https://github.com/apache/spark/pull/5250). Which adds an option
  (spark.hadoop.ignoreInputErrors) to log exceptions raised by the hadoop
  Input format, but to continue on with the next task.
  Ideally in this case you would want to report the corrupt file paths back
  to the master so they could be dealt with in a particular way (eg moved
 to
  a separate directory), but that would require a public API
  change/addition. I was pondering on an addition to Spark's hadoop API
 that
  could report processing status back to the master via an optional
  accumulator that collects filepath/Option(exception message) tuples so
 the
  user has some idea of what files are being processed, and what files are
  being skipped.
  Regards, Dale.
 



Re: trouble with sbt building network-* projects?

2015-02-27 Thread Ted Yu
bq. to be able to run my tests in sbt, though, it makes the development
iterations much faster.

Was the preference for sbt due to long maven build time ?
Have you started Zinc on your machine ?

Cheers

On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com wrote:

 Has anyone else noticed very strange build behavior in the network-*
 projects?

 maven seems to be doing the right thing, but sbt is very inconsistent.
 Sometimes when it builds network-shuffle it doesn't know about any of the
 code in network-common.  Sometimes it will completely skip the java unit
 tests.  And then some time later, it'll suddenly decide it knows about some
 more of the java unit tests.  Its not from a simple change, like touching a
 test file, or a file the test depends on -- nor a restart of sbt.  I am
 pretty confused.


 maven had issues when I tried to add scala code to network-common, it would
 compile the scala code but not make it available to java.  I'm working
 around that by just coding in java anyhow.  I'd really like to be able to
 run my tests in sbt, though, it makes the development iterations much
 faster.

 thanks,
 Imran



Re: trouble with sbt building network-* projects?

2015-02-27 Thread Ted Yu
bq. I have to keep cd'ing into network/common, run mvn install, then go
back to network/shuffle and run some other mvn command over there.

Yeah - been through this.

Having continuous testing for maven would be nice.

On Fri, Feb 27, 2015 at 11:31 AM, Imran Rashid iras...@cloudera.com wrote:

 well, perhaps I just need to learn to use maven better, but currently I
 find sbt much more convenient for continuously running my tests.  I do use
 zinc, but I'm looking for continuous testing.  This makes me think I need
 sbt for that:
 http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven

 1) I really like that in sbt I can run ~test-only
 com.foo.bar.SomeTestSuite (or whatever other pattern) and just leave that
 running as I code, without having to go and explicitly trigger mvn test
 and wait for the result.

 2) I find sbt's handling of sub-projects much simpler (when it works).
 I'm trying to make changes to network/common & network/shuffle, which means
 I have to keep cd'ing into network/common, run mvn install, then go back to
 network/shuffle and run some other mvn command over there.  I don't want to
 run mvn at the root project level, b/c I don't want to wait for it to
 compile all the other projects when I just want to run tests in
 network/common.  Even with incremental compiling, in my day-to-day coding I
 want to entirely skip compiling sql, graphx, mllib etc. -- I have to switch
 branches often enough that i end up triggering a full rebuild of those
 projects even when I haven't touched them.





 On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. to be able to run my tests in sbt, though, it makes the development
 iterations much faster.

 Was the preference for sbt due to long maven build time ?
 Have you started Zinc on your machine ?

 Cheers

 On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com
 wrote:

 Has anyone else noticed very strange build behavior in the network-*
 projects?

 maven seems to be doing the right thing, but sbt is very inconsistent.
 Sometimes when it builds network-shuffle it doesn't know about any of the
 code in network-common.  Sometimes it will completely skip the java unit
 tests.  And then some time later, it'll suddenly decide it knows about
 some
 more of the java unit tests.  It's not from a simple change, like
 touching a
 test file, or a file the test depends on -- nor a restart of sbt.  I am
 pretty confused.


 maven had issues when I tried to add scala code to network-common, it
 would
 compile the scala code but not make it available to java.  I'm working
 around that by just coding in java anyhow.  I'd really like to be able to
 run my tests in sbt, though, it makes the development iterations much
 faster.

 thanks,
 Imran






Re: org.spark-project.jetty and guava repo locations

2015-04-02 Thread Ted Yu
Take a look at the maven-shade-plugin in pom.xml.
Here is the snippet for org.spark-project.jetty :

<relocation>
  <pattern>org.eclipse.jetty</pattern>
  <shadedPattern>org.spark-project.jetty</shadedPattern>
  <includes>
    <include>org.eclipse.jetty.**</include>
  </includes>
</relocation>

On Thu, Apr 2, 2015 at 3:59 AM, Niranda Perera niranda.per...@gmail.com
wrote:

 Hi,

 I am looking for the org.spark-project.jetty and org.spark-project.guava
 repo locations but I'm unable to find them in the Maven repository.

 are these publicly available?

 rgds

 --
 Niranda



Re: [sql] Dataframe how to check null values

2015-04-20 Thread Ted Yu
I found:
https://issues.apache.org/jira/browse/SPARK-6573



 On Apr 20, 2015, at 4:29 AM, Peter Rudenko petro.rude...@gmail.com wrote:
 
 Sounds very good. Is there a jira for this? Would be cool to have in 1.4,
 because currently I cannot use the dataframe.describe function with NaN values
 and need to filter all the columns manually.
 
 Thanks,
 Peter Rudenko
 
 On 2015-04-02 21:18, Reynold Xin wrote:
 Incidentally, we were discussing this yesterday. Here are some thoughts on 
 null handling in SQL/DataFrames. Would be great to get some feedback.
 
 1. Treat floating point NaN and null as the same null value. This would be 
 consistent with most SQL databases, and Pandas. This would also require some 
 inbound conversion.
 
 2. Internally, when we see a NaN value, we should mark the null bit as true, 
 and keep the NaN value. When we see a null value for a floating point field, 
 we should mark the null bit as true, and update the field to store NaN.
 
 3. Externally, for floating point values, return NaN when the value is null.
 
 4. For all other types, return null for null values.
 
 5. For UDFs, if the argument is primitive type only (i.e. does not handle 
 null) and not a floating point field, simply evaluate the expression to 
 null. This is consistent with most SQL UDFs and most programming languages' 
 treatment of NaN.
 
 
 Any thoughts on this semantics?
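  A rough sketch of the rules above for a single nullable double field (just an
  illustration of the proposed semantics, not the actual SQL internals):

      // Internal representation: a null bit plus the raw double slot.
      case class DoubleField(isNull: Boolean, value: Double)

      // Inbound: NaN and null both set the null bit, the slot holds NaN (rules 1-2).
      def toInternal(v: java.lang.Double): DoubleField =
        if (v == null || v.isNaN) DoubleField(isNull = true, value = Double.NaN)
        else DoubleField(isNull = false, value = v)

      // Outbound: a null floating point field reads back as NaN (rule 3).
      def toExternal(f: DoubleField): Double =
        if (f.isNull) Double.NaN else f.value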
 
 
 On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler deanwamp...@gmail.com 
 mailto:deanwamp...@gmail.com wrote:
 
I'm afraid you're a little stuck. In Scala, the types Int, Long,
Float,
Double, Byte, and Boolean look like reference types in source
code, but
they are compiled to the corresponding JVM primitive types, which
can't be
null. That's why you get the warning about ==.
 
It might be your best choice is to use NaN as the placeholder for
null,
then create one DF using a filter that removes those values. Use
that DF to
compute the mean. Then apply a map step to the original DF to
translate the
NaN's to the mean.
 
dean
 
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com
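  A minimal sketch of that NaN-placeholder approach against the 1.3 DataFrame
  API, assuming the column is named "d" as in Peter's example below and missing
  values are stored as Double.NaN (df and the column name are illustrative):

      import org.apache.spark.sql.functions.{avg, udf}

      val notNaN = udf((d: Double) => !d.isNaN)
      // mean over the non-missing values only
      val meanValue = df.filter(notNaN(df("d"))).agg(avg(df("d"))).first().getDouble(0)
      // then translate the NaN placeholders to that mean
      val impute = udf((d: Double) => if (d.isNaN) meanValue else d)
      val imputed = df.withColumn("d_imputed", impute(df("d")))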
 
On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko
petro.rude...@gmail.com mailto:petro.rude...@gmail.com
wrote:
 
  Hi, I need to implement MeanImputor - impute missing values with the mean.
  If I set missing values to null, then dataframe aggregation works properly,
  but in a UDF the null values are treated as 0.0. Here’s an example:

  val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
  df.agg(avg("_1")).first // res45: org.apache.spark.sql.Row = [2.75]
  df.withColumn("d2", callUDF({(value: Double) => value}, DoubleType, df("d"))).show()
  // d     d2
  // 1.0   1.0
  // 2.0   2.0
  // null  0.0
  // 3.0   3.0
  // 5.0   5.0
  // null  0.0
  val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
  df.agg(avg("_1")).first // res46: org.apache.spark.sql.Row = [Double.NaN]

  In the UDF I cannot compare scala’s Double to null:

  comparing values of types Double and Null using `==' will always yield false
  [warn] if (value == null) meanValue else value

  With Double.NaN instead of null I can compare in the UDF, but aggregation
  doesn’t work properly. Maybe it’s related to:
  https://issues.apache.org/jira/browse/SPARK-6573

 Thanks,
 Peter Rudenko

 ​

 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] DataFrame function namespacing

2015-04-30 Thread Ted Yu
IMHO I would go with choice #1

Cheers

On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote:

 We definitely still have the name collision problem in SQL.

 On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal 
 punya.bis...@gmail.com
  wrote:

  Do we still have to keep the names of the functions distinct to avoid
  collisions in SQL? Or is there a plan to allow importing a namespace
 into
  SQL somehow?
 
  I ask because if we have to keep worrying about name collisions then I'm
  not sure what the added complexity of #2 and #3 buys us.
 
  Punya
 
  On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:
 
  Scaladoc isn't much of a problem because scaladocs are grouped.
  Java/Python
  is the main problem ...
 
  See
 
 
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
 
  On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman 
  shiva...@eecs.berkeley.edu wrote:
 
   My feeling is that we should have a handful of namespaces (say 4 or
 5).
  It
   becomes too cumbersome to import / remember more package names and
  having
   everything in one package makes it hard to read scaladoc etc.
  
   Thanks
   Shivaram
  
   On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com
  wrote:
  
   To add a little bit more context, some pros/cons I can think of are:
  
   Option 1: Very easy for users to find the function, since they are
 all
  in
   org.apache.spark.sql.functions. However, there will be quite a large
   number
   of them.
  
   Option 2: I can't tell why we would want this one over Option 3,
 since
  it
   has all the problems of Option 3, and not as nice of a hierarchy.
  
   Option 3: Opposite of Option 1. Each package or static class has a
  small
   number of functions that are relevant to each other, but for some
   functions
   it is unclear where they should go (e.g. should min go into basic
 or
   math?)
  
  
  
  
   On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com
  wrote:
  
Before we make DataFrame non-alpha, it would be great to decide how
  we
want to namespace all the functions. There are 3 alternatives:
   
1. Put all in org.apache.spark.sql.functions. This is how SQL does
  it,
since SQL doesn't have namespaces. I estimate eventually we will
  have ~
   200
functions.
   
2. Have explicit namespaces, which is what master branch currently
  looks
like:
   
- org.apache.spark.sql.functions
- org.apache.spark.sql.mathfunctions
- ...
   
3. Have explicit namespaces, but restructure them slightly so
  everything
is under functions.
   
package object functions {
   
  // all the old functions here -- but deprecated so we keep source
compatibility
  def ...
}
   
package org.apache.spark.sql.functions
   
object mathFunc {
  ...
}
   
object basicFuncs {
  ...
}
   
   
   
  
  
  
 
 



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ted Yu
+1 on ending support for Java 6.

BTW from https://www.java.com/en/download/faq/java_7.xml :
After April 2015, Oracle will no longer post updates of Java SE 7 to its
public download sites.

On Thu, Apr 30, 2015 at 1:34 PM, Punyashloka Biswal punya.bis...@gmail.com
wrote:

 I'm in favor of ending support for Java 6. We should also articulate a
 policy on how long we want to support current and future versions of Java
 after Oracle declares them EOL (Java 7 will be in that bucket in a matter
 of days).

 Punya
 On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote:

  something to keep in mind:  we can easily support java 6 for the build
  environment, particularly if there's a definite EOL.
 
  i'd like to fix our java versioning 'problem', and this could be a big
  instigator...  right now we're hackily setting java_home in test
 invocation
  on jenkins, which really isn't the best.  if i decide, within jenkins, to
  reconfigure every build to 'do the right thing' WRT java version, then i
  will clean up the old mess and pay down on some technical debt.
 
  or i can just install java 6 and we use that as JAVA_HOME on a
  build-by-build basis.
 
  this will be a few days of prep and another morning-long downtime if i do
  the right thing (within jenkins), and only a couple of hours the hacky
 way
  (system level).
 
  either way, we can test on java 6.  :)
 
  On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
   nicholas started it! :)
  
   for java 6 i would have said the same thing about 1 year ago: it is
  foolish
   to drop it. but i think the time is right about now.
   about half our clients are on java 7 and the other half have active
 plans
   to migrate to it within 6 months.
  
   On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com
  wrote:
  
Guys thanks for chiming in, but please focus on Java here. Python is
 an
entirely separate issue.
   
   
On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
   wrote:
   
i am not sure eol means much if it is still actively used. we have a
  lot
of clients with centos 5 (for which we still support python 2.4 in
  some
form or another, fun!). most of them are on centos 6, which means
  python
2.6. by cutting out python 2.6 you would cut out the majority of the
   actual
clusters i am aware of. unless you intention is to truly make
  something
academic i dont think that is wise.
   
On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:
   
(On that note, I think Python 2.6 should be next on the chopping
  block
sometime later this year, but that’s for another thread.)
   
(To continue the parenthetical, Python 2.6 was in fact EOL-ed in
   October
of
2013. https://www.python.org/download/releases/2.6.9/)
​
   
On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
nicholas.cham...@gmail.com
wrote:
   
 I understand the concern about cutting out users who still use
 Java
   6,
and
 I don't have numbers about how many people are still using Java
 6.

 But I want to say at a high level that I support deprecating
 older
 versions of stuff to reduce our maintenance burden and let us use
   more
 modern patterns in our code.

 Maintenance always costs way more than initial development over
 the
 lifetime of a project, and for that reason anti-support is just
  as
 important as support.

 (On that note, I think Python 2.6 should be next on the chopping
   block
 sometime later this year, but that's for another thread.)

 Nick


 On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
 
wrote:

 This has been discussed a few times in the past, but now Oracle
  has
ended
 support for Java 6 for over a year, I wonder if we should just
  drop
Java 6
 support.

 There is one outstanding issue Tom has brought to my attention:
PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding
   pull
 request to fix that.

 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920


   
   
   
   
  
 



Re: [discuss] ending support for Java 6?

2015-05-02 Thread Ted Yu
+1

On Sat, May 2, 2015 at 1:09 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 We could build with the minimum JDK we support for testing PRs - which would
 automatically cause build failures in case code uses newer APIs?

 Regards,
 Mridul

 On Fri, May 1, 2015 at 2:46 PM, Reynold Xin r...@databricks.com wrote:
  It's really hard to inspect API calls since none of us have the Java
  standard library in our brain. The only way we can enforce this is to
 have
  it in Jenkins, and Tom you are currently our mini-Jenkins server :)
 
  Joking aside, looks like we should support Java 6 in 1.4, and in the
  release notes include a message saying starting in 1.5 we will drop Java
 6
  support.
 
 
 
 
  On Fri, May 1, 2015 at 2:00 PM, Thomas Graves tgra...@yahoo-inc.com
 wrote:
 
  Hey folks,
 
  2 more things that broke jdk6 got committed last night/today.  Please
  watch the java api's being used until we choose to deprecate jdk6.
 
  Tom
 
 
 
On Thursday, April 30, 2015 2:04 PM, Reynold Xin r...@databricks.com
 
  wrote:
 
 
  This has been discussed a few times in the past, but now Oracle has
 ended
  support for Java 6 for over a year, I wonder if we should just drop
 Java 6
  support.
 
  There is one outstanding issue Tom has brought to my attention: PySpark
 on
  YARN doesn't work well with Java 7/8, but we have an outstanding pull
  request to fix that.
 
  https://issues.apache.org/jira/browse/SPARK-6869
  https://issues.apache.org/jira/browse/SPARK-1920
 
 
 
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Speeding up Spark build during development

2015-05-01 Thread Ted Yu
Pramod:
Please remember to run Zinc so that the build is faster.

Cheers

On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi Pramod,

 For cluster-like tests you might want to use the same code as in mllib's
 LocalClusterSparkContext. You can rebuild only the package that you change
 and then run this main class.

 Best regards, Alexander
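  For reference, the master string that trick relies on looks like the sketch
  below (worker count, cores and memory are illustrative; local-cluster launches
  real worker JVMs, so it needs a locally built Spark):

      import org.apache.spark.{SparkConf, SparkContext}

      // "local-cluster[N, coresPerWorker, memoryPerWorkerMB]" exercises
      // serialization paths that plain "local[N]" skips.
      val conf = new SparkConf()
        .setMaster("local-cluster[2, 1, 512]")
        .setAppName("cluster-like-test")
      val sc = new SparkContext(conf)
      try {
        println(sc.parallelize(1 to 100, 4).map(_ * 2).sum())
      } finally {
        sc.stop()
      }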

 -Original Message-
 From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
 Sent: Friday, May 01, 2015 1:46 AM
 To: dev@spark.apache.org
 Subject: Speeding up Spark build during development

 Hi,
 I'm making some small changes to the Spark codebase and trying it out on a
 cluster. I was wondering if there's a faster way to build than running the
 package target each time.
 Currently I'm using: mvn -DskipTests  package

 All the nodes have the same filesystem mounted at the same mount point.

 Pramod



Re: Mima test failure in the master branch?

2015-04-30 Thread Ted Yu
Looks like this has been taken care of:

commit beeafcfd6ee1e460c4d564cd1515d8781989b422
Author: Patrick Wendell patr...@databricks.com
Date:   Thu Apr 30 20:33:36 2015 -0700

Revert [SPARK-5213] [SQL] Pluggable SQL Parser Support

On Thu, Apr 30, 2015 at 7:58 PM, zhazhan zzh...@hortonworks.com wrote:

 [info] spark-sql: found 1 potential binary incompatibilities (filtered 129)
 [error] * method sqlParser()org.apache.spark.sql.SparkSQLParser in class
 org.apache.spark.sql.SQLContext does not have a correspondent in new
 version
 [error] filter with: ProblemFilters.excludeMissingMethodProblem



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Mima-test-failure-in-the-master-branch-tp11949.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ted Yu
But it is hard to know how long customers stay with their most recent
download.

Cheers

On Thu, Apr 30, 2015 at 2:26 PM, Sree V sree_at_ch...@yahoo.com.invalid
wrote:

 If there is any possibility of getting the download counts, then we can use
 it as EOS criteria as well. Say, if download counts are lower than 30% (or
 another number) of the lifetime highest, then it qualifies for EOS.

 Thanking you.

 With Regards
 Sree


  On Thursday, April 30, 2015 2:22 PM, Sree V
 sree_at_ch...@yahoo.com.INVALID wrote:


  Hi Team,
 Should we take this opportunity to lay out and evangelize a pattern for EOL
 of dependencies? I propose we follow the official EOL of java, python,
 scala, etc. and add, say, 6-12-24 months depending on the popularity.
 Java 6 official EOL: Feb 2013. Add 6-12 months: Aug 2013 - Feb 2014 as the
 official End of Support for Java 6 in Spark. Announce 3-6 months prior to EOS.

 Thanking you.

 With Regards
 Sree


 On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin 
 van...@cloudera.com wrote:


  As for the idea, I'm +1. Spark is the only reason I still have jdk6
 around - exactly because I don't want to cause the issue that started
 this discussion (inadvertently using JDK7 APIs). And as has been
 pointed out, even J7 is about to go EOL real soon.

 Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is
 already j7-only. And when Hadoop moves away from something, it's an
 event worthy of headlines. They're still on Jetty 6!

 As for pyspark, https://github.com/apache/spark/pull/5580 should get
 rid of the last incompatibility with large assemblies, by keeping the
 python files in separate archives. If we remove support for Java 6,
 then we don't need to worry about the size of the assembly anymore.

 On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:
  I'm firmly in favor of this.
 
  It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and
  avoid any more of the long-standing 64K file limit thing that's still
  a problem for PySpark.

 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org









Re: unable to extract tgz files downloaded from spark

2015-05-06 Thread Ted Yu
From which site did you download the tar ball ?

Which package type did you choose (pre-built for which distro) ?

Thanks

On Wed, May 6, 2015 at 7:16 PM, Praveen Kumar Muthuswamy 
muthusamy...@gmail.com wrote:

 Hi
 I have been trying to install the latest Spark version and downloaded the .tgz
 files (e.g. spark-1.3.1.tgz). But I could not extract them. It complains of an
 invalid tar format.
 Has anyone seen this issue ?

 Thanks
 Praveen



Re: Recent Spark test failures

2015-05-11 Thread Ted Yu
Makes sense.

Having high determinism in these tests would make Jenkins build stable.

On Mon, May 11, 2015 at 1:08 PM, Andrew Or and...@databricks.com wrote:

 Hi Ted,

 Yes, those two options can be useful, but in general I think the standard
 to set is that tests should never fail. It's actually the worst if tests
 fail sometimes but not others, because we can't reproduce them
 deterministically. Using -M and -A actually tolerates flaky tests to a
 certain extent, and I would prefer to instead increase the determinism in
 these tests.

 -Andrew

 2015-05-08 17:56 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 Andrew:
 Do you think the -M and -A options described here can be used in test
 runs ?
 http://scalatest.org/user_guide/using_the_runner

 Cheers

 On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote:

 Dear all,

 I'm sure you have all noticed that the Spark tests have been fairly
 unstable recently. I wanted to share a tool that I use to track which
 tests
 have been failing most often in order to prioritize fixing these flaky
 tests.

 Here is an output of the tool. This spreadsheet reports the top 10 failed
 tests this week (ending yesterday 5/5):

 https://docs.google.com/spreadsheets/d/1Iv_UDaTFGTMad1sOQ_s4ddWr6KD3PuFIHmTSzL7LSb4

 It is produced by a small project:
 https://github.com/andrewor14/spark-test-failures

 I have been filing JIRAs on flaky tests based on this tool. Hopefully we
 can collectively stabilize the build a little more as we near the release
 for Spark 1.4.

 -Andrew






Re: [PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Ted Yu
In Row#equals():

  while (i < len) {
if (apply(i) != that.apply(i)) {

'!=' should be !apply(i).equals(that.apply(i)) ?

Cheers
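For context, in Scala `!=` on objects already delegates to equals (with null
handling), unlike Java's reference comparison, so the two spellings behave the
same here:

    val a = new String("spark")
    val b = new String("spark")
    println(a == b)   // true  - == calls equals (with null checks)
    println(a != b)   // false - != is the negation of ==
    println(a eq b)   // false - reference equality is spelled eq/ne in Scala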

On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 This is really strange.

  # Spark 1.3.1
  print type(results)
 <class 'pyspark.sql.dataframe.DataFrame'>

  a = results.take(1)[0]

  print type(a)
 <class 'pyspark.sql.types.Row'>

  print pyspark.sql.types.Row
 <class 'pyspark.sql.types.Row'>

  print type(a) == pyspark.sql.types.Row
 False
  print isinstance(a, pyspark.sql.types.Row)
 False

 If I set a as follows, then the type checks pass fine.

 a = pyspark.sql.types.Row('name')('Nick')

 Is this a bug? What can I do to narrow down the source?

 results is a massive DataFrame of spark-perf results.

 Nick
 ​



Re: Build fail...

2015-05-08 Thread Ted Yu
Looks like you're right:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console

[error] 
/home/jenkins/workspace/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:370:
value tryWithSafeFinally is not a member of object
org.apache.spark.util.Utils
[error] Utils.tryWithSafeFinally {
[error]   ^


FYI


On Fri, May 8, 2015 at 6:53 PM, rtimp dolethebobdol...@gmail.com wrote:

 Hi,

 From what I myself noticed a few minutes ago, I think branch-1.3 might be
 failing to compile due to the most recent commit. I tried reverting to
 commit 7fd212b575b6227df5068844416e51f11740e771 (the commit prior to the
 head) on that branch and recompiling, and was successful. As Ferris would
 say, it is so choice.



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Build-fail-tp12170p12171.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: jackson.databind exception in RDDOperationScope.jsonMapper.writeValueAsString(this)

2015-05-06 Thread Ted Yu
Looks like mismatch of jackson version.
Spark uses:
<fasterxml.jackson.version>2.4.4</fasterxml.jackson.version>

FYI
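A quick diagnostic sketch for this kind of NoSuchMethodError - print which
jackson-databind actually ended up on the classpath (run it from a spark-shell
or a test):

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.databind.cfg.PackageVersion

    // where the class was loaded from, and which version it reports
    println(classOf[ObjectMapper].getProtectionDomain.getCodeSource.getLocation)
    println(PackageVersion.VERSION)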

On Wed, May 6, 2015 at 8:00 AM, A.M.Chan kaka_1...@163.com wrote:

 Hey, guys. I met this exception while testing SQL/Columns.
 I didn't change the pom or the core project.
 This morning it was fine to test my PR.
 I don't know what happened.


 An exception or error caused a run to abort:
 com.fasterxml.jackson.databind.introspect.POJOPropertyBuilder.addField(Lcom/fasterxml/jackson/databind/introspect/AnnotatedField;Lcom/fasterxml/jackson/databind/PropertyName;ZZZ)V
 java.lang.NoSuchMethodError:
 com.fasterxml.jackson.databind.introspect.POJOPropertyBuilder.addField(Lcom/fasterxml/jackson/databind/introspect/AnnotatedField;Lcom/fasterxml/jackson/databind/PropertyName;ZZZ)V
 at
 com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector.com
 $fasterxml$jackson$module$scala$introspect$ScalaPropertiesCollector$$_addField(ScalaPropertiesCollector.scala:109)
 at
 com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector$$anonfun$_addFields$2$$anonfun$apply$11.apply(ScalaPropertiesCollector.scala:100)
 at
 com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector$$anonfun$_addFields$2$$anonfun$apply$11.apply(ScalaPropertiesCollector.scala:99)
 at scala.Option.foreach(Option.scala:236)
 at
 com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector$$anonfun$_addFields$2.apply(ScalaPropertiesCollector.scala:99)
 at
 com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector$$anonfun$_addFields$2.apply(ScalaPropertiesCollector.scala:93)
 at
 scala.collection.GenTraversableViewLike$Filtered$$anonfun$foreach$4.apply(GenTraversableViewLike.scala:109)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.SeqLike$$anon$2.foreach(SeqLike.scala:635)
 at
 scala.collection.GenTraversableViewLike$Filtered$class.foreach(GenTraversableViewLike.scala:108)
 at scala.collection.SeqViewLike$$anon$5.foreach(SeqViewLike.scala:80)
 at
 com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector._addFields(ScalaPropertiesCollector.scala:93)
 at
 com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collect(POJOPropertiesCollector.java:233)
 at
 com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.collectProperties(BasicClassIntrospector.java:142)
 at
 com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.forSerialization(BasicClassIntrospector.java:68)
 at
 com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.forSerialization(BasicClassIntrospector.java:11)
 at
 com.fasterxml.jackson.databind.SerializationConfig.introspect(SerializationConfig.java:530)
 at
 com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:133)
 at
 com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1077)
 at
 com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1037)
 at
 com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:445)
 at
 com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:599)
 at
 com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:93)
 at
 com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2811)
 at
 com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2268)
 at
 org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:51)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:124)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:99)
 at org.apache.spark.SparkContext.withScope(SparkContext.scala:671)
 at org.apache.spark.SparkContext.parallelize(SparkContext.scala:685)





 --

 A.M.Chan


Re: Recent Spark test failures

2015-05-08 Thread Ted Yu
Andrew:
Do you think the -M and -A options described here can be used in test runs ?
http://scalatest.org/user_guide/using_the_runner

Cheers

On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote:

 Dear all,

 I'm sure you have all noticed that the Spark tests have been fairly
 unstable recently. I wanted to share a tool that I use to track which tests
 have been failing most often in order to prioritize fixing these flaky
 tests.

 Here is an output of the tool. This spreadsheet reports the top 10 failed
 tests this week (ending yesterday 5/5):

 https://docs.google.com/spreadsheets/d/1Iv_UDaTFGTMad1sOQ_s4ddWr6KD3PuFIHmTSzL7LSb4

 It is produced by a small project:
 https://github.com/andrewor14/spark-test-failures

 I have been filing JIRAs on flaky tests based on this tool. Hopefully we
 can collectively stabilize the build a little more as we near the release
 for Spark 1.4.

 -Andrew



Re: How to link code pull request with JIRA ID?

2015-05-13 Thread Ted Yu
Subproject tag should follow SPARK JIRA number.
e.g.

[SPARK-5277][SQL] ...

Cheers

On Wed, May 13, 2015 at 11:50 AM, Stephen Boesch java...@gmail.com wrote:

 following up from Nicholas, it is

 [SPARK-12345] Your PR description

 where 12345 is the jira number.


 One thing I tend to forget is when/where to include the subproject tag e.g.
  [MLLIB]


 2015-05-13 11:11 GMT-07:00 Nicholas Chammas nicholas.cham...@gmail.com:

  That happens automatically when you open a PR with the JIRA key in the PR
  title.
 
  On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar 
  shekhar.kote...@gmail.com wrote:
 
   Hi,
  
    I am new to open source contribution and trying to understand the process
    starting from pulling code to uploading a patch.
  
   I have managed to pull code from GitHub. In JIRA I saw that each JIRA
  issue
    is connected with a pull request. I would like to know how people attach
    pull request details to a JIRA issue?
  
   Thanks,
   Chandrash3khar Kotekar
   Mobile - +91 8600011455
  
 



Re: Recent Spark test failures

2015-05-15 Thread Ted Yu
Jenkins build against hadoop 2.4 has been unstable recently:
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/

I haven't found the test which hung / failed in recent Jenkins builds.

But PR builder has several green builds lately:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

Maybe PR builder doesn't build against hadoop 2.4 ?

Cheers

On Mon, May 11, 2015 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:

 Makes sense.

 Having high determinism in these tests would make Jenkins build stable.

 On Mon, May 11, 2015 at 1:08 PM, Andrew Or and...@databricks.com wrote:

 Hi Ted,

 Yes, those two options can be useful, but in general I think the standard
 to set is that tests should never fail. It's actually the worst if tests
 fail sometimes but not others, because we can't reproduce them
 deterministically. Using -M and -A actually tolerates flaky tests to a
 certain extent, and I would prefer to instead increase the determinism in
 these tests.

 -Andrew

 2015-05-08 17:56 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 Andrew:
 Do you think the -M and -A options described here can be used in test
 runs ?
 http://scalatest.org/user_guide/using_the_runner

 Cheers

 On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote:

 Dear all,

 I'm sure you have all noticed that the Spark tests have been fairly
 unstable recently. I wanted to share a tool that I use to track which
 tests
 have been failing most often in order to prioritize fixing these flaky
 tests.

 Here is an output of the tool. This spreadsheet reports the top 10
 failed
 tests this week (ending yesterday 5/5):

 https://docs.google.com/spreadsheets/d/1Iv_UDaTFGTMad1sOQ_s4ddWr6KD3PuFIHmTSzL7LSb4

 It is produced by a small project:
 https://github.com/andrewor14/spark-test-failures

 I have been filing JIRAs on flaky tests based on this tool. Hopefully we
 can collectively stabilize the build a little more as we near the
 release
 for Spark 1.4.

 -Andrew







Re: Recent Spark test failures

2015-05-15 Thread Ted Yu
From
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32831/consoleFull
:

[info] Building Spark with these arguments: -Pyarn -Phadoop-2.3
-Dhadoop.version=2.3.0 -Pkinesis-asl -Phive -Phive-thriftserver


Should PR builder cover hadoop 2.4 as well ?


Thanks


On Fri, May 15, 2015 at 9:23 AM, Ted Yu yuzhih...@gmail.com wrote:

 Jenkins build against hadoop 2.4 has been unstable recently:

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/

 I haven't found the test which hung / failed in recent Jenkins builds.

 But PR builder has several green builds lately:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

 Maybe PR builder doesn't build against hadoop 2.4 ?

 Cheers

 On Mon, May 11, 2015 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:

 Makes sense.

 Having high determinism in these tests would make Jenkins build stable.

 On Mon, May 11, 2015 at 1:08 PM, Andrew Or and...@databricks.com wrote:

 Hi Ted,

 Yes, those two options can be useful, but in general I think the
 standard to set is that tests should never fail. It's actually the worst if
 tests fail sometimes but not others, because we can't reproduce them
 deterministically. Using -M and -A actually tolerates flaky tests to a
 certain extent, and I would prefer to instead increase the determinism in
 these tests.

 -Andrew

 2015-05-08 17:56 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 Andrew:
 Do you think the -M and -A options described here can be used in test
 runs ?
 http://scalatest.org/user_guide/using_the_runner

 Cheers

 On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com
 wrote:

 Dear all,

 I'm sure you have all noticed that the Spark tests have been fairly
 unstable recently. I wanted to share a tool that I use to track which
 tests
 have been failing most often in order to prioritize fixing these flaky
 tests.

 Here is an output of the tool. This spreadsheet reports the top 10
 failed
 tests this week (ending yesterday 5/5):

 https://docs.google.com/spreadsheets/d/1Iv_UDaTFGTMad1sOQ_s4ddWr6KD3PuFIHmTSzL7LSb4

 It is produced by a small project:
 https://github.com/andrewor14/spark-test-failures

 I have been filing JIRAs on flaky tests based on this tool. Hopefully
 we
 can collectively stabilize the build a little more as we near the
 release
 for Spark 1.4.

 -Andrew








Re: Recent Spark test failures

2015-05-15 Thread Ted Yu
bq. would be prohibitive to build all configurations for every push

Agreed.

Can PR builder rotate testing against hadoop 2.3, 2.4, 2.6 and 2.7 (each
test run still uses one hadoop profile) ?

This way we would have some coverage for each of the major hadoop releases.

Cheers

On Fri, May 15, 2015 at 10:30 AM, Sean Owen so...@cloudera.com wrote:

 You all are looking only at the pull request builder. It just does one
 build to sanity-check a pull request, since that already takes 2 hours and
 would be prohibitive to build all configurations for every push. There is a
 different set of Jenkins jobs that periodically tests master against a lot
 more configurations, including Hadoop 2.4.

 On Fri, May 15, 2015 at 6:02 PM, Frederick R Reiss frre...@us.ibm.com
 wrote:

 The PR builder seems to be building against Hadoop 2.3. In the log for
 the most recent successful build (
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32805/consoleFull
 ) I see:

 =
 Building Spark
 =
 [info] Compile with Hive 0.13.1
 [info] Building Spark with these arguments: -Pyarn -Phadoop-2.3
 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive -Phive-thriftserver
 ...
 =
 Running Spark unit tests
 =
 [info] Running Spark tests with these arguments: -Pyarn -Phadoop-2.3
 -Dhadoop.version=2.3.0 -Pkinesis-asl test

 Is anyone testing individual pull requests against Hadoop 2.4 or 2.6
 before the code is declared clean?

 Fred


 From: Ted Yu yuzhih...@gmail.com
 To: Andrew Or and...@databricks.com
 Cc: dev@spark.apache.org dev@spark.apache.org
 Date: 05/15/2015 09:29 AM
 Subject: Re: Recent Spark test failures
 --



 Jenkins build against hadoop 2.4 has been unstable recently:

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/

 I haven't found the test which hung / failed in recent Jenkins builds.

 But PR builder has several green builds lately:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

 Maybe PR builder doesn't build against hadoop 2.4 ?

 Cheers

On Mon, May 11, 2015 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:

Makes sense.

Having high determinism in these tests would make Jenkins build
stable.


On Mon, May 11, 2015 at 1:08 PM, Andrew Or and...@databricks.com wrote:
   Hi Ted,

   Yes, those two options can be useful, but in general I think the
   standard to set is that tests should never fail. It's actually the 
 worst if
   tests fail sometimes but not others, because we can't reproduce them
   deterministically. Using -M and -A actually tolerates flaky tests to a
   certain extent, and I would prefer to instead increase the determinism 
 in
   these tests.

   -Andrew

    2015-05-08 17:56 GMT-07:00 Ted Yu yuzhih...@gmail.com:
   Andrew:
  Do you think the -M and -A options described here can be used
  in test runs ?
  *http://scalatest.org/user_guide/using_the_runner*
   http://scalatest.org/user_guide/using_the_runner
  Cheers

   On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote:
 Dear all,

 I'm sure you have all noticed that the Spark tests have been
 fairly
 unstable recently. I wanted to share a tool that I use to
 track which tests
 have been failing most often in order to prioritize fixing
 these flaky
 tests.

 Here is an output of the tool. This spreadsheet reports the
 top 10 failed
 tests this week (ending yesterday 5/5):

 
  https://docs.google.com/spreadsheets/d/1Iv_UDaTFGTMad1sOQ_s4ddWr6KD3PuFIHmTSzL7LSb4

 It is produced by a small project:
  https://github.com/andrewor14/spark-test-failures

 I have been filing JIRAs on flaky tests based on this tool

Re: how long does it takes for full build ?

2015-04-16 Thread Ted Yu
You can find the command at the beginning of the console output:

[centos] $ 
/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5/bin/mvn
-DHADOOP_PROFILE=hadoop-2.4 -Dlabel=centos -DskipTests -Phadoop-2.4
-Pyarn -Phive clean package


On Thu, Apr 16, 2015 at 12:42 PM, Sree V sree_at_ch...@yahoo.com wrote:

 1.
 40 min+ to 1hr+, from jenkins.
 I didn't find the commands of the job. Does it require a login ?

 Part of the console output:

   git checkout -f 3ae37b93a7c299bd8b22a36248035bca5de3422f
   git rev-list de4fa6b6d12e2bee0307ffba2abfca0c33f15e45 # timeout=10
 Triggering Spark-Master-Maven-pre-YARN ? 2.0.0-mr1-cdh4.1.2,centos 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/
 Triggering Spark-Master-Maven-pre-YARN ? 1.0.4,centos 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/

  How do I find the commands of these 'triggers'?
  I am interested in whether these named triggers use -DskipTests or not.

 2.
 This page, gives examples all with -DskipTests only.
 http://spark.apache.org/docs/1.2.0/building-spark.html


 3.
  For casting a VOTE to release 1.2.2-rc1,
  I am running 'mvn clean package' on spark 1.2.2-rc1 with oracle jdk8_40 on
  centos7.
  This has been stuck at the following since last night, i.e. almost 12 hours:
 ...
 ExternalSorterSuite:
 - empty data stream
 - few elements per partition
 - empty partitions with spilling
 - empty partitions with spilling, bypass merge-sort

 Any pointers ?

 Thanking you.

 With Regards
 Sree



   On Thursday, April 16, 2015 12:01 PM, Ted Yu yuzhih...@gmail.com
 wrote:


 You can get some idea by looking at the builds here:


 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/

 Cheers

 On Thu, Apr 16, 2015 at 11:56 AM, Sree V sree_at_ch...@yahoo.com.invalid
 wrote:

 Hi Team,
  How long does it take for a full build ('mvn clean package') on spark
  1.2.2-rc1 ?


 Thanking you.

 With Regards
 Sree







Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}

2015-04-17 Thread Ted Yu
The image didn't go through.

I think you were referring to:
  override def map[R: ClassTag](f: Row => R): RDD[R] = rdd.map(f)

Cheers
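The Scala-side call compiles because that ClassTag is supplied implicitly by the
compiler, which is exactly what plain Java cannot do. A sketch, reusing the
sqlContext and movies table from the example below:

    import org.apache.spark.rdd.RDD

    // In Scala the compiler fills in the ClassTag[String] for us:
    val names: RDD[String] = sqlContext.sql("select name from movies").map(_.getString(0))

    // From Java, one workaround at the time was to drop to an RDD first, e.g.
    // df.javaRDD().map(row -> row.getString(0)), since JavaRDD.map takes an
    // org.apache.spark.api.java.function.Function and needs no ClassTag at the call site.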

On Fri, Apr 17, 2015 at 6:07 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 I had an issue trying to use Spark SQL from Java (8 or 7), I tried to
 reproduce it in a small test case close to the actual documentation
 https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 so sorry for the long mail, but this is Java :

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.DataFrame;
 import org.apache.spark.sql.SQLContext;

 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;

 class Movie implements Serializable {
 private int id;
 private String name;

 public Movie(int id, String name) {
 this.id = id;
 this.name = name;
 }

 public int getId() {
 return id;
 }

 public void setId(int id) {
 this.id = id;
 }

 public String getName() {
 return name;
 }

 public void setName(String name) {
 this.name = name;
 }
 }

 public class SparkSQLTest {
 public static void main(String[] args) {
 SparkConf conf = new SparkConf();
 conf.setAppName(My Application);
 conf.setMaster(local);
 JavaSparkContext sc = new JavaSparkContext(conf);

 ArrayListMovie movieArrayList = new ArrayListMovie();
 movieArrayList.add(new Movie(1, Indiana Jones));

 JavaRDDMovie movies = sc.parallelize(movieArrayList);

 SQLContext sqlContext = new SQLContext(sc);
 DataFrame frame = sqlContext.applySchema(movies, Movie.class);
 frame.registerTempTable(movies);

 sqlContext.sql(select name from movies)

 *.map(row - row.getString(0)) // this is what i would expect 
 to work *.collect();
 }
 }


 But this does not compile, here's the compilation error :

 [ERROR]
 /Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/MainSQL.java:[37,47]
 method map in class org.apache.spark.sql.DataFrame cannot be applied to
 given types;
 [ERROR] required:
 scala.Function1<org.apache.spark.sql.Row,R>,scala.reflect.ClassTag<R>
 [ERROR] found: (row)->Na[...]ng(0)
 [ERROR] reason: cannot infer type-variable(s) R
 [ERROR] (actual and formal argument lists differ in length)
 [ERROR]
 /Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/SampleSHit.java:[56,17]
 method map in class org.apache.spark.sql.DataFrame cannot be applied to
 given types;
 [ERROR] required:
 scala.Function1<org.apache.spark.sql.Row,R>,scala.reflect.ClassTag<R>
 [ERROR] found: (row)->row[...]ng(0)
 [ERROR] reason: cannot infer type-variable(s) R
 [ERROR] (actual and formal argument lists differ in length)
 [ERROR] - [Help 1]

 Because in the DataFrame the map method is defined as:

 [image: Inline image 1]

 And once this is translated to bytecode the actual Java signature uses a
 Function1 and adds a ClassTag parameter.
 I can try to go around this and use the scala.reflect.ClassTag$ like that :

 ClassTag$.MODULE$.apply(String.class)

 To get the second ClassTag parameter right, but then instantiating a 
 java.util.Function or using the Java 8 lambdas fail to work, and if I try to 
 instantiate a proper scala Function1... well this is a world of pain.

 This is a regression introduced by the 1.3.x DataFrame, because JavaSchemaRDD 
 used to be a JavaRDDLike but DataFrames are not (and are not callable with 
 Java Functions). I can open a Jira if you want?

 Regards,

 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: wait time between start master and start slaves

2015-04-11 Thread Ted Yu
From SparkUI.scala :

  def getUIPort(conf: SparkConf): Int = {
    conf.getInt("spark.ui.port", SparkUI.DEFAULT_PORT)
  }
Better to retrieve the effective UI port before probing.
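
Note that getUIPort above reads the per-application UI port; the standalone
Master web UI listens on spark.master.ui.port (default 8080). A rough sketch of
such a probe, with the host, port and timeout values as assumptions:

  import java.net.{HttpURLConnection, URL}

  // block until the master web UI answers with HTTP 200, then start the slaves
  def waitForMaster(host: String, port: Int = 8080, timeoutMs: Long = 60000L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      try {
        val conn = new URL(s"http://$host:$port/").openConnection().asInstanceOf[HttpURLConnection]
        conn.setConnectTimeout(2000)
        conn.setReadTimeout(2000)
        if (conn.getResponseCode == 200) return true
      } catch {
        case _: java.io.IOException => // master not listening yet, retry
      }
      Thread.sleep(1000)
    }
    false
  }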

Cheers

On Sat, Apr 11, 2015 at 2:38 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 So basically, to tell if the master is ready to accept slaves, just poll
 http://master-node:4040 for an HTTP 200 response?
 ​

 On Sat, Apr 11, 2015 at 2:42 PM Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  Yeah from what I remember it was set defensively. I don't know of a good
  way to check if the master is up though. I guess we could poll the Master
  Web UI and see if we get a 200/ok response
 
  Shivaram
 
  On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  Check this out
  
 
 https://github.com/mesos/spark-ec2/blob/f0a48be1bb5aaeef508619a46065648beb8f1d92/spark-standalone/setup.sh#L26-L33
  
  (from spark-ec2):
 
  # Start Master$BIN_FOLDER/start-master.sh
 
 
  # Pause
  sleep 20
  # Start Workers$BIN_FOLDER/start-slaves.sh
 
  I know this was probably done defensively, but is there a more direct
 way
  to know when the master is ready?
 
  Nick
  ​
 
 



Re: Anyone facing problem in incremental building of individual project

2015-06-04 Thread Ted Yu
Andrew Or put in this workaround :

diff --git a/pom.xml b/pom.xml
index 0b1aaad..d03d33b 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1438,6 +1438,8 @@
         <version>2.3</version>
         <configuration>
           <shadedArtifactAttached>false</shadedArtifactAttached>
+          <!-- Work around MSHADE-148 -->
+          <createDependencyReducedPom>false</createDependencyReducedPom>
           <artifactSet>
             <includes>
               <!-- At a minimum we must include this to force effective pom generation -->

FYI

On Thu, Jun 4, 2015 at 6:25 AM, Steve Loughran ste...@hortonworks.com
wrote:


  On 4 Jun 2015, at 11:16, Meethu Mathew meethu.mat...@flytxt.com wrote:

  Hi all,

  I added some new code to MLlib. When I am trying to build only the
  mllib project using  mvn --projects mllib/ -DskipTests clean install
  after setting export SPARK_PREPEND_CLASSES=true,
  the build is getting stuck with the following message.



  Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar.
 [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar.
 [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar.
 [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar.
 [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
 jar.
 [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded
 jar.
 [INFO] Replacing original artifact with shaded artifact.
 [INFO] Replacing
 /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar
 with
 /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml

.



  I've seen something similar in a different build,

  It looks like MSHADE-148:
 https://issues.apache.org/jira/browse/MSHADE-148
 if you apply Tom White's patch, does your problem go away?



Re: [VOTE] Release Apache Spark 1.4.1

2015-06-26 Thread Ted Yu
I got the following when running test suite:

[INFO] compiler plugin:
BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[info] Compiling 2 Scala sources and 1 Java source to /home/hbase/spark-1.4.1/streaming/target/scala-2.10/test-classes...
[error] /home/hbase/spark-1.4.1/streaming/src/test/scala/org/apache/spark/streaming/DStreamClosureSuite.scala:82: not found: type TestException
[error]     throw new TestException(
[error]               ^
[error] /home/hbase/spark-1.4.1/streaming/src/test/scala/org/apache/spark/streaming/scheduler/JobGeneratorSuite.scala:73: not found: type TestReceiver
[error]   val inputStream = ssc.receiverStream(new TestReceiver)
[error]                                            ^
[error] two errors found
[error] Compile failed at Jun 25, 2015 5:12:24 PM [1.492s]

Has anyone else seen similar error ?

Thanks

On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 60e08e50751fe3929156de956d62faea79f5b801

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1118/
 [published as version: 1.4.1-rc1]
 https://repository.apache.org/content/repositories/orgapachespark-1119/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Saturday, June 27, at 06:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.4.1

2015-06-26 Thread Ted Yu
Pardon.
During earlier test run, I got:

StreamingContextSuite:
- from no conf constructor
- from no conf + spark home
- from no conf + spark home + env
- from conf with settings
- from existing SparkContext
- from existing SparkContext with settings
*** RUN ABORTED ***
  java.lang.NoSuchMethodError: org.apache.spark.ui.JettyUtils$.createStaticHandler(Ljava/lang/String;Ljava/lang/String;)Lorg/eclipse/jetty/servlet/ServletContextHandler;
  at org.apache.spark.streaming.ui.StreamingTab.attach(StreamingTab.scala:49)
  at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:601)
  at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:601)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:601)
  at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply$mcV$sp(StreamingContextSuite.scala:101)
  at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
  at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)

The error from the previous email was due to the absence
of StreamingContextSuite.scala

On Fri, Jun 26, 2015 at 1:27 PM, Ted Yu yuzhih...@gmail.com wrote:

 I got the following when running test suite:

 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
 [info] Compiling 2 Scala sources and 1 Java source to /home/hbase/spark-1.4.1/streaming/target/scala-2.10/test-classes...
 [error] /home/hbase/spark-1.4.1/streaming/src/test/scala/org/apache/spark/streaming/DStreamClosureSuite.scala:82: not found: type TestException
 [error]     throw new TestException(
 [error]               ^
 [error] /home/hbase/spark-1.4.1/streaming/src/test/scala/org/apache/spark/streaming/scheduler/JobGeneratorSuite.scala:73: not found: type TestReceiver
 [error]   val inputStream = ssc.receiverStream(new TestReceiver)
 [error]                                            ^
 [error] two errors found
 [error] Compile failed at Jun 25, 2015 5:12:24 PM [1.492s]

 Has anyone else seen similar error ?

 Thanks

 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 60e08e50751fe3929156de956d62faea79f5b801

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1118/
 [published as version: 1.4.1-rc1]
 https://repository.apache.org/content/repositories/orgapachespark-1119/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Saturday, June 27, at 06:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: problem with using mapPartitions

2015-05-30 Thread Ted Yu
bq. val result = fDB.mappartitions(testMP).collect

Not sure if you pasted the above code as-is - there is a typo: the method name
should be mapPartitions.
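
For reference, a minimal corrected sketch of the per-partition count (an
illustration only, assuming the intent described in your mail; it uses a plain
mutable Map instead of LongMap to keep the example self-contained):

  import org.apache.spark.{SparkConf, SparkContext}
  import scala.collection.mutable

  object MyTest {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("This is a test")
      val sc = new SparkContext(conf)

      val fDB = sc.parallelize(List(1, 2, 1, 2, 1, 2, 5, 5, 2), 3)
      // note the capital P in mapPartitions
      val result = fDB.mapPartitions(testMP).collect()
      println(result.mkString(" "))
      sc.stop()
    }

    // count occurrences of each value within a single partition
    def testMP(iter: Iterator[Int]): Iterator[(Long, Int)] = {
      val counts = mutable.Map[Long, Int]().withDefaultValue(0)
      iter.foreach(v => counts(v.toLong) += 1)
      counts.iterator
    }
  }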

Cheers

On Sat, May 30, 2015 at 9:44 AM, unioah uni...@gmail.com wrote:

 Hi,

 I am trying to aggregate the values within each partition internally.
 For example,

 Before:
 worker 1:           worker 2:
 1, 2, 1             2, 1, 2

 After:
 worker 1:           worker 2:
 (1->2), (2->1)      (1->1), (2->2)

 I try to use mappartitions,
 object MyTest {
   def main(args: Array[String]) {
 val conf = new SparkConf().setAppName("This is a test")
 val sc = new SparkContext(conf)

 val fDB = sc.parallelize(List(1, 2, 1, 2, 1, 2, 5, 5, 2), 3)
 val result = fDB.mappartitions(testMP).collect
 println(result.mkString)
 sc.stop
   }

   def testMP(iter: Iterator[Int]): Iterator[(Long, Int)] = {
 var result = new LongMap[Int]()
 var cur = 0l

 while (iter.hasNext) {
   cur = iter.next.toLong
   if (result.contains(cur)) {
 result(cur) += 1
   } else {
 result += (cur, 1)
   }
 }
 result.toList.iterator
   }
 }

 But I got the error message no matter how I tried.

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependent
 Stages(DAGScheduler.scala:1204)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/05/30 10:41:21 ERROR SparkDeploySchedulerBackend: Asked to remove
 non-existent executor 1

 Anybody can help me? Thx



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/problem-with-using-mapPartitions-tp12514.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




StreamingContextSuite fails with NoSuchMethodError

2015-05-29 Thread Ted Yu
Hi,
I ran the following command on 1.4.0 RC3:

mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive package

I saw the following failure:

StreamingContextSuite:
- from no conf constructor
- from no conf + spark home
- from no conf + spark home + env
- from conf with settings
- from existing SparkContext
- from existing SparkContext with settings
*** RUN ABORTED ***
  java.lang.NoSuchMethodError: org.apache.spark.ui.JettyUtils$.createStaticHandler(Ljava/lang/String;Ljava/lang/String;)Lorg/eclipse/jetty/servlet/ServletContextHandler;
  at org.apache.spark.streaming.ui.StreamingTab.attach(StreamingTab.scala:49)
  at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:585)
  at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:585)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:585)
  at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply$mcV$sp(StreamingContextSuite.scala:101)
  at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
  at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)

Did anyone else encounter similar error ?

Cheers


Re: StreamingContextSuite fails with NoSuchMethodError

2015-05-30 Thread Ted Yu
I downloaded the source tarball and first ran a command similar to the following:
clean package -DskipTests

Then I ran the command quoted below.

FYI



 On May 30, 2015, at 12:42 AM, Tathagata Das t...@databricks.com wrote:
 
 Was it a clean compilation? 
 
 TD
 
 On Fri, May 29, 2015 at 10:48 PM, Ted Yu yuzhih...@gmail.com wrote:
 Hi,
 I ran the following command on 1.4.0 RC3:
 
 mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive package
 
 I saw the following failure:
 
 StreamingContextSuite:
 - from no conf constructor
 - from no conf + spark home
 - from no conf + spark home + env
 - from conf with settings
 - from existing SparkContext
 - from existing SparkContext with settings
 *** RUN ABORTED ***
   java.lang.NoSuchMethodError: org.apache.spark.ui.JettyUtils$.createStaticHandler(Ljava/lang/String;Ljava/lang/String;)Lorg/eclipse/jetty/servlet/ServletContextHandler;
   at org.apache.spark.streaming.ui.StreamingTab.attach(StreamingTab.scala:49)
   at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:585)
   at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:585)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:585)
   at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply$mcV$sp(StreamingContextSuite.scala:101)
   at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
   at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 
 Did anyone else encounter similar error ?
 
 Cheers
 


Re: Can not build master

2015-07-03 Thread Ted Yu
Here is mine:

Apache Maven 3.3.1 (cab6659f9874fa96462afef40fcf6bc033d58c1c;
2015-03-13T13:10:27-07:00)
Maven home: /home/hbase/apache-maven-3.3.1
Java version: 1.8.0_45, vendor: Oracle Corporation
Java home: /home/hbase/jdk1.8.0_45/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 2.6.32-504.el6.x86_64, arch: amd64, family:
unix

On Fri, Jul 3, 2015 at 6:05 PM, Andrew Or and...@databricks.com wrote:

 @Tarek and Ted, what maven versions are you using?

 2015-07-03 17:35 GMT-07:00 Krishna Sankar ksanka...@gmail.com:

 Patrick,
I assume an RC3 will be out for folks like me to test the
 distribution. As usual, I will run the tests when you have a new
 distribution.
 Cheers
 k/

 On Fri, Jul 3, 2015 at 4:38 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Patch that added test-jar dependencies:
 https://github.com/apache/spark/commit/bfe74b34

 Patch that originally disabled dependency reduced poms:

 https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724

 Patch that reverted the disabling of dependency reduced poms:

 https://github.com/apache/spark/commit/bc51bcaea734fe64a90d007559e76f5ceebfea9e

 On Fri, Jul 3, 2015 at 4:36 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Okay I did some forensics with Sean Owen. Some things about this bug:
 
  1. The underlying cause is that we added some code to make the tests
  of sub modules depend on the core tests. For unknown reasons this
  causes Spark to hit MSHADE-148 for *some* combinations of build
  profiles.
 
  2. MSHADE-148 can be worked around by disabling building of
  dependency reduced poms because then the buggy code path is
  circumvented. Andrew Or did this in a patch on the 1.4 branch.
  However, that is not a tenable option for us because our *published*
  pom files require dependency reduction to substitute in the scala
  version correctly for the poms published to maven central.
 
  3. As a result, Andrew Or reverted his patch recently, causing some
  package builds to start failing again (but publishing works now).
 
  4. The reason this is not detected in our test harness or release
  build is that it is sensitive to the profiles enabled. The combination
  of profiles we enable in the test harness and release builds do not
  trigger this bug.
 
  The best path I see forward right now is to do the following:
 
  1. Disable creation of dependency reduced poms by default (this
  doesn't matter for people doing a package build) so typical users
  won't have this bug.
 
  2. Add a profile that re-enables that setting.
 
  3. Use the above profile when publishing release artifacts to maven
 central.
 
  4. Hope that we don't hit this bug for publishing.
 
  - Patrick
 
  On Fri, Jul 3, 2015 at 3:51 PM, Tarek Auel tarek.a...@gmail.com
 wrote:
  Doesn't change anything for me.
 
  On Fri, Jul 3, 2015 at 3:45 PM Patrick Wendell pwend...@gmail.com
 wrote:
 
  Can you try using the built in maven build/mvn...? All of our
 builds
  are passing on Jenkins so I wonder if it's a maven version issue:
 
  https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/
 
  - Patrick
 
  On Fri, Jul 3, 2015 at 3:14 PM, Ted Yu yuzhih...@gmail.com wrote:
   Please take a look at SPARK-8781
   (https://github.com/apache/spark/pull/7193)
  
   Cheers
  
   On Fri, Jul 3, 2015 at 3:05 PM, Tarek Auel tarek.a...@gmail.com
 wrote:
  
   I found a solution, there might be a better one.
  
   https://github.com/apache/spark/pull/7217
  
   On Fri, Jul 3, 2015 at 2:28 PM Robin East robin.e...@xense.co.uk
 
   wrote:
  
   Yes me too
  
   On 3 Jul 2015, at 22:21, Ted Yu yuzhih...@gmail.com wrote:
  
   This is what I got (the last line was repeated non-stop):
  
   [INFO] Replacing original artifact with shaded artifact.
   [INFO] Replacing
  
 /home/hbase/spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT.jar
   with
  
  
 /home/hbase/spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT-shaded.jar
   [INFO] Dependency-reduced POM written at:
   /home/hbase/spark/bagel/dependency-reduced-pom.xml
   [INFO] Dependency-reduced POM written at:
   /home/hbase/spark/bagel/dependency-reduced-pom.xml
  
   On Fri, Jul 3, 2015 at 1:13 PM, Tarek Auel tarek.a...@gmail.com
 
   wrote:
  
   Hi all,
  
    I am trying to build the master, but it gets stuck and prints
  
   [INFO] Dependency-reduced POM written at:
   /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml
  
   build command:  mvn -DskipTests clean package
  
   Do others have the same issue?
  
   Regards,
   Tarek
  
  
  
  

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: Can not build master

2015-07-03 Thread Ted Yu
This is what I got (the last line was repeated non-stop):

[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing
/home/hbase/spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT.jar with
/home/hbase/spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT-shaded.jar
[INFO] Dependency-reduced POM written at:
/home/hbase/spark/bagel/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at:
/home/hbase/spark/bagel/dependency-reduced-pom.xml

On Fri, Jul 3, 2015 at 1:13 PM, Tarek Auel tarek.a...@gmail.com wrote:

 Hi all,

 I am trying to build the master, but it gets stuck and prints

 [INFO] Dependency-reduced POM written at:
 /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml

 build command:  mvn -DskipTests clean package

 Do others have the same issue?

 Regards,
 Tarek



Re: [VOTE] Release Apache Spark 1.4.1 (RC2)

2015-07-03 Thread Ted Yu
Patrick:
I used the following command:
~/apache-maven-3.3.1/bin/mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean
package

The build doesn't seem to stop.
Here is tail of build output:

[INFO] Dependency-reduced POM written at:
/home/hbase/spark-1.4.1/bagel/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at:
/home/hbase/spark-1.4.1/bagel/dependency-reduced-pom.xml

Here is part of the stack trace for the build process:

http://pastebin.com/xL2Y0QMU

FYI

On Fri, Jul 3, 2015 at 1:15 PM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc2 (commit 07b95c7):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 07b95c7adf88f0662b7ab1c47e302ff5e6859606

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1120/
 [published as version: 1.4.1-rc2]
 https://repository.apache.org/content/repositories/orgapachespark-1121/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Monday, July 06, at 22:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.4.1

2015-06-29 Thread Ted Yu
Here is the command I used:
mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive package

Java: 1.8.0_45

OS:
Linux x.com 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014
x86_64 x86_64 x86_64 GNU/Linux

Cheers

On Mon, Jun 29, 2015 at 12:04 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 @Ted, could you elaborate more on what was the test command that you ran?
 What profiles, using SBT or Maven?

 TD

 On Sun, Jun 28, 2015 at 12:21 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Krishna - this is still the current release candidate.

 - Patrick

 On Sun, Jun 28, 2015 at 12:14 PM, Krishna Sankar ksanka...@gmail.com
 wrote:
  Patrick,
 Haven't seen any replies on test results. I will byte ;o) - Should I
 test
  this version or is another one in the wings ?
  Cheers
  k/
 
  On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Please vote on releasing the following candidate as Apache Spark
 version
  1.4.1!
 
  This release fixes a handful of known issues in Spark 1.4.0, listed
 here:
  http://s.apache.org/spark-1.4.1
 
  The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  60e08e50751fe3929156de956d62faea79f5b801
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.1]
 
 https://repository.apache.org/content/repositories/orgapachespark-1118/
  [published as version: 1.4.1-rc1]
 
 https://repository.apache.org/content/repositories/orgapachespark-1119/
 
  The documentation corresponding to this release can be found at:
 
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.1!
 
  The vote is open until Saturday, June 27, at 06:32 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Spark 1.5.0-SNAPSHOT broken with Scala 2.11

2015-06-28 Thread Ted Yu
Spark-Master-Scala211-Compile build is green.

However it is not clear what the actual command is:

[EnvInject] - Variables injected successfully.
[Spark-Master-Scala211-Compile] $ /bin/bash /tmp/hudson8945334776362889961.sh


FYI


On Sun, Jun 28, 2015 at 6:02 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 I am building the current master branch with Scala 2.11 following these
 instructions:

 Building for Scala 2.11

 To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11
  property:

 dev/change-version-to-2.11.sh
 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package


 Here's what I'm seeing:

 log4j:WARN No appenders could be found for logger
 (org.apache.hadoop.security.Groups).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
 more info.
 Using Spark's repl log4j profile:
 org/apache/spark/log4j-defaults-repl.properties
 To adjust logging level use sc.setLogLevel(INFO)
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
       /_/

 Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_79)
 Type in expressions to have them evaluated.
 Type :help for more information.
 15/06/29 00:42:20 ERROR ActorSystemImpl: Uncaught fatal error from thread
 [sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down
 ActorSystem [sparkDriver]
 java.lang.VerifyError: class akka.remote.WireFormats$AkkaControlMessage
 overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at
 akka.remote.transport.AkkaPduProtobufCodec$.constructControlMessagePdu(AkkaPduCodec.scala:231)
 at
 akka.remote.transport.AkkaPduProtobufCodec$.init(AkkaPduCodec.scala:153)
 at akka.remote.transport.AkkaPduProtobufCodec$.clinit(AkkaPduCodec.scala)
 at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:733)
 at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:703)

 What am I doing wrong?




Re: Kryo option changed

2015-05-24 Thread Ted Yu
Please update to the following:

commit c2f0821aad3b82dcd327e914c9b297e92526649d
Author: Zhang, Liye liye.zh...@intel.com
Date:   Fri May 8 09:10:58 2015 +0100

[SPARK-7392] [CORE] bugfix: Kryo buffer size cannot be larger than 2M
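
Until you pick up that fix, a minimal sketch of the work-around is to spell the
size in kilobytes (the app name below is just a placeholder):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kryo-buffer-example")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryoserializer.buffer", "8192k")  // 8 MB, expressed in kb to sidestep the SPARK-7392 parsing bug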

On Sun, May 24, 2015 at 8:04 AM, Debasish Das debasish.da...@gmail.com
wrote:

 I am May 3rd commit:

 commit 49549d5a1a867c3ba25f5e4aec351d4102444bc0

 Author: WangTaoTheTonic wangtao...@huawei.com

 Date:   Sun May 3 00:47:47 2015 +0100


 [SPARK-7031] [THRIFTSERVER] let thrift server take SPARK_DAEMON_MEMORY
 and SPARK_DAEMON_JAVA_OPTS

 On Sat, May 23, 2015 at 7:54 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which commit of master are you building off?  It looks like there was a
 bugfix for an issue related to KryoSerializer buffer configuration:
 https://github.com/apache/spark/pull/5934

 That patch was committed two weeks ago, but you mentioned that you're
 building off a newer version of master.  Could you confirm the commit that
 you're running?  If this used to work but now throws an error, then this is
 a regression that should be fixed; we shouldn't require you to perform a mb
 - kb conversion to work around this.

 On Sat, May 23, 2015 at 6:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 Pardon me.

 Please use '8192k'

 Cheers

 On Sat, May 23, 2015 at 6:24 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Tried 8mb...still I am failing on the same error...

 On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. it shuld be 8mb

 Please use the above syntax.

 Cheers

 On Sat, May 23, 2015 at 6:04 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi,

 I am on last week's master but all the examples that set up the
 following

 .set("spark.kryoserializer.buffer", "8m")

 are failing with the following error:

 Exception in thread "main" java.lang.IllegalArgumentException:
 spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
 looks like buffer.mb is deprecated...Is 8m is not the right syntax
 to get 8mb kryo buffer or it shuld be 8mb

 Thanks.
 Deb









Re: Kryo option changed

2015-05-23 Thread Ted Yu
bq. it shuld be 8mb

Please use the above syntax.

Cheers

On Sat, May 23, 2015 at 6:04 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi,

 I am on last week's master but all the examples that set up the following

 .set("spark.kryoserializer.buffer", "8m")

 are failing with the following error:

 Exception in thread "main" java.lang.IllegalArgumentException:
 spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
 looks like buffer.mb is deprecated...Is 8m is not the right syntax to
 get 8mb kryo buffer or it shuld be 8mb

 Thanks.
 Deb



Re: Kryo option changed

2015-05-23 Thread Ted Yu
Pardon me.

Please use '8192k'

Cheers

On Sat, May 23, 2015 at 6:24 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Tried 8mb...still I am failing on the same error...

 On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. it shuld be 8mb

 Please use the above syntax.

 Cheers

 On Sat, May 23, 2015 at 6:04 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 I am on last week's master but all the examples that set up the following

 .set("spark.kryoserializer.buffer", "8m")

 are failing with the following error:

 Exception in thread "main" java.lang.IllegalArgumentException:
 spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
 looks like buffer.mb is deprecated...Is 8m is not the right syntax to
 get 8mb kryo buffer or it shuld be 8mb

 Thanks.
 Deb






Re: [IMPORTANT] Committers please update merge script

2015-05-23 Thread Ted Yu
INFRA-9646 has been resolved.

FYI

On Wed, May 13, 2015 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi All - unfortunately the fix introduced another bug, which is that
 fixVersion was not updated properly. I've updated the script and had
 one other person test it.

 So committers please pull from master again thanks!

 - Patrick

 On Tue, May 12, 2015 at 6:25 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Due to an ASF infrastructure change (bug?) [1] the default JIRA
  resolution status has switched to Pending Closed. I've made a change
  to our merge script to coerce the correct status of Fixed when
  resolving [2]. Please upgrade the merge script to master.
 
  I've manually corrected JIRA's that were closed with the incorrect
  status. Let me know if you have any issues.
 
  [1] https://issues.apache.org/jira/browse/INFRA-9646
 
  [2]
 https://github.com/apache/spark/commit/1b9e434b6c19f23a01e9875a3c1966cd03ce8e2d

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Unable to build from assembly

2015-05-22 Thread Ted Yu
What version of Java do you use ?

Can you run this command first ?
build/sbt clean

BTW please see [SPARK-7498] [MLLIB] add varargs back to setDefault

Cheers

On Fri, May 22, 2015 at 7:34 AM, Manoj Kumar manojkumarsivaraj...@gmail.com
 wrote:

 Hello,

 I updated my master from upstream recently, and on running

 build/sbt assembly

 it gives me this error

 [error]
 /home/manoj/spark/examples/src/main/java/org/apache/spark/examples/ml/JavaDeveloperApiExample.java:106:
 error: MyJavaLogisticRegression is not abstract and does not override
 abstract method setDefault(ParamPair?...) in Params
 [error] class MyJavaLogisticRegression
 [error] ^
 [error]
 /home/manoj/spark/examples/src/main/java/org/apache/spark/examples/ml/JavaDeveloperApiExample.java:168:
 error: MyJavaLogisticRegressionModel is not abstract and does not override
 abstract method setDefault(ParamPair?...) in Params
 [error] class MyJavaLogisticRegressionModel
 [error] ^
 [error] 2 errors
 [error] (examples/compile:compile) javac returned nonzero exit code

 It was working fine before this.

 Could someone please guide me on what could be wrong?



 --
 Godspeed,
 Manoj Kumar,
 http://manojbits.wordpress.com
 http://goog_1017110195
 http://github.com/MechCoder



Re: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
Yan:
Where can I find performance numbers for Astro (it's close to middle of
August) ?

Cheers

On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Finally I can take a look at HBASE-14181 now. Unfortunately there is no
 design doc mentioned. Superficially it is very similar to Astro, with the
 difference that

 this is part of the HBase client library, while Astro works as a Spark
 package and so will evolve and function more closely with Spark SQL/DataFrame
 instead of HBase.



 In terms of architecture, my take is loosely-coupled query engines on top
 of KV store vs. an array of query engines supported by, and packaged as
 part of, a KV store.



 Functionality-wise the two could be close but Astro also supports Python
 as a result of tight integration with Spark.

 It will be interesting to see performance comparisons when HBase-14181 is
 ready.



 Thanks,





 From: Ted Yu [mailto:yuzhih...@gmail.com]
 Sent: Tuesday, August 11, 2015 3:28 PM
 To: Yan Zhou.sc
 Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
 Subject: Re: Re: Package Release Announcement: Spark SQL on HBase Astro



 HBase will not have query engine.



 It will provide better support to query engines.



 Cheers


 On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Ted,



 I’m in China now, and seem to be having difficulty accessing Apache Jira.
 Anyways, it appears to me  that HBASE-14181
 https://issues.apache.org/jira/browse/HBASE-14181 attempts to support
 Spark DataFrame inside HBase.

 If true, one question to me is whether HBase is intended to have a
 built-in query engine or not, or whether it will stick with the current approach as

 a k-v store with some built-in processing capabilities in the form of
 coprocessors, custom filters, etc., which allows for loosely-coupled query
 engines

 built on top of it.



 Thanks,



 From: Ted Yu [mailto:yuzhih...@gmail.com]
 Sent: August 11, 2015 8:54
 To: Bing Xiao (Bing)
 Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
 Subject: Re: Package Release Announcement: Spark SQL on HBase Astro



 Yan / Bing:

 Mind taking a look at HBASE-14181
 https://issues.apache.org/jira/browse/HBASE-14181 'Add Spark DataFrame
 DataSource to HBase-Spark Module' ?



 Thanks



 On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com
 wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by Huawei team and community contributors covered the areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; customer filtering; DML; complex filtering on keys
 and non-keys; Join/union with non-Hbase data; Data Frame; multi-column
 family test.  We will post the test results including performance tests the
 middle of August.

 You are very welcome to try out or deploy the package, and to help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union tests and extensive performance tests.
 Please use the “Issues” and “Pull Requests” links at this package homepage if
 you want to report bugs, improvements or feature requests.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package, will be useful for ultra low
 latency query and analytics of large scale data sets in vertical
 enterprises. We will continue to work with the community to develop
 new features and improve the code base.  Your comments and suggestions are
 greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team








Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-21 Thread Ted Yu
I pointed hbase-spark module (in HBase project) to 1.5.0-rc1 and was able
to build the module (with proper maven repo).

FYI

On Fri, Aug 21, 2015 at 2:17 PM, mkhaitman mark.khait...@chango.com wrote:

 Just a heads up that this RC1 release is still appearing as
 1.5.0-SNAPSHOT
 (Not just me right..?)



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC1-tp13780p13792.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Ted Yu
See this thread:

http://search-hadoop.com/m/q3RTtdZv0d1btRHl/Spark+build+modulesubj=Building+Spark+Building+just+one+module+



 On Aug 19, 2015, at 1:44 AM, canan chen ccn...@gmail.com wrote:
 
 I want to work on one jira, but it is not easy to unit test, because it 
 involves different components, especially the UI. Spark builds are pretty slow, and I 
 don't want to rebuild each time to test my code change. I am wondering how 
 other people do this ? Is there any experience you can share ? Thanks
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.4.1

2015-06-29 Thread Ted Yu
The test passes when run alone on my machine as well.

Please run test suite.

Thanks

On Mon, Jun 29, 2015 at 2:01 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 @Ted, I ran the following two commands.

 mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive -DskipTests clean
 package
 mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive
 -DwildcardSuites=org.apache.spark.streaming.StreamingContextSuite test

 Using Java version 1.7.0_51, the tests passed normally.



 On Mon, Jun 29, 2015 at 1:05 PM, Krishna Sankar ksanka...@gmail.com
 wrote:

 +1 (non-binding, of course)

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:26 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql(SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID) OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA') OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK

 Cheers
 k/

 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 60e08e50751fe3929156de956d62faea79f5b801

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1118/
 [published as version: 1.4.1-rc1]
 https://repository.apache.org/content/repositories/orgapachespark-1119/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Saturday, June 27, at 06:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: [VOTE] Release Apache Spark 1.4.1

2015-06-29 Thread Ted Yu
Andrew:
I agree with your assessment.

Cheers

On Mon, Jun 29, 2015 at 3:33 PM, Andrew Or and...@databricks.com wrote:

 Hi Ted,

 We haven't observed a StreamingContextSuite failure on our test
 infrastructure recently. Given that we cannot reproduce it even locally it
 is unlikely that this uncovers a real bug. Even if it does I would not
 block the release on it because many in the community are waiting for a few
 important fixes. In general, there will always be outstanding issues in
 Spark that we cannot address in every release.

 -Andrew

 2015-06-29 14:29 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 The test passes when run alone on my machine as well.

 Please run test suite.

 Thanks

 On Mon, Jun 29, 2015 at 2:01 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 @Ted, I ran the following two commands.

 mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive -DskipTests clean
 package
 mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive
 -DwildcardSuites=org.apache.spark.streaming.StreamingContextSuite test

 Using Java version 1.7.0_51, the tests passed normally.



 On Mon, Jun 29, 2015 at 1:05 PM, Krishna Sankar ksanka...@gmail.com
 wrote:

 +1 (non-binding, of course)

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:26 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word
 count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql(SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID) OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA')
 OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK

 Cheers
 k/

 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark
 version 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed
 here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 60e08e50751fe3929156de956d62faea79f5b801

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1118/
 [published as version: 1.4.1-rc1]
 https://repository.apache.org/content/repositories/orgapachespark-1119/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Saturday, June 27, at 06:32 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org








Re: add to user list

2015-07-30 Thread Ted Yu
Please take a look at the first section of:
https://spark.apache.org/community

On Thu, Jul 30, 2015 at 9:23 PM, Sachin Aggarwal different.sac...@gmail.com
 wrote:



 --

 Thanks  Regards

 Sachin Aggarwal
 7760502772



Re: High availability with zookeeper: worker discovery

2015-07-30 Thread Ted Yu
zookeeper is not a direct dependency of Spark.

Can you give a bit more detail on how the election / discovery of master
works ?

Cheers

On Thu, Jul 30, 2015 at 7:41 PM, Christophe Schmitz cofcof...@gmail.com
wrote:

 Hi there,

 I am trying to run a 3 node spark cluster where each node contains a
 spark worker and a spark master. Election of the master happens via
 zookeeper.

 The way I am configuring it is by (on each node) giving the IP:PORT of the
 local master to the local worker, and I wish the worker could auto-discover
 the elected master automatically.

 But unfortunately, only the local worker of the elected master registered
 with the elected master. Why aren't the other workers getting to connect to
 the elected master?

 The interesting thing is that if I kill the elected master and wait a bit,
 then the newly elected master sees all the workers!

 I am wondering if I am missing something to make this happens without
 having to kill the elected master.

 Thanks!


 PS: I am on spark 1.2.2



