Mailing list schizophrenia?

2015-03-20 Thread Jim Kleckner
I notice that some people send messages directly to user@spark.apache.org
and some via Nabble, using either email or the web client.

There are two index sites, one directly at apache.org and one at Nabble.
But messages sent directly to user@spark.apache.org only show up in the
Apache archive.  Further, it appears that you can subscribe either directly
to user@spark.apache.org, in which case you see all emails, or via Nabble,
in which case you see only a subset.

Is this correct and is it intentional?

Apache site:
  http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/browser

Nabble site:
  http://apache-spark-user-list.1001560.n3.nabble.com/

An example of a message that only shows up in Apache:

http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAGK53LnsD59wwQrP3-9yHc38C4eevAfMbV2so%2B_wi8k0%2Btq5HQ%40mail.gmail.com%3E


This message was sent both to Nabble and user@spark.apache.org to see how
that behaves.

Jim


Re: Mailing list schizophrenia?

2015-03-20 Thread Jim Kleckner
Yes, it did get delivered to the Apache list, shown here:

http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAGK53LnsD59wwQrP3-9yHc38C4eevAfMbV2so%2B_wi8k0%2Btq5HQ%40mail.gmail.com%3E

But the Spark community web site directs people to Nabble for viewing
messages, and the message doesn't show up there.

Community page: http://spark.apache.org/community.html

Link in that page to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/

The reliable archive:
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/browser



On Fri, Mar 20, 2015 at 12:34 PM, Ted Yu yuzhih...@gmail.com wrote:

 Jim:
 I can find the example message here:
 http://search-hadoop.com/m/JW1q5zP54J1

 On Fri, Mar 20, 2015 at 12:29 PM, Jim Kleckner j...@cloudphysics.com
 wrote:

 I notice that some people send messages directly to user@spark.apache.org
 and some via nabble, either using email or the web client.

 There are two index sites, one directly at apache.org and one at
 nabble.  But messages sent directly to user@spark.apache.org only show
 up in the apache list.  Further, it appears that you can subscribe either
 directly to user@spark.apache.org, in which you see all emails, or via
 nabble and you see a subset.

 Is this correct and is it intentional?

 Apache site:
   http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/browser

 Nabble site:
   http://apache-spark-user-list.1001560.n3.nabble.com/

 An example of a message that only shows up in Apache:

 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAGK53LnsD59wwQrP3-9yHc38C4eevAfMbV2so%2B_wi8k0%2Btq5HQ%40mail.gmail.com%3E


 This message was sent both to Nabble and user@spark.apache.org to see
 how that behaves.

 Jim





Reliable method/tips to solve dependency issues?

2015-03-19 Thread Jim Kleckner
Do people have a reliable/repeatable method for solving dependency issues
or tips?

The current world of Spark/Hadoop/HBase/Parquet/... is very challenging
given the huge footprint of dependent packages, and we may be pushing
against the limits of how many packages can be combined into one
environment...

The process of searching the web to pick at incompatibilities one at a time
is at best tedious and at worst non-converging.

It makes me wonder if there is (or ought to be) a page cataloging in one
place the conflicts that Spark users have hit and what was done to solve
them.

Eugene Yokota wrote an interesting blog post about dependency management in
sbt 0.13.7 that includes nice improvements for working with dependencies:
  https://typesafe.com/blog/improved-dependency-management-with-sbt-0137

After reading that, I refreshed on the sbt documentation and found the
"show update" command, which gives very extensive information.
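For anyone following along, these are the commands I mean (a sketch against
sbt 0.13.7; the exact report format varies by sbt version):

```shell
# Full resolution report: every configuration, every resolved module,
# with evicted versions called out.
sbt "show update"

# Just the modules whose requested versions were evicted by conflict
# resolution (available in recent 0.13.x releases).
sbt evicted
```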

For reference, there was an extensive discussion thread about sbt and maven
last year that touches on a lot of topics:

http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3ccabpqxsukhd4qsf5dg9ruhn7wvonxfm+y5b1k5d8g7h6s9bh...@mail.gmail.com%3E


Re: Spark excludes fastutil dependencies we need

2015-02-27 Thread Jim Kleckner
Yes, I used both.

The discussion on this seems to be at github now:
  https://github.com/apache/spark/pull/4780

I am using more classes from the same package from which Spark uses
HyperLogLog.  So we are both including the jar file, but Spark is excluding
the dependent package that is required.
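For what it's worth, the application-side workaround is to declare fastutil
directly in our own build so it lands in our assembly regardless of Spark's
exclusion (a sketch; the version number below is an assumption, pick the one
your clearspring release was compiled against).  Note this alone doesn't
settle which copy the class loader picks, which is the rest of this thread:

```scala
// build.sbt fragment (hypothetical version number):
// re-add the dependency that Spark's pom excludes.
libraryDependencies += "it.unimi.dsi" % "fastutil" % "6.5.7"
```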


On Thu, Feb 26, 2015 at 9:54 AM, Marcelo Vanzin van...@cloudera.com wrote:

 On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com
 wrote:
  So, should the userClassPathFirst flag work and there is a bug?

 Sorry for jumping in the middle of conversation (and probably missing
 some of it), but note that this option applies only to executors. If
 you're trying to use the class in your driver, there's a separate
 option for that.

 Also to note is that if you're adding a class that doesn't exist
 inside the Spark jars, which seems to be the case, this option should
 be irrelevant, since the class loaders should all end up finding the
 one copy of the class that you're adding with your app.

 --
 Marcelo





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Spark-excludes-fastutil-dependencies-we-need-tp21849.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Fwd: Spark excludes fastutil dependencies we need

2015-02-25 Thread Jim Kleckner
Forwarding conversation below that didn't make it to the list.

-- Forwarded message --
From: Jim Kleckner j...@cloudphysics.com
Date: Wed, Feb 25, 2015 at 8:42 PM
Subject: Re: Spark excludes fastutil dependencies we need
To: Ted Yu yuzhih...@gmail.com
Cc: Sean Owen so...@cloudera.com, user user@spark.apache.org


Inline

On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote:

 Interesting. Looking at SparkConf.scala :

 val configs = Seq(
   DeprecatedConfig("spark.files.userClassPathFirst",
     "spark.executor.userClassPathFirst", "1.3"),
   DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3",
     "Use spark.{driver,executor}.userClassPathFirst instead."))

 It seems spark.files.userClassPathFirst and spark.yarn.user.classpath.first
 are deprecated.


Note that I did use the non-deprecated version,
spark.executor.userClassPathFirst=true.



 On Wed, Feb 25, 2015 at 12:39 AM, Sean Owen so...@cloudera.com wrote:

 No, we should not add fastutil back. It's up to the app to bring
 dependencies it needs, and that's how I understand this issue. The
 question is really, how to get the classloader visibility right. It
 depends on where you need these classes. Have you looked into
 spark.files.userClassPathFirst and spark.yarn.user.classpath.first ?



I noted that I tried this in my original email.

The issue appears to be related to the fact that parquet is also creating a
shaded jar, and that one leaves out the Long2LongOpenHashMap class.

FYI, I have subsequently tried removing the exclusion from the Spark build,
and that does cause the fastutil classes to be included; the example
works...

So, should the userClassPathFirst flag work and there is a bug?

Or is it reasonable to put in a pull request for the elimination of the
exclusion?




 On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu yuzhih...@gmail.com wrote:
  bq. depend on missing fastutil classes like Long2LongOpenHashMap
 
  Looks like Long2LongOpenHashMap should be added to the shaded jar.
 
  Cheers

 
  On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com
 wrote:
 
  Spark includes the clearspring analytics package but intentionally
  excludes the dependencies of the fastutil package (see below).

  Spark includes parquet-column, which includes fastutil and relocates it
  under parquet/, but creates a shaded jar file which is incomplete because
  it shades out some of the fastutil classes, notably Long2LongOpenHashMap,
  which is present in the fastutil jar file that parquet-column is
  referencing.

  We are using more of the clearspring classes (e.g. QDigest) and those do
  depend on missing fastutil classes like Long2LongOpenHashMap.

  Even though I add them to our assembly jar file, the class loader finds
  the spark assembly and we get runtime class loader errors when we try to
  use it.
 
  It is possible to put our jar file first, as described here:
https://issues.apache.org/jira/browse/SPARK-939
 
 
 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment
 
  which I tried with args to spark-submit:
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true
  but we still get the class not found error.
 
  We have tried copying the source code for clearspring into our own
  package and renaming it, and that makes it appear to work...  Is this
  risky?  It certainly is ugly.
 
  Can anyone recommend a way to deal with this dependency hell?
 
 
  === The spark/pom.xml file contains the following lines:

    <dependency>
      <groupId>com.clearspring.analytics</groupId>
      <artifactId>stream</artifactId>
      <version>2.7.0</version>
      <exclusions>
        <exclusion>
          <groupId>it.unimi.dsi</groupId>
          <artifactId>fastutil</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

  === The parquet-column/pom.xml file contains:

    <artifactId>maven-shade-plugin</artifactId>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <minimizeJar>true</minimizeJar>
          <artifactSet>
            <includes>
              <include>it.unimi.dsi:fastutil</include>
            </includes>
          </artifactSet>
          <relocations>
            <relocation>
              <pattern>it.unimi.dsi</pattern>
              <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com

Re: Fwd: Spark excludes fastutil dependencies we need

2015-02-25 Thread Jim Kleckner
I created an issue and pull request.

Discussion can continue there:
https://issues.apache.org/jira/browse/SPARK-6029



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Spark-excludes-fastutil-dependencies-we-need-tp21812p21814.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Jim Kleckner
Inline

On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote:

 Interesting. Looking at SparkConf.scala :

 val configs = Seq(
   DeprecatedConfig("spark.files.userClassPathFirst",
     "spark.executor.userClassPathFirst", "1.3"),
   DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3",
     "Use spark.{driver,executor}.userClassPathFirst instead."))

 It seems spark.files.userClassPathFirst and spark.yarn.user.classpath.first
 are deprecated.


Note that I did use the non-deprecated version,
spark.executor.userClassPathFirst=true.



 On Wed, Feb 25, 2015 at 12:39 AM, Sean Owen so...@cloudera.com wrote:

 No, we should not add fastutil back. It's up to the app to bring
 dependencies it needs, and that's how I understand this issue. The
 question is really, how to get the classloader visibility right. It
 depends on where you need these classes. Have you looked into
 spark.files.userClassPathFirst and spark.yarn.user.classpath.first ?



I noted that I tried this in my original email.

The issue appears to be related to the fact that parquet is also creating a
shaded jar, and that one leaves out the Long2LongOpenHashMap class.

FYI, I have subsequently tried removing the exclusion from the Spark build,
and that does cause the fastutil classes to be included; the example
works...

So, should the userClassPathFirst flag work and there is a bug?

Or is it reasonable to put in a pull request for the elimination of the
exclusion?




 On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu yuzhih...@gmail.com wrote:
  bq. depend on missing fastutil classes like Long2LongOpenHashMap
 
  Looks like Long2LongOpenHashMap should be added to the shaded jar.
 
  Cheers

 
  On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com
 wrote:
 
  Spark includes the clearspring analytics package but intentionally
  excludes the dependencies of the fastutil package (see below).

  Spark includes parquet-column, which includes fastutil and relocates it
  under parquet/, but creates a shaded jar file which is incomplete because
  it shades out some of the fastutil classes, notably Long2LongOpenHashMap,
  which is present in the fastutil jar file that parquet-column is
  referencing.

  We are using more of the clearspring classes (e.g. QDigest) and those do
  depend on missing fastutil classes like Long2LongOpenHashMap.

  Even though I add them to our assembly jar file, the class loader finds
  the spark assembly and we get runtime class loader errors when we try to
  use it.
 
  It is possible to put our jar file first, as described here:
https://issues.apache.org/jira/browse/SPARK-939
 
 
 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment
 
  which I tried with args to spark-submit:
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true
  but we still get the class not found error.
 
  We have tried copying the source code for clearspring into our own
  package and renaming it, and that makes it appear to work...  Is this
  risky?  It certainly is ugly.
 
  Can anyone recommend a way to deal with this dependency hell?
 
 
  === The spark/pom.xml file contains the following lines:

    <dependency>
      <groupId>com.clearspring.analytics</groupId>
      <artifactId>stream</artifactId>
      <version>2.7.0</version>
      <exclusions>
        <exclusion>
          <groupId>it.unimi.dsi</groupId>
          <artifactId>fastutil</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

  === The parquet-column/pom.xml file contains:

    <artifactId>maven-shade-plugin</artifactId>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <minimizeJar>true</minimizeJar>
          <artifactSet>
            <includes>
              <include>it.unimi.dsi:fastutil</include>
            </includes>
          </artifactSet>
          <relocations>
            <relocation>
              <pattern>it.unimi.dsi</pattern>
              <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
 
 





Spark excludes fastutil dependencies we need

2015-02-24 Thread Jim Kleckner
Spark includes the clearspring analytics package but intentionally excludes
the dependencies of the fastutil package (see below).

Spark includes parquet-column, which includes fastutil and relocates it
under parquet/, but creates a shaded jar file which is incomplete because it
shades out some of the fastutil classes, notably Long2LongOpenHashMap, which
is present in the fastutil jar file that parquet-column is referencing.

We are using more of the clearspring classes (e.g. QDigest), and those do
depend on missing fastutil classes like Long2LongOpenHashMap.

Even though I add them to our assembly jar file, the class loader finds the
Spark assembly and we get runtime class loader errors when we try to use
them.
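When debugging this kind of thing, it helps to ask the JVM directly which
jar a class was loaded from rather than guessing.  A minimal sketch (plain
Scala, no Spark required; the fastutil class name in the comment is the one
from our errors):

```scala
object WhichJar {
  // Returns the location (jar or directory) a class was loaded from,
  // or a marker string for JDK bootstrap classes, which have no code source.
  def jarOf(className: String): String =
    Option(Class.forName(className).getProtectionDomain.getCodeSource) match {
      case Some(src) => src.getLocation.toString
      case None      => "<bootstrap classloader>"
    }

  def main(args: Array[String]): Unit = {
    println(jarOf("java.lang.String"))  // a JDK class: no code source
    // In the Spark driver or executors you would check, e.g.:
    // println(jarOf("it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap"))
  }
}
```

Running that inside the driver (or in a small job on the executors) shows
whether the Spark assembly or our application jar won the lookup.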

It is possible to put our jar file first, as described here:
  https://issues.apache.org/jira/browse/SPARK-939
  http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment

which I tried with args to spark-submit:
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true
but we still get the class not found error.

We have tried copying the source code for clearspring into our own package
and renaming it, and that makes it appear to work...  Is this risky?  It
certainly is ugly.
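A less ugly variant of the same idea, if your sbt-assembly version supports
shading rules (an assumption; older releases don't have ShadeRule), is to
let the assembly plugin do the renaming instead of copying source:

```scala
// build.sbt fragment (sketch; requires an sbt-assembly release with
// shading support).  Relocate clearspring and its fastutil dependency
// into our own namespace so Spark's copies can never shadow them.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.clearspring.analytics.**" -> "shaded.clearspring.@1").inAll,
  ShadeRule.rename("it.unimi.dsi.fastutil.**" -> "shaded.fastutil.@1").inAll
)
```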

Can anyone recommend a way to deal with this dependency hell?


=== The spark/pom.xml file contains the following lines:

  <dependency>
    <groupId>com.clearspring.analytics</groupId>
    <artifactId>stream</artifactId>
    <version>2.7.0</version>
    <exclusions>
      <exclusion>
        <groupId>it.unimi.dsi</groupId>
        <artifactId>fastutil</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

=== The parquet-column/pom.xml file contains:

  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <minimizeJar>true</minimizeJar>
        <artifactSet>
          <includes>
            <include>it.unimi.dsi:fastutil</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi</pattern>
            <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
