Exception while running unit tests that make use of local-cluster mode

2014-10-23 Thread Varadharajan Mukundan
Hi All,

When I try to run unit tests that make use of local-cluster mode (e.g.
"Accessing HttpBroadcast variables in a local cluster" in
BroadcastSuite.scala), they fail with the exception below. I'm using
Java version 1.8.0_05 and Scala version 2.10. I looked into
the Jenkins build report and the same tests pass there. Please let me
know how I can resolve this issue.

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 6,
192.168.43.112): java.lang.ClassNotFoundException:
org.apache.spark.broadcast.BroadcastSuite$$anonfun$3$$anonfun$19
java.net.URLClassLoader$1.run(URLClassLoader.java:372)
java.net.URLClassLoader$1.run(URLClassLoader.java:361)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:360)
java.lang.ClassLoader.loadClass(ClassLoader.java:424)
java.lang.ClassLoader.loadClass(ClassLoader.java:357)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:340)

org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 6, 192.168.43.112):
java.lang.ClassNotFoundException:
org.apache.spark.broadcast.BroadcastSuite$$anonfun$3$$anonfun$19
java.net.URLClassLoader$1.run(URLClassLoader.java:372)
java.net.URLClassLoader$1.run(URLClassLoader.java:361)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:360)
java.lang.ClassLoader.loadClass(ClassLoader.java:424)
java.lang.ClassLoader.loadClass(ClassLoader.java:357)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:340)

org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
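
A likely cause: unlike plain local[N] mode, local-cluster mode launches executors in
separate JVMs, so the test classes (including the suite's anonymous closures) must be
on the executors' classpath, which for Spark's own tests usually means building the
assembly jar first (e.g. sbt/sbt assembly). A minimal sketch of how a test asks for
local-cluster mode; the master string below uses the
local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB] form, as far as I can
tell, and the numbers are only illustrative:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalClusterSketch {
    public static void main(String[] args) {
        // local-cluster[2, 1, 512]: 2 workers, 1 core each, 512 MB each.
        // Each worker is a separate JVM, so classes referenced by tasks must
        // be visible on the executor classpath (assembly jar), not just in
        // the JVM that created the SparkContext.
        SparkConf conf = new SparkConf()
            .setAppName("local-cluster-sketch")
            .setMaster("local-cluster[2, 1, 512]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        long n = sc.parallelize(Arrays.asList(1, 2, 3)).count();
        System.out.println(n);
        sc.stop();
    }
}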

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Jianshi Huang
Upvote for the multitenancy requirement.

I'm also building a data analytics platform and there'll be multiple users
running queries and computations simultaneously. One of the pain points is
control of resource size. Users don't really know how many nodes they need,
so they always ask for as much as possible... The result is a lot of wasted
resources in our YARN cluster.

A way to either 1) allow multiple Spark contexts to share the same resources or 2)
add dynamic resource management for YARN mode is very much wanted.

Jianshi

On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:

 On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
 ashwinshanka...@gmail.com wrote:
  That's not something you might want to do usually. In general, a
  SparkContext maps to a user application
 
  My question was basically this. In this page in the official doc, under
  Scheduling within an application section, it talks about multiuser and
  fair sharing within an app. How does multiuser within an application
  work (how do users connect to an app and run their stuff)? When would I want
  to use this?

 I see. The way I read that page is that Spark supports all those
 scheduling options; but Spark doesn't give you the means to actually
 be able to submit jobs from different users to a running SparkContext
 hosted on a different process. For that, you'll need something like
 the job server that I referenced before, or write your own framework
 for supporting that.

 Personally, I'd use the information on that page when dealing with
 concurrent jobs in the same SparkContext, but still restricted to the
 same user. I'd avoid trying to create any application where a single
 SparkContext is trying to be shared by multiple users in any way.

  As far as I understand, this will cause executors to be killed, which
  means that Spark will start retrying tasks to rebuild the data that
  was held by those executors when needed.
 
  I basically wanted to find out if there were any gotchas related to
  preemption on Spark. Things like say half of an application's executors
 got
  preempted say while doing reduceByKey, will the application progress with
  the remaining resources/fair share ?

 Jobs should still make progress as long as at least one executor is
 available. The gotcha would be the one I mentioned, where Spark will
 fail your job after x executors failed, which might be a common
 occurrence when preemption is enabled. That being said, it's a
 configurable option, so you can set x to a very large value and your
 job should keep on chugging along.

 The options you'd want to take a look at are: spark.task.maxFailures
 and spark.yarn.max.executor.failures

 --
 Marcelo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
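
For reference, a small sketch of where the two options Marcelo mentions above would
be set; the property names are the ones quoted in the thread and the numeric values
are purely illustrative, not recommendations:

import org.apache.spark.SparkConf;

public class PreemptionTolerantConf {
    // Raise task/executor failure tolerance so preempted executors do not
    // abort the whole job. The numbers below are illustrative only.
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("preemption-tolerant-app")
            .set("spark.task.maxFailures", "100")
            .set("spark.yarn.max.executor.failures", "200");
    }
}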


PR for Hierarchical Clustering Needs Review

2014-10-23 Thread RJ Nowling
Hi all,

A few months ago, I collected feedback on what the community was looking
for in clustering methods.  A number of the community members requested a
divisive hierarchical clustering method.

Yu Ishikawa has stepped up to implement such a method.  I've been working
with him to communicate what I heard the community request and to review
and improve his code.

You can find the JIRA here:
https://issues.apache.org/jira/browse/SPARK-2429

He has now submitted a PR:
https://github.com/apache/spark/pull/2906

I was hoping Xiangrui, other committers, and community members would be
willing to take a look?  It's quite a large patch so it'll need extra
attention.

Thank you,
RJ

-- 
em rnowl...@gmail.com
c 954.496.2314


Memory

2014-10-23 Thread Tom Hubregtsen
Hi all,

I would like to validate my understanding of memory regions in Spark. Any
comments on my description below would be appreciated!

Execution is split into stages, based on wide dependencies between RDDs
and actions such as save. All transformations involving narrow dependencies
before such a wide dependency (or action) are pipelined. When Spark reads from HDFS,
input data is loaded into memory according to the partitioning used in
HDFS. As Spark has three memory regions (general, shuffle and storage), and this
does not yet involve an explicit cache or a shuffle, I assume the data goes
into the general region. During the pipelined execution of transformations with
narrow dependencies it stays there, using the same partitioning, until we reach a
wide dependency (or an action). Spark then acquires memory from the shuffle
region, and spills to disk when there is not enough memory
available. The result is passed back in an iterator (located in the general
space) and the shuffle region is freed.

Only when an RDD is explicitly cached does it move from the general region into
the storage region. This guarantees availability for future use, and
also frees space in the general region.

Question 1: Is this correct?
Question 2: How big is the general region?
Example: Suppose I tell Spark it has 4 GB, but my system actually has 16 GB. I see
that the shuffle and storage fractions are defined (defaults: 20% and 60%),
but the general area does not seem to be bounded by a parameter of its own. Will
Spark use:
Shuffle: 0.2 * 4 = 0.8 GB
Storage: 0.6 * 4 = 2.4 GB
General: (1 - (0.2 + 0.6)) * 4 = 0.8 GB
Or
Shuffle: at most 0.2 * 4 = 0.8 GB // as memory is acquired against a counter, not
carved into physically separate regions
Storage: at most 0.6 * 4 = 2.4 GB
General: 16 - actualUsage(shuffle + storage), or 4 - actualUsage(shuffle + storage)
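
In case a sketch helps frame the question: as far as I know, the 1.x fractions
(spark.storage.memoryFraction, default 0.6, and spark.shuffle.memoryFraction,
default 0.2) are applied to the executor heap given to Spark, not to the machine's
physical memory, so with a 4 GB heap the arithmetic would be the first variant above.
The numbers and property usage below are illustrative only:

import org.apache.spark.SparkConf;

public class MemoryRegionsSketch {
    public static void main(String[] args) {
        // Illustrative numbers for a 4 GB executor heap on a 16 GB machine.
        double heapGb = 4.0;
        double storageFraction = 0.6;  // spark.storage.memoryFraction (1.x default)
        double shuffleFraction = 0.2;  // spark.shuffle.memoryFraction (1.x default)

        System.out.printf("storage: %.1f GB%n", heapGb * storageFraction);  // 2.4 GB
        System.out.printf("shuffle: %.1f GB%n", heapGb * shuffleFraction);  // 0.8 GB
        System.out.printf("general: %.1f GB%n",
            heapGb * (1 - storageFraction - shuffleFraction));              // 0.8 GB

        // The corresponding configuration; values illustrative.
        SparkConf conf = new SparkConf()
            .set("spark.executor.memory", "4g")
            .set("spark.storage.memoryFraction", "0.6")
            .set("spark.shuffle.memoryFraction", "0.2");
        System.out.println(conf.toDebugString());
    }
}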



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-tp8916.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: reading/writing parquet decimal type

2014-10-23 Thread Michael Allman
Hi Matei,

Another thing occurred to me. Will the binary format you're writing sort the 
data in numeric order? Or would the decimals have to be decoded for comparison?

Cheers,

Michael


 On Oct 12, 2014, at 10:48 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 The fixed-length binary type can hold fewer bytes than an int64, though many 
 encodings of int64 can probably do the right thing. We can look into 
 supporting multiple ways to do this -- the spec does say that you should at 
 least be able to read int32s and int64s.
 
 Matei
 
 On Oct 12, 2014, at 8:20 PM, Michael Allman mich...@videoamp.com wrote:
 
 Hi Matei,
 
 Thanks, I can see you've been hard at work on this! I examined your patch 
 and do have a question. It appears you're limiting the precision of decimals 
 written to parquet to those that will fit in a long, yet you're writing the 
 values as a parquet binary type. Why not write them using the int64 parquet 
 type instead?
 
 Cheers,
 
 Michael
 
 On Oct 12, 2014, at 3:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Hi Michael,
 
 I've been working on this in my repo: 
 https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests 
 with these features soon, but meanwhile you can try this branch. See 
 https://github.com/mateiz/spark/compare/decimal for the individual commits 
 that went into it. It has exactly the precision stuff you need, plus some 
 optimizations for working on decimals.
 
 Matei
 
 On Oct 12, 2014, at 1:51 PM, Michael Allman mich...@videoamp.com wrote:
 
 Hello,
 
 I'm interested in reading/writing parquet SchemaRDDs that support the 
 Parquet Decimal converted type. The first thing I did was update the Spark 
 parquet dependency to version 1.5.0, as this version introduced support 
 for decimals in parquet. However, conversion between the catalyst decimal 
 type and the parquet decimal type is complicated by the fact that the 
 catalyst type does not specify a decimal precision and scale but the 
 parquet type requires them.
 
 I'm wondering if perhaps we could add an optional precision and scale to 
 the catalyst decimal type? The catalyst decimal type would have 
 unspecified precision and scale by default for backwards compatibility, 
 but users who want to serialize a SchemaRDD with decimal(s) to parquet 
 would have to narrow their decimal type(s) by specifying a precision and 
 scale.
 
 Thoughts?
 
 Michael
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
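
For readers of the archive, a small sketch of the representation discussed above,
i.e. why limiting precision to what fits in a long makes an int64 encoding possible.
This is plain java.math.BigDecimal arithmetic, not Spark's internal API:

import java.math.BigDecimal;

public class DecimalAsLongSketch {
    public static void main(String[] args) {
        // decimal(precision=10, scale=2): value = unscaled * 10^(-scale).
        BigDecimal price = new BigDecimal("12345678.90");

        // Any decimal with precision <= 18 has an unscaled value below 10^18,
        // which always fits in a signed 64-bit integer (max ~9.22 * 10^18),
        // so it can be stored as an int64 plus a fixed scale from the schema.
        long unscaled = price.unscaledValue().longValueExact(); // 1234567890
        int scale = price.scale();                              // 2

        // Reading it back only needs the int64 and the scale.
        BigDecimal roundTrip = BigDecimal.valueOf(unscaled, scale);
        System.out.println(roundTrip); // 12345678.90
    }
}

On the sorting question: if the scale is fixed by the schema, comparing the unscaled
values as signed 64-bit integers gives numeric order, whereas a raw byte-wise
comparison of a two's-complement binary encoding generally would not without special
handling of the sign bit.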



Receiver/DStream storage level

2014-10-23 Thread Michael Allman
I'm implementing a custom ReceiverInputDStream and I'm not sure how to
initialize the Receiver with the storage level. The storage level is set on the
DStream, but there doesn't seem to be a way to pass it down to the Receiver. At the
same time, setting the storage level separately on the Receiver seems to
invite confusion, since the storage level of the DStream can be set
independently. Is this the desired behavior, to have distinct DStream and Receiver
storage levels? Perhaps I'm missing something? Also, the storageLevel property
of the Receiver[T] class is undocumented.

Cheers,

Michael
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
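
For anyone hitting the same question, a sketch of the usual pattern, assuming the
public Receiver API: the DStream's storage level is threaded into the Receiver
through its constructor, since Receiver takes a StorageLevel there. The class below
and its body are made up for illustration:

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Hypothetical custom receiver: the DStream that creates it passes its own
// storage level down, so there is only one place where the level is chosen.
public class MyReceiver extends Receiver<String> {
    public MyReceiver(StorageLevel storageLevel) {
        super(storageLevel); // blocks received via store(...) use this level
    }

    @Override
    public void onStart() {
        // Start a thread that calls store(...) with received records.
    }

    @Override
    public void onStop() {
        // Stop the receiving thread.
    }
}

On the ReceiverInputDStream side, getReceiver() would then return
new MyReceiver(storageLevel) with whatever level was configured on the DStream, so
the two stay in sync.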



Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Marcelo Vanzin
You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174.

On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
 Upvote for the multitenancy requirement.

 I'm also building a data analytics platform and there'll be multiple users
 running queries and computations simultaneously. One of the pain points is
 control of resource size. Users don't really know how many nodes they need,
 so they always ask for as much as possible... The result is a lot of wasted
 resources in our YARN cluster.

 A way to either 1) allow multiple Spark contexts to share the same resources or 2)
 add dynamic resource management for YARN mode is very much wanted.

 Jianshi

 On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:

 On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
 ashwinshanka...@gmail.com wrote:
  That's not something you might want to do usually. In general, a
  SparkContext maps to a user application
 
  My question was basically this. In this page in the official doc, under
  Scheduling within an application section, it talks about multiuser and
  fair sharing within an app. How does multiuser within an application
  work (how do users connect to an app and run their stuff)? When would I want
  to use this?

 I see. The way I read that page is that Spark supports all those
 scheduling options; but Spark doesn't give you the means to actually
 be able to submit jobs from different users to a running SparkContext
 hosted on a different process. For that, you'll need something like
 the job server that I referenced before, or write your own framework
 for supporting that.

 Personally, I'd use the information on that page when dealing with
 concurrent jobs in the same SparkContext, but still restricted to the
 same user. I'd avoid trying to create any application where a single
 SparkContext is trying to be shared by multiple users in any way.

  As far as I understand, this will cause executors to be killed, which
  means that Spark will start retrying tasks to rebuild the data that
  was held by those executors when needed.
 
  I basically wanted to find out if there were any gotchas related to
  preemption on Spark. Things like say half of an application's executors
  got
  preempted say while doing reduceByKey, will the application progress
  with
  the remaining resources/fair share ?

 Jobs should still make progress as long as at least one executor is
 available. The gotcha would be the one I mentioned, where Spark will
 fail your job after x executors failed, which might be a common
 occurrence when preemption is enabled. That being said, it's a
 configurable option, so you can set x to a very large value and your
 job should keep on chugging along.

 The options you'd want to take a look at are: spark.task.maxFailures
 and spark.yarn.max.executor.failures

 --
 Marcelo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



scalastyle annoys me a little bit

2014-10-23 Thread Koert Kuipers
100 max width seems very restrictive to me.

even the most restrictive environment i have for development (ssh with
emacs) i get a lot more characters to work with than that.

personally i find the code harder to read, not easier. like i kept
wondering why there are weird newlines in the
middle of constructors and such, only to realise later it was because of
the 100 character limit.

also, i find mvn package erroring out because of style errors somewhat
excessive. i understand that a pull request needs to conform to the style
before being accepted, but this means i cant even run tests on code that
does not conform to the style guide, which is a bit silly.

i keep going out for coffee while package and tests run, only to come back
for an annoying error that my line is 101 characters and therefore nothing
ran.

is there some maven switch to disable the style checks?

best! koert


Re: scalastyle annoys me a little bit

2014-10-23 Thread Patrick Wendell
Hey Koert,

I think disabling the style checks in maven package could be a good
idea for the reason you point out. I was sort of mixed on that when it
was proposed for this exact reason. It's just annoying to developers.

In terms of changing the global limit, this is more religion than
anything else, but there are other cases where the current limit is
useful (e.g. if you have many windows open in a large screen).

- Patrick

On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote:
 100 max width seems very restrictive to me.

 even the most restrictive environment i have for development (ssh with
 emacs) i get a lot more characters to work with than that.

 personally i find the code harder to read, not easier. like i kept
 wondering why there are weird newlines in the
 middle of constructors and such, only to realise later it was because of
 the 100 character limit.

 also, i find mvn package erroring out because of style errors somewhat
 excessive. i understand that a pull request needs to conform to the style
 before being accepted, but this means i cant even run tests on code that
 does not conform to the style guide, which is a bit silly.

 i keep going out for coffee while package and tests run, only to come back
 for an annoying error that my line is 101 characters and therefore nothing
 ran.

 is there some maven switch to disable the style checks?

 best! koert

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: scalastyle annoys me a little bit

2014-10-23 Thread Marcelo Vanzin
I know this is all very subjective, but I find long lines difficult to read.

I also like how 100 characters fit in my editor setup fine (split wide
screen), while a longer line length would mean I can't have two
buffers side-by-side without horizontal scrollbars.

I think it's fine to add a switch to skip the style tests, but then,
you'll still have to fix the issues at some point...


On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote:
 100 max width seems very restrictive to me.

 even the most restrictive environment i have for development (ssh with
 emacs) i get a lot more characters to work with than that.

 personally i find the code harder to read, not easier. like i kept
 wondering why there are weird newlines in the
 middle of constructors and such, only to realise later it was because of
 the 100 character limit.

 also, i find mvn package erroring out because of style errors somewhat
 excessive. i understand that a pull request needs to conform to the style
 before being accepted, but this means i cant even run tests on code that
 does not conform to the style guide, which is a bit silly.

 i keep going out for coffee while package and tests run, only to come back
 for an annoying error that my line is 101 characters and therefore nothing
 ran.

 is there some maven switch to disable the style checks?

 best! koert



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: scalastyle annoys me a little bit

2014-10-23 Thread Ted Yu
Koert:
Have you tried adding the following on your commandline ?

-Dscalastyle.failOnViolation=false

Cheers

On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was because of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to come
 back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Spark 1.2 feature freeze on November 1

2014-10-23 Thread Patrick Wendell
Hey All,

Just a reminder that as planned [1] we'll go into a feature freeze on
November 1. On that date I'll cut a 1.2 release branch and make the
up-or-down call on any patches that go into that branch, along with
individual committers.

It is common for us to receive a very large volume of patches near the
deadline. The highest priority will be fixes and features that are in
review and were submitted earlier in the window. As a heads up, new
feature patches that are submitted in the next week have a good chance
of being pushed after 1.2.

During the coming weeks, I'd like to invite the community to help
with code review, testing patches, helping isolate bugs, our test
infra, etc. In past releases, community participation has helped
increase our ability to merge patches substantially. Individuals
really can make a huge difference here!

[1] https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: scalastyle annoys me a little bit

2014-10-23 Thread Koert Kuipers
Hey Ted,
i tried:
mvn clean package -DskipTests -Dscalastyle.failOnViolation=false

no luck, still get
[ERROR] Failed to execute goal
org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project
spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle
violation(s). - [Help 1]


On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 Have you tried adding the following on your commandline ?

 -Dscalastyle.failOnViolation=false

 Cheers

 On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com
 wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was because of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to come
 back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: PR for Hierarchical Clustering Needs Review

2014-10-23 Thread Xiangrui Meng
Hi RJ,

We are close to the v1.2 feature freeze deadline, so I'm busy with the
pipeline feature and a couple of bugs. I will ask other developers to help
review the PR. Thanks for working with Yu and helping with the code review!

Best,
Xiangrui

On Thu, Oct 23, 2014 at 2:58 AM, RJ Nowling rnowl...@gmail.com wrote:
 Hi all,

 A few months ago, I collected feedback on what the community was looking
 for in clustering methods.  A number of the community members requested a
 divisive hierarchical clustering method.

 Yu Ishikawa has stepped up to implement such a method.  I've been working
 with him to communicate what I heard the community request and to review
 and improve his code.

 You can find the JIRA here:
 https://issues.apache.org/jira/browse/SPARK-2429

 He has now submitted a PR:
 https://github.com/apache/spark/pull/2906

 I was hoping Xiangrui, other committers, and community members would be
 willing to take a look?  It's quite a large patch so it'll need extra
 attention.

 Thank you,
 RJ

 --
 em rnowl...@gmail.com
 c 954.496.2314

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: scalastyle annoys me a little bit

2014-10-23 Thread Ted Yu
Koert:
If you have time, you can try this diff - with which you would be able to
specify the following on the command line:
-Dscalastyle.failonviolation=false

diff --git a/pom.xml b/pom.xml
index 687cc63..108585e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -123,6 +123,7 @@
     <log4j.version>1.2.17</log4j.version>
     <hadoop.version>1.0.4</hadoop.version>
     <protobuf.version>2.4.1</protobuf.version>
+    <scalastyle.failonviolation>true</scalastyle.failonviolation>
     <yarn.version>${hadoop.version}</yarn.version>
     <hbase.version>0.94.6</hbase.version>
     <flume.version>1.4.0</flume.version>
@@ -1071,7 +1072,7 @@
             <version>0.4.0</version>
             <configuration>
               <verbose>false</verbose>
-              <failOnViolation>true</failOnViolation>
+              <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
               <includeTestSourceDirectory>false</includeTestSourceDirectory>
               <failOnWarning>false</failOnWarning>
               <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>



On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote:

 Hey Ted,
 i tried:
 mvn clean package -DskipTests -Dscalastyle.failOnViolation=false

 no luck, still get
 [ERROR] Failed to execute goal
 org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project
 spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle
 violation(s). - [Help 1]


 On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 Have you tried adding the following on your commandline ?

 -Dscalastyle.failOnViolation=false

 Cheers

 On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com
 wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was because
 of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors
 somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code
 that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to come
 back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: scalastyle annoys me a little bit

2014-10-23 Thread Koert Kuipers
great thanks i will do that

On Thu, Oct 23, 2014 at 3:55 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 If you have time, you can try this diff - with which you would be able to
 specify the following on the command line:
 -Dscalastyle.failonviolation=false

 diff --git a/pom.xml b/pom.xml
 index 687cc63..108585e 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -123,6 +123,7 @@
      <log4j.version>1.2.17</log4j.version>
      <hadoop.version>1.0.4</hadoop.version>
      <protobuf.version>2.4.1</protobuf.version>
 +    <scalastyle.failonviolation>true</scalastyle.failonviolation>
      <yarn.version>${hadoop.version}</yarn.version>
      <hbase.version>0.94.6</hbase.version>
      <flume.version>1.4.0</flume.version>
 @@ -1071,7 +1072,7 @@
              <version>0.4.0</version>
              <configuration>
                <verbose>false</verbose>
 -              <failOnViolation>true</failOnViolation>
 +              <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
                <includeTestSourceDirectory>false</includeTestSourceDirectory>
                <failOnWarning>false</failOnWarning>
                <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>



 On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote:

 Hey Ted,
 i tried:
 mvn clean package -DskipTests -Dscalastyle.failOnViolation=false

 no luck, still get
 [ERROR] Failed to execute goal
 org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project
 spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle
 violation(s). - [Help 1]


 On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 Have you tried adding the following on your commandline ?

 -Dscalastyle.failOnViolation=false

 Cheers

 On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com
 wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was because
 of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors
 somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code
 that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to come
 back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org







Re: scalastyle annoys me a little bit

2014-10-23 Thread Ted Yu
Created SPARK-4066 and attached patch there.

On Thu, Oct 23, 2014 at 1:07 PM, Koert Kuipers ko...@tresata.com wrote:

 great thanks i will do that

 On Thu, Oct 23, 2014 at 3:55 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 If you have time, you can try this diff - with which you would be able to
 specify the following on the command line:
 -Dscalastyle.failonviolation=false

 diff --git a/pom.xml b/pom.xml
 index 687cc63..108585e 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -123,6 +123,7 @@
      <log4j.version>1.2.17</log4j.version>
      <hadoop.version>1.0.4</hadoop.version>
      <protobuf.version>2.4.1</protobuf.version>
 +    <scalastyle.failonviolation>true</scalastyle.failonviolation>
      <yarn.version>${hadoop.version}</yarn.version>
      <hbase.version>0.94.6</hbase.version>
      <flume.version>1.4.0</flume.version>
 @@ -1071,7 +1072,7 @@
              <version>0.4.0</version>
              <configuration>
                <verbose>false</verbose>
 -              <failOnViolation>true</failOnViolation>
 +              <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
                <includeTestSourceDirectory>false</includeTestSourceDirectory>
                <failOnWarning>false</failOnWarning>
                <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>



 On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com
 wrote:

 Hey Ted,
 i tried:
 mvn clean package -DskipTests -Dscalastyle.failOnViolation=false

 no luck, still get
 [ERROR] Failed to execute goal
 org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project
 spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle
 violation(s). - [Help 1]


 On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Koert:
 Have you tried adding the following on your commandline ?

 -Dscalastyle.failOnViolation=false

 Cheers

 On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Koert,

 I think disabling the style checks in maven package could be a good
 idea for the reason you point out. I was sort of mixed on that when it
 was proposed for this exact reason. It's just annoying to developers.

 In terms of changing the global limit, this is more religion than
 anything else, but there are other cases where the current limit is
 useful (e.g. if you have many windows open in a large screen).

 - Patrick

 On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com
 wrote:
  100 max width seems very restrictive to me.
 
  even the most restrictive environment i have for development (ssh
 with
  emacs) i get a lot more characters to work with than that.
 
  personally i find the code harder to read, not easier. like i kept
  wondering why there are weird newlines in the
  middle of constructors and such, only to realise later it was
 because of
  the 100 character limit.
 
  also, i find mvn package erroring out because of style errors
 somewhat
  excessive. i understand that a pull request needs to conform to the
 style
  before being accepted, but this means i cant even run tests on code
 that
  does not conform to the style guide, which is a bit silly.
 
  i keep going out for coffee while package and tests run, only to
 come back
  for an annoying error that my line is 101 characters and therefore
 nothing
  ran.
 
  is there some maven switch to disable the style checks?
 
  best! koert

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org








label points with a given index

2014-10-23 Thread Lochana Menikarachchi


SparkConf conf = new SparkConf().setAppName("LogisticRegression").setMaster("local[4]");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("some.csv");
JavaRDD<LabeledPoint> lPoints = lines.map(new CSVLineParser());

Is there any way to pass an index to a function? For example, instead of
hard-coding parts[0] below, is there a way to pass this in?




public class CSVLineParser implements Function<String, LabeledPoint> {
    private static final Pattern COMMA = Pattern.compile(",");

    @Override
    public LabeledPoint call(String line) {
        String[] parts = COMMA.split(line);
        // The label is hard-coded to the first column.
        double y = Double.parseDouble(parts[0]);
        // The remaining columns become the feature vector.
        double[] x = new double[parts.length - 1];
        for (int i = 1; i < parts.length; ++i) {
            x[i - 1] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: label points with a given index

2014-10-23 Thread Lochana Menikarachchi

Figured out that the constructor can be used for this purpose.
On 10/24/14 7:57 AM, Lochana Menikarachchi wrote:


SparkConf conf = new SparkConf().setAppName("LogisticRegression").setMaster("local[4]");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("some.csv");
JavaRDD<LabeledPoint> lPoints = lines.map(new CSVLineParser());

Is there any way to pass an index to a function? For example, instead
of hard-coding parts[0] below, is there a way to pass this in?




public class CSVLineParser implements Function<String, LabeledPoint> {
    private static final Pattern COMMA = Pattern.compile(",");

    @Override
    public LabeledPoint call(String line) {
        String[] parts = COMMA.split(line);
        // The label is hard-coded to the first column.
        double y = Double.parseDouble(parts[0]);
        // The remaining columns become the feature vector.
        double[] x = new double[parts.length - 1];
        for (int i = 1; i < parts.length; ++i) {
            x[i - 1] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}
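
For the archives, a quick sketch of that approach; the class name and structure here
are only illustrative, not taken from the original code:

import java.util.regex.Pattern;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Sketch: the label column index is passed through the constructor instead
// of being hard-coded as parts[0].
public class IndexedCSVLineParser implements Function<String, LabeledPoint> {
    private static final Pattern COMMA = Pattern.compile(",");
    private final int labelIndex;

    public IndexedCSVLineParser(int labelIndex) {
        this.labelIndex = labelIndex;
    }

    @Override
    public LabeledPoint call(String line) {
        String[] parts = COMMA.split(line);
        double y = Double.parseDouble(parts[labelIndex]);
        double[] x = new double[parts.length - 1];
        int j = 0;
        for (int i = 0; i < parts.length; i++) {
            if (i != labelIndex) {
                x[j++] = Double.parseDouble(parts[i]);
            }
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}

Used as, e.g., lines.map(new IndexedCSVLineParser(0)) for the layout above.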



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
Ashwin,

I would say the strategies in general are:

1) Have each user submit a separate Spark app (each with its own
SparkContext), with its own resource settings, and share data through HDFS
or something like Tachyon for speed.

2) Share a single SparkContext amongst multiple users, using the fair
scheduler.  This is sort of like having a Hadoop resource pool. It
has some obvious HA/SPOF issues, namely that if the context dies then
every user using it is also dead.   Also, sharing RDDs in cached
memory has the same resiliency problem, namely that if any executor
dies then Spark must recompute / rebuild the RDD (it tries to only
rebuild the missing part, but sometimes it must rebuild everything).
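
A minimal sketch of what option 2 looks like in code, using the fair scheduler
properties from the job scheduling docs; the pool name is illustrative, and pools can
be further tuned with a fair scheduler allocation file:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class FairSchedulerSketch {
    public static void main(String[] args) {
        // Turn on fair scheduling within this single SparkContext.
        SparkConf conf = new SparkConf()
            .setAppName("shared-context")
            .set("spark.scheduler.mode", "FAIR");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each user/session tags its jobs with a pool; jobs in different
        // pools then share the context's resources fairly.
        sc.setLocalProperty("spark.scheduler.pool", "userA");  // pool name illustrative
        sc.parallelize(Arrays.asList(1, 2, 3)).count();

        sc.setLocalProperty("spark.scheduler.pool", null);  // back to the default pool
        sc.stop();
    }
}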

Job server can help with 1 or 2, 2 in particular.  If you have any
questions about job server, feel free to ask at the spark-jobserver
google group.   I am the maintainer.

-Evan


On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin van...@cloudera.com wrote:
 You may want to take a look at 
 https://issues.apache.org/jira/browse/SPARK-3174.

 On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com 
 wrote:
 Upvote for the multitenancy requirement.

 I'm also building a data analytics platform and there'll be multiple users
 running queries and computations simultaneously. One of the pain points is
 control of resource size. Users don't really know how many nodes they need,
 so they always ask for as much as possible... The result is a lot of wasted
 resources in our YARN cluster.

 A way to either 1) allow multiple Spark contexts to share the same resources or 2)
 add dynamic resource management for YARN mode is very much wanted.

 Jianshi

 On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:

 On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
 ashwinshanka...@gmail.com wrote:
  That's not something you might want to do usually. In general, a
  SparkContext maps to a user application
 
  My question was basically this. In this page in the official doc, under
  Scheduling within an application section, it talks about multiuser and
  fair sharing within an app. How does multiuser within an application
  work (how do users connect to an app and run their stuff)? When would I want
  to use this?

 I see. The way I read that page is that Spark supports all those
 scheduling options; but Spark doesn't give you the means to actually
 be able to submit jobs from different users to a running SparkContext
 hosted on a different process. For that, you'll need something like
 the job server that I referenced before, or write your own framework
 for supporting that.

 Personally, I'd use the information on that page when dealing with
 concurrent jobs in the same SparkContext, but still restricted to the
 same user. I'd avoid trying to create any application where a single
 SparkContext is trying to be shared by multiple users in any way.

  As far as I understand, this will cause executors to be killed, which
  means that Spark will start retrying tasks to rebuild the data that
  was held by those executors when needed.
 
  I basically wanted to find out if there were any gotchas related to
  preemption on Spark. Things like say half of an application's executors
  got
  preempted say while doing reduceByKey, will the application progress
  with
  the remaining resources/fair share ?

 Jobs should still make progress as long as at least one executor is
 available. The gotcha would be the one I mentioned, where Spark will
 fail your job after x executors failed, which might be a common
 occurrence when preemption is enabled. That being said, it's a
 configurable option, so you can set x to a very large value and your
 job should keep on chugging along.

 The options you'd want to take a look at are: spark.task.maxFailures
 and spark.yarn.max.executor.failures

 --
 Marcelo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org