Exception while running unit tests that make use of local-cluster mode
Hi All, When I try to run unit tests that make use of local-cluster mode (e.g. "Accessing HttpBroadcast variables in a local cluster" in BroadcastSuite.scala), they fail with the exception below. I'm using Java 1.8.0_05 and Scala 2.10. The same tests pass in the Jenkins build report. Please let me know how I can resolve this issue.

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.43.112): java.lang.ClassNotFoundException: org.apache.spark.broadcast.BroadcastSuite$$anonfun$3$$anonfun$19
    java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    java.security.AccessController.doPrivileged(Native Method)
    java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    java.lang.Class.forName0(Native Method)
    java.lang.Class.forName(Class.java:340)
    org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
    java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
    org.apache.spark.scheduler.Task.run(Task.scala:56)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)

Driver stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.43.112): java.lang.ClassNotFoundException: org.apache.spark.broadcast.BroadcastSuite$$anonfun$3$$anonfun$19
    java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    java.security.AccessController.doPrivileged(Native Method)
    java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    java.lang.Class.forName0(Native Method)
    java.lang.Class.forName(Class.java:340)
    org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
    java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
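For context, the master URL that drives these tests has the form local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB]. Unlike local[*], local-cluster launches separate executor JVMs, so a ClassNotFoundException during task deserialization generally means the compiled test classes are not visible on the executors' classpath (for Spark's own suites this typically requires the assembly jar to be built first). Below is a minimal, purely illustrative Java sketch of driving a job against local-cluster mode; the class name and values are assumptions, not part of the original report:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalClusterSketch {
  public static void main(String[] args) {
    // Illustrative only: 2 workers, 1 core each, 512 MB each.
    SparkConf conf = new SparkConf()
        .setAppName("LocalClusterSketch")
        .setMaster("local-cluster[2,1,512]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // The closure below is shipped to the executor JVMs; the class that defines it
    // must be loadable there, otherwise deserialization fails with a
    // ClassNotFoundException like the BroadcastSuite anonymous function in the trace above.
    long evens = sc.parallelize(Arrays.asList(1, 2, 3, 4)).filter(x -> x % 2 == 0).count();
    System.out.println("evens = " + evens);

    sc.stop();
  }
}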
Re: Multitenancy in Spark - within/across spark context
Upvote for the multitenancy requirement. I'm also building a data analytics platform and there will be multiple users running queries and computations simultaneously. One of the pain points is control of resource size. Users don't really know how many nodes they need, so they always use as much as possible... The result is a lot of wasted resources in our Yarn cluster. A way to 1) allow multiple Spark contexts to share the same resources or 2) add dynamic resource management for Yarn mode is very much wanted. Jianshi

On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:

On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: That's not something you might want to do usually. In general, a SparkContext maps to a user application My question was basically this. In this page in the official doc, under Scheduling within an application section, it talks about multiuser and fair sharing within an app. How does multiuser within an application work(how users connect to an app,run their stuff) ? When would I want to use this ?

I see. The way I read that page is that Spark supports all those scheduling options; but Spark doesn't give you the means to actually be able to submit jobs from different users to a running SparkContext hosted on a different process. For that, you'll need something like the job server that I referenced before, or write your own framework for supporting that. Personally, I'd use the information on that page when dealing with concurrent jobs in the same SparkContext, but still restricted to the same user. I'd avoid trying to create any application where a single SparkContext is trying to be shared by multiple users in any way.

As far as I understand, this will cause executors to be killed, which means that Spark will start retrying tasks to rebuild the data that was held by those executors when needed. I basically wanted to find out if there were any gotchas related to preemption on Spark. Things like say half of an application's executors got preempted say while doing reduceByKey, will the application progress with the remaining resources/fair share ?

Jobs should still make progress as long as at least one executor is available. The gotcha would be the one I mentioned, where Spark will fail your job after x executors failed, which might be a common occurrence when preemption is enabled. That being said, it's a configurable option, so you can set x to a very large value and your job should keep on chugging along. The options you'd want to take a look at are: spark.task.maxFailures and spark.yarn.max.executor.failures

-- Marcelo

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

-- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
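The two options mentioned at the end of this thread are ordinary Spark configuration keys, so they can be raised when preemption is expected to kill executors regularly. A minimal sketch; the values and class name below are illustrative only, not recommendations:

import org.apache.spark.SparkConf;

public class PreemptionTolerantConf {
  public static void main(String[] args) {
    // Illustrative values only; tune to how aggressively the cluster preempts.
    SparkConf conf = new SparkConf()
        .set("spark.task.maxFailures", "32")               // default is 4
        .set("spark.yarn.max.executor.failures", "128");
    System.out.println(conf.toDebugString());
  }
}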
PR for Hierarchical Clustering Needs Review
Hi all, A few months ago, I collected feedback on what the community was looking for in clustering methods. A number of the community members requested a divisive hierarchical clustering method. Yu Ishikawa has stepped up to implement such a method. I've been working with him to communicate what I heard the community request and to review and improve his code. You can find the JIRA here: https://issues.apache.org/jira/browse/SPARK-2429 He has now submitted a PR: https://github.com/apache/spark/pull/2906 I was hoping Xiangrui, other committers, and community members would be willing to take a look? It's quite a large patch so it'll need extra attention. Thank you, RJ -- em rnowl...@gmail.com c 954.496.2314
Memory
Hi all, I would like to validate my understanding of memory regions in Spark. Any comments on my description below would be appreciated!

Execution is split up into stages, based on wide dependencies between RDDs and actions such as save. All transformations involving narrow dependencies before this wide dependency (or action) are pipelined. When Spark uses HDFS, input data is loaded into memory according to the partitioning used in HDFS. As Spark has three regions (general, shuffle and storage), and this does not yet involve an explicit cache nor a shuffle, I'll assume it goes into general. During the pipelined execution of transformations with narrow dependencies, it stays there, using the same partitioning, until we reach a wide dependency (or an action). It then acquires memory from the shuffle region, and spills to disk when there is not a sufficient amount of memory available. The result is passed in an iterator (located in the general space) and the shuffle region is freed. Only when an RDD is explicitly cached does it move from the general region into the storage region. This guarantees availability for future use, but also frees space in the general region.

Question 1: Is this correct?

Question 2: How big is the general region? Example: suppose I tell Spark it has 4 GB, but my system actually has 16 GB. I see that the shuffle and storage parameters are defined (defaults: 20% and 60%), but the general area does not seem to be bounded by them. Will Spark use:

Shuffle: 0.2 * 4 = 0.8 GB
Storage: 0.6 * 4 = 2.4 GB
General: (1 - (0.2 + 0.6)) * 4 = 0.8 GB

Or:

Shuffle: at most 0.2 * 4 = 0.8 GB // as we acquire from a counter, these are not actually divided memory regions
Storage: at most 0.6 * 4 = 2.4 GB
General: 16 - actualUsage(shuffle + storage), or 4 - actualUsage(shuffle + storage)?

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-tp8916.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
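For reference, a sketch of the Spark 1.x settings that, to my understanding, correspond to the 20% and 60% fractions mentioned above; the executor memory value and class name are illustrative:

import org.apache.spark.SparkConf;

public class MemoryFractionsSketch {
  public static void main(String[] args) {
    // Spark 1.x knobs as I understand them; values are illustrative.
    SparkConf conf = new SparkConf()
        .set("spark.executor.memory", "4g")             // the "4 GB I tell Spark about"
        .set("spark.storage.memoryFraction", "0.6")     // storage (cache) region, default 0.6
        .set("spark.shuffle.memoryFraction", "0.2");    // shuffle region, default 0.2
    // Whatever the two fractions do not claim is left for general task execution,
    // and all of it is bounded by the executor's JVM heap (spark.executor.memory),
    // not by the machine's total 16 GB.
    System.out.println(conf.toDebugString());
  }
}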
Re: reading/writing parquet decimal type
Hi Matei, Another thing occurred to me. Will the binary format you're writing sort the data in numeric order? Or would the decimals have to be decoded for comparison? Cheers, Michael On Oct 12, 2014, at 10:48 PM, Matei Zaharia matei.zaha...@gmail.com wrote: The fixed-length binary type can hold fewer bytes than an int64, though many encodings of int64 can probably do the right thing. We can look into supporting multiple ways to do this -- the spec does say that you should at least be able to read int32s and int64s. Matei On Oct 12, 2014, at 8:20 PM, Michael Allman mich...@videoamp.com wrote: Hi Matei, Thanks, I can see you've been hard at work on this! I examined your patch and do have a question. It appears you're limiting the precision of decimals written to parquet to those that will fit in a long, yet you're writing the values as a parquet binary type. Why not write them using the int64 parquet type instead? Cheers, Michael On Oct 12, 2014, at 3:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Michael, I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with these features soon, but meanwhile you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It has exactly the precision stuff you need, plus some optimizations for working on decimals. Matei On Oct 12, 2014, at 1:51 PM, Michael Allman mich...@videoamp.com wrote: Hello, I'm interested in reading/writing parquet SchemaRDDs that support the Parquet Decimal converted type. The first thing I did was update the Spark parquet dependency to version 1.5.0, as this version introduced support for decimals in parquet. However, conversion between the catalyst decimal type and the parquet decimal type is complicated by the fact that the catalyst type does not specify a decimal precision and scale but the parquet type requires them. I'm wondering if perhaps we could add an optional precision and scale to the catalyst decimal type? The catalyst decimal type would have unspecified precision and scale by default for backwards compatibility, but users who want to serialize a SchemaRDD with decimal(s) to parquet would have to narrow their decimal type(s) by specifying a precision and scale. Thoughts? Michael - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
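To make the precision discussion concrete, here is a small sketch (not from the patch, just an illustration; the class name is mine) of how a decimal whose precision is small enough can be carried as an unscaled long plus a (precision, scale) pair, which is what limiting precision to values that fit in a long buys you:

import java.math.BigDecimal;

public class DecimalUnscaledSketch {
  public static void main(String[] args) {
    BigDecimal d = new BigDecimal("12345.6789");

    int precision = d.precision();   // 9 significant digits
    int scale = d.scale();           // 4 digits after the decimal point
    // Safe only because the precision is small enough for the unscaled value to fit
    // in a long; longValueExact() throws ArithmeticException otherwise.
    long unscaled = d.unscaledValue().longValueExact();   // 123456789

    System.out.println(precision + ", " + scale + ", " + unscaled);

    // Reconstructing the decimal requires knowing the scale, which is why the
    // catalyst type would need to record precision and scale, not just the raw bytes.
    BigDecimal roundTrip = BigDecimal.valueOf(unscaled, scale);
    System.out.println(roundTrip);   // 12345.6789
  }
}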
Receiver/DStream storage level
I'm implementing a custom ReceiverInputDStream and I'm not sure how to initialize the Receiver with the storage level. The storage level is set on the DStream, but there doesn't seem to be a way to pass it to the Receiver. At the same time, setting the storage level separately on the Receiver seems to introduce potential confusion as the storage level of the DStream can be set separately. Is this desired behavior---to have distinct DStream and Receiver storage levels? Perhaps I'm missing something? Also, the storageLevel property of the Receiver[T] class is undocumented. Cheers, Michael - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
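For anyone following along, this is roughly the shape of the API in question: a custom receiver gets its storage level through the Receiver constructor, independently of whatever level is set on the DStream. A minimal sketch; the class name and body are hypothetical:

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Hypothetical custom receiver; the only point here is how the storage level is supplied.
public class MyReceiver extends Receiver<String> {

  public MyReceiver(StorageLevel storageLevel) {
    // The Receiver's storage level is fixed at construction time...
    super(storageLevel);
  }

  @Override
  public void onStart() {
    // ...and applies to data handed to store(...). A real receiver would start a
    // background thread here that calls store(record) for each received record.
  }

  @Override
  public void onStop() {
    // Stop any threads started in onStart().
  }
}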
Re: Multitenancy in Spark - within/across spark context
You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174. On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Upvote for the multitanency requirement. I'm also building a data analytic platform and there'll be multiple users running queries and computations simultaneously. One of the paint point is control of resource size. Users don't really know how much nodes they need, they always use as much as possible... The result is lots of wasted resource in our Yarn cluster. A way to 1) allow multiple spark context to share the same resource or 2) add dynamic resource management for Yarn mode is very much wanted. Jianshi On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: That's not something you might want to do usually. In general, a SparkContext maps to a user application My question was basically this. In this page in the official doc, under Scheduling within an application section, it talks about multiuser and fair sharing within an app. How does multiuser within an application work(how users connect to an app,run their stuff) ? When would I want to use this ? I see. The way I read that page is that Spark supports all those scheduling options; but Spark doesn't give you the means to actually be able to submit jobs from different users to a running SparkContext hosted on a different process. For that, you'll need something like the job server that I referenced before, or write your own framework for supporting that. Personally, I'd use the information on that page when dealing with concurrent jobs in the same SparkContext, but still restricted to the same user. I'd avoid trying to create any application where a single SparkContext is trying to be shared by multiple users in any way. As far as I understand, this will cause executors to be killed, which means that Spark will start retrying tasks to rebuild the data that was held by those executors when needed. I basically wanted to find out if there were any gotchas related to preemption on Spark. Things like say half of an application's executors got preempted say while doing reduceByKey, will the application progress with the remaining resources/fair share ? Jobs should still make progress as long as at least one executor is available. The gotcha would be the one I mentioned, where Spark will fail your job after x executors failed, which might be a common occurrence when preemption is enabled. That being said, it's a configurable option, so you can set x to a very large value and your job should keep on chugging along. The options you'd want to take a look at are: spark.task.maxFailures and spark.yarn.max.executor.failures -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
scalastyle annoys me a little bit
100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert
Re: scalastyle annoys me a little bit
Hey Koert, I think disabling the style checks in maven package could be a good idea for the reason you point out. I was sort of mixed on that when it was proposed for this exact reason. It's just annoying to developers. In terms of changing the global limit, this is more religion than anything else, but there are other cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: scalastyle annoys me a little bit
I know this is all very subjective, but I find long lines difficult to read. I also like how 100 characters fit in my editor setup fine (split wide screen), while a longer line length would mean I can't have two buffers side-by-side without horizontal scrollbars. I think it's fine to add a switch to skip the style tests, but then, you'll still have to fix the issues at some point... On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: scalastyle annoys me a little bit
Koert: Have you tried adding the following on your commandline ? -Dscalastyle.failOnViolation=false Cheers On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Koert, I think disabling the style checks in maven package could be a good idea for the reason you point out. I was sort of mixed on that when it was proposed for this exact reason. It's just annoying to developers. In terms of changing the global limit, this is more religion than anything else, but there are other cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Spark 1.2 feature freeze on November 1
Hey All, Just a reminder that as planned [1] we'll go into a feature freeze on November 1. On that date I'll cut a 1.2 release branch and make the up-or-down call on any patches that go into that branch, along with individual committers.

It is common for us to receive a very large volume of patches near the deadline. The highest priority will be fixes and features that are in review and were submitted earlier in the window. As a heads up, new feature patches that are submitted in the next week have a good chance of being pushed after 1.2.

During the coming weeks, I'd like to invite the community to help with code review, testing patches, helping isolate bugs, our test infra, etc. In past releases, community participation has helped increase our ability to merge patches substantially. Individuals really can make a huge difference here!

[1] https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: scalastyle annoys me a little bit
Hey Ted, i tried: mvn clean package -DskipTests -Dscalastyle.failOnViolation=false no luck, still get [ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle violation(s). - [Help 1] On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote: Koert: Have you tried adding the following on your commandline ? -Dscalastyle.failOnViolation=false Cheers On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Koert, I think disabling the style checks in maven package could be a good idea for the reason you point out. I was sort of mixed on that when it was proposed for this exact reason. It's just annoying to developers. In terms of changing the global limit, this is more religion than anything else, but there are other cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: PR for Hierarchical Clustering Needs Review
Hi RJ, We are close to the v1.2 feature freeze deadline, so I'm busy with the pipeline feature and a couple of bugs. I will ask other developers to help review the PR. Thanks for working with Yu and helping with the code review! Best, Xiangrui On Thu, Oct 23, 2014 at 2:58 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, A few months ago, I collected feedback on what the community was looking for in clustering methods. A number of the community members requested a divisive hierarchical clustering method. Yu Ishikawa has stepped up to implement such a method. I've been working with him to communicate what I heard the community request and to review and improve his code. You can find the JIRA here: https://issues.apache.org/jira/browse/SPARK-2429 He has now submitted a PR: https://github.com/apache/spark/pull/2906 I was hoping Xiangrui, other committers, and community members would be willing to take a look? It's quite a large patch so it'll need extra attention. Thank you, RJ -- em rnowl...@gmail.com c 954.496.2314 - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: scalastyle annoys me a little bit
Koert: If you have time, you can try this diff - with which you would be able to specify the following on the command line: -Dscalastyle.failonviolation=false

diff --git a/pom.xml b/pom.xml
index 687cc63..108585e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -123,6 +123,7 @@
     <log4j.version>1.2.17</log4j.version>
     <hadoop.version>1.0.4</hadoop.version>
     <protobuf.version>2.4.1</protobuf.version>
+    <scalastyle.failonviolation>true</scalastyle.failonviolation>
     <yarn.version>${hadoop.version}</yarn.version>
     <hbase.version>0.94.6</hbase.version>
     <flume.version>1.4.0</flume.version>
@@ -1071,7 +1072,7 @@
         <version>0.4.0</version>
         <configuration>
           <verbose>false</verbose>
-          <failOnViolation>true</failOnViolation>
+          <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
           <includeTestSourceDirectory>false</includeTestSourceDirectory>
           <failOnWarning>false</failOnWarning>
           <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>

On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote: Hey Ted, i tried: mvn clean package -DskipTests -Dscalastyle.failOnViolation=false no luck, still get [ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle violation(s). - [Help 1] On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote: Koert: Have you tried adding the following on your commandline ? -Dscalastyle.failOnViolation=false Cheers On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Koert, I think disabling the style checks in maven package could be a good idea for the reason you point out. I was sort of mixed on that when it was proposed for this exact reason. It's just annoying to developers. In terms of changing the global limit, this is more religion than anything else, but there are other cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: scalastyle annoys me a little bit
great thanks i will do that On Thu, Oct 23, 2014 at 3:55 PM, Ted Yu yuzhih...@gmail.com wrote: Koert: If you have time, you can try this diff - with which you would be able to specify the following on the command line: -Dscalastyle.failonviolation=false

diff --git a/pom.xml b/pom.xml
index 687cc63..108585e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -123,6 +123,7 @@
     <log4j.version>1.2.17</log4j.version>
     <hadoop.version>1.0.4</hadoop.version>
     <protobuf.version>2.4.1</protobuf.version>
+    <scalastyle.failonviolation>true</scalastyle.failonviolation>
     <yarn.version>${hadoop.version}</yarn.version>
     <hbase.version>0.94.6</hbase.version>
     <flume.version>1.4.0</flume.version>
@@ -1071,7 +1072,7 @@
         <version>0.4.0</version>
         <configuration>
           <verbose>false</verbose>
-          <failOnViolation>true</failOnViolation>
+          <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
           <includeTestSourceDirectory>false</includeTestSourceDirectory>
           <failOnWarning>false</failOnWarning>
           <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>

On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote: Hey Ted, i tried: mvn clean package -DskipTests -Dscalastyle.failOnViolation=false no luck, still get [ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle violation(s). - [Help 1] On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote: Koert: Have you tried adding the following on your commandline ? -Dscalastyle.failOnViolation=false Cheers On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Koert, I think disabling the style checks in maven package could be a good idea for the reason you point out. I was sort of mixed on that when it was proposed for this exact reason. It's just annoying to developers. In terms of changing the global limit, this is more religion than anything else, but there are other cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: scalastyle annoys me a little bit
Created SPARK-4066 and attached patch there. On Thu, Oct 23, 2014 at 1:07 PM, Koert Kuipers ko...@tresata.com wrote: great thanks i will do that On Thu, Oct 23, 2014 at 3:55 PM, Ted Yu yuzhih...@gmail.com wrote: Koert: If you have time, you can try this diff - with which you would be able to specify the following on the command line: -Dscalastyle.failonviolation=false

diff --git a/pom.xml b/pom.xml
index 687cc63..108585e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -123,6 +123,7 @@
     <log4j.version>1.2.17</log4j.version>
     <hadoop.version>1.0.4</hadoop.version>
     <protobuf.version>2.4.1</protobuf.version>
+    <scalastyle.failonviolation>true</scalastyle.failonviolation>
     <yarn.version>${hadoop.version}</yarn.version>
     <hbase.version>0.94.6</hbase.version>
     <flume.version>1.4.0</flume.version>
@@ -1071,7 +1072,7 @@
         <version>0.4.0</version>
         <configuration>
           <verbose>false</verbose>
-          <failOnViolation>true</failOnViolation>
+          <failOnViolation>${scalastyle.failonviolation}</failOnViolation>
           <includeTestSourceDirectory>false</includeTestSourceDirectory>
           <failOnWarning>false</failOnWarning>
           <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>

On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote: Hey Ted, i tried: mvn clean package -DskipTests -Dscalastyle.failOnViolation=false no luck, still get [ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-core_2.10: Failed during scalastyle execution: You have 3 Scalastyle violation(s). - [Help 1] On Thu, Oct 23, 2014 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote: Koert: Have you tried adding the following on your commandline ? -Dscalastyle.failOnViolation=false Cheers On Thu, Oct 23, 2014 at 11:07 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Koert, I think disabling the style checks in maven package could be a good idea for the reason you point out. I was sort of mixed on that when it was proposed for this exact reason. It's just annoying to developers. In terms of changing the global limit, this is more religion than anything else, but there are other cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of constructors and such, only to realise later it was because of the 100 character limit. also, i find mvn package erroring out because of style errors somewhat excessive. i understand that a pull request needs to conform to the style before being accepted, but this means i cant even run tests on code that does not conform to the style guide, which is a bit silly. i keep going out for coffee while package and tests run, only to come back for an annoying error that my line is 101 characters and therefore nothing ran. is there some maven switch to disable the style checks? best! koert - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
label points with a given index
SparkConf conf = new SparkConf().setAppName("LogisticRegression").setMaster("local[4]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("some.csv");
JavaRDD<LabeledPoint> lPoints = lines.map(new CSVLineParser());

Is there any way to pass an index to the function? For example, instead of hard-coding parts[0] below, is there a way to pass it in?

public class CSVLineParser implements Function<String, LabeledPoint> {
    private static final Pattern COMMA = Pattern.compile(",");

    @Override
    public LabeledPoint call(String line) {
        String[] parts = COMMA.split(line);
        double y = Double.parseDouble(parts[0]);       // label column is hard-coded to index 0
        double[] x = new double[parts.length - 1];     // remaining columns are the features
        for (int i = 1; i < parts.length; ++i) {
            x[i - 1] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: label points with a given index
Figured out that a constructor can be used for this purpose. On 10/24/14 7:57 AM, Lochana Menikarachchi wrote:

SparkConf conf = new SparkConf().setAppName("LogisticRegression").setMaster("local[4]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("some.csv");
JavaRDD<LabeledPoint> lPoints = lines.map(new CSVLineParser());

Is there any way to pass an index to the function? For example, instead of hard-coding parts[0] below, is there a way to pass it in?

public class CSVLineParser implements Function<String, LabeledPoint> {
    private static final Pattern COMMA = Pattern.compile(",");

    @Override
    public LabeledPoint call(String line) {
        String[] parts = COMMA.split(line);
        double y = Double.parseDouble(parts[0]);       // label column is hard-coded to index 0
        double[] x = new double[parts.length - 1];     // remaining columns are the features
        for (int i = 1; i < parts.length; ++i) {
            x[i - 1] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
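Following up on the constructor idea, here is a hypothetical variant of the parser that takes the label column index as a constructor argument; the class name and field name are illustrative, not from the original post:

import java.util.regex.Pattern;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Hypothetical sketch: the label column index is supplied at construction time.
public class IndexedCSVLineParser implements Function<String, LabeledPoint> {
    private static final Pattern COMMA = Pattern.compile(",");
    private final int labelIndex;

    public IndexedCSVLineParser(int labelIndex) {
        this.labelIndex = labelIndex;
    }

    @Override
    public LabeledPoint call(String line) {
        String[] parts = COMMA.split(line);
        double y = Double.parseDouble(parts[labelIndex]);
        double[] x = new double[parts.length - 1];
        int j = 0;
        for (int i = 0; i < parts.length; i++) {
            if (i != labelIndex) {
                x[j++] = Double.parseDouble(parts[i]);   // every column except the label
            }
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}

It would then be used as lines.map(new IndexedCSVLineParser(0)), so the same parser can take its label from any column.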
Re: Multitenancy in Spark - within/across spark context
Ashwin, I would say the strategies in general are: 1) Have each user submit separate Spark app (each its own Spark Context), with its own resource settings, and share data through HDFS or something like Tachyon for speed. 2) Share a single spark context amongst multiple users, using fair scheduler. This is sort of like having a Hadoop resource pool.It has some obvious HA/SPOF issues, namely that if the context dies then every user using it is also dead. Also, sharing RDDs in cached memory has the same resiliency problems, namely that if any executor dies then Spark must recompute / rebuild the RDD (it tries to only rebuild the missing part, but sometimes it must rebuild everything). Job server can help with 1 or 2, 2 in particular. If you have any questions about job server, feel free to ask at the spark-jobserver google group. I am the maintainer. -Evan On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin van...@cloudera.com wrote: You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174. On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Upvote for the multitanency requirement. I'm also building a data analytic platform and there'll be multiple users running queries and computations simultaneously. One of the paint point is control of resource size. Users don't really know how much nodes they need, they always use as much as possible... The result is lots of wasted resource in our Yarn cluster. A way to 1) allow multiple spark context to share the same resource or 2) add dynamic resource management for Yarn mode is very much wanted. Jianshi On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: That's not something you might want to do usually. In general, a SparkContext maps to a user application My question was basically this. In this page in the official doc, under Scheduling within an application section, it talks about multiuser and fair sharing within an app. How does multiuser within an application work(how users connect to an app,run their stuff) ? When would I want to use this ? I see. The way I read that page is that Spark supports all those scheduling options; but Spark doesn't give you the means to actually be able to submit jobs from different users to a running SparkContext hosted on a different process. For that, you'll need something like the job server that I referenced before, or write your own framework for supporting that. Personally, I'd use the information on that page when dealing with concurrent jobs in the same SparkContext, but still restricted to the same user. I'd avoid trying to create any application where a single SparkContext is trying to be shared by multiple users in any way. As far as I understand, this will cause executors to be killed, which means that Spark will start retrying tasks to rebuild the data that was held by those executors when needed. I basically wanted to find out if there were any gotchas related to preemption on Spark. Things like say half of an application's executors got preempted say while doing reduceByKey, will the application progress with the remaining resources/fair share ? Jobs should still make progress as long as at least one executor is available. The gotcha would be the one I mentioned, where Spark will fail your job after x executors failed, which might be a common occurrence when preemption is enabled. 
That being said, it's a configurable option, so you can set x to a very large value and your job should keep on chugging along. The options you'd want to take a look at are: spark.task.maxFailures and spark.yarn.max.executor.failures -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
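To illustrate what strategy 2 above (one shared context with the fair scheduler) looks like in practice, here is a minimal sketch; the app name, pool name, and file path are illustrative, and pools would normally be defined in a fairscheduler.xml allocation file:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SharedContextFairScheduling {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("SharedContext")
        .setMaster("local[4]")
        .set("spark.scheduler.mode", "FAIR");
        // Optionally point spark.scheduler.allocation.file at a fairscheduler.xml
        // that defines pool weights and minimum shares.
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Jobs submitted from the thread serving "userA" run in that user's pool.
    sc.setLocalProperty("spark.scheduler.pool", "userA");
    long n = sc.parallelize(Arrays.asList(1, 2, 3)).count();
    System.out.println(n);
    sc.setLocalProperty("spark.scheduler.pool", null);   // clear before reusing the thread

    sc.stop();
  }
}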