How to change the values in Array of Bytes
Hi, I have an array of bytes and I have filled the array with 0 in all positions: *var Array = Array.fill[Byte](10)(0)*. Now, if certain conditions are satisfied, I want to change some elements of the array from 0 to 1. If I run *if (Array.apply(index)==0) Array.apply(index) = 1* it returns an error. But if I assign *Array.apply(index)* to a variable and do the same thing, it works. I do not want to assign this to a variable, because I would end up creating a lot of variables. Can anyone tell me a way to do this? Thank You
Re: Support R in Spark
Cool! It is a very good news. Can’t wait for it. Kui On Sep 5, 2014, at 1:58 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will be integrating this with Spark's MLLib. At a high-level this will allow R users to use a familiar API but make use of MLLib's efficient distributed implementation. This is the same strategy used in Python as well. Also we do hope to merge SparkR with mainline Spark -- we have a few features to complete before that and plan to shoot for integration by Spark 1.3. Thanks Shivaram On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote: Thanks, Shivaram. No specific use case yet. We try to use R in our project as data scientest are all knowing R. We had a concern that how R handles the mass data. Spark does a better work on big data area, and Spark ML is focusing on predictive analysis area. Then we are thinking whether we can merge R and Spark together. We tried SparkR and it is pretty easy to use. But we didn’t see any feedback on this package in industry. It will be better if Spark team has R support just like scala/Java/Python. Another question is that MLlib will re-implement all famous data mining algorithms in Spark, then what is the purpose of using R? There is another technique for us H2O which support R natively. H2O is more friendly to data scientist. I saw H2O can also work on Spark (Sparkling Water). It is better than using SparkR? Thanks and Regards. Kui On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hi Do you have a specific use-case where SparkR doesn't work well ? We'd love to hear more about use-cases and features that can be improved with SparkR. Thanks Shivaram On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote: Does spark ML team have plan to support R script natively? There is a SparkR project, but not from spark team. Spark ML used netlib-java to talk with native fortran routines or use NumPy, why not try to use R in some sense. R had lot of useful packages. If spark ML team can include R support, it will be a very powerful. Any comment? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: error: type mismatch while Union
I am using Spark version 1.0.2 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to change the values in Array of Bytes
More of a Scala question than Spark, but apply here can be written with just parentheses, like this:

val array = Array.fill[Byte](10)(0)
if (array(index) == 0) { array(index) = 1 }

The second expression, array(index) = 1, is actually not calling apply but update. It's a Scala-ism that's usually invisible. The above is equivalent to:

if (array.apply(index) == 0) { array.update(index, 1) }

On Sat, Sep 6, 2014 at 2:09 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have an array of bytes and I have filled the array with 0 in all positions: *var Array = Array.fill[Byte](10)(0)*. Now, if certain conditions are satisfied, I want to change some elements of the array from 0 to 1. If I run *if (Array.apply(index)==0) Array.apply(index) = 1* it returns an error. But if I assign *Array.apply(index)* to a variable and do the same thing, it works. I do not want to assign this to a variable, because I would end up creating a lot of variables. Can anyone tell me a way to do this? Thank You
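For completeness, a self-contained sketch of the pattern above that can be pasted into the Scala REPL; the index value and the expected output in the comments are illustrative, not taken from the original post.

```scala
val array = Array.fill[Byte](10)(0)   // ten bytes, all 0
val index = 3                          // illustrative index

// array(index) on the right-hand side desugars to array.apply(index);
// array(index) = 1 on the left-hand side desugars to array.update(index, 1),
// so no temporary variable is needed.
if (array(index) == 0) {
  array(index) = 1
}

println(array.mkString(", "))          // expect: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0
```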
Re: error: type mismatch while Union
Are you doing this from the spark-shell? You're probably running into https://issues.apache.org/jira/browse/SPARK-1199 which should be fixed in 1.1. On Sat, Sep 6, 2014 at 3:03 AM, Dhimant dhimant84.jays...@gmail.com wrote: I am using Spark version 1.0.2 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
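For anyone hitting the same error: SPARK-1199 concerns classes defined inside the spark-shell REPL itself. A hedged sketch of the kind of code that can trigger it; the case class and data here are assumptions, not taken from the original report.

```scala
// Entered line by line in spark-shell on an affected release (pre-1.1):
case class Record(id: Int, value: String)      // case class defined in the REPL

val a = sc.parallelize(Seq(Record(1, "a")))
val b = sc.parallelize(Seq(Record(2, "b")))

// Under SPARK-1199 this union can be rejected with a type mismatch even though
// both sides look like RDD[Record], because the REPL wraps each input line in
// its own generated class. Compiling the case class into a jar, or upgrading
// to a release containing the fix, avoids the problem.
val both = a.union(b)
```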
Re: question on replicate() in blockManager.scala
Looks like that's BlockManagerWorker.syncPutBlock(), which is in an if check, perhaps obscuring its existence.

On Fri, Sep 5, 2014 at 2:19 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi,

var cachedPeers: Seq[BlockManagerId] = null
private def replicate(blockId: String, data: ByteBuffer, level: StorageLevel) {
  val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1)
  if (cachedPeers == null) {
    cachedPeers = master.getPeers(blockManagerId, level.replication - 1)
  }
  for (peer: BlockManagerId <- cachedPeers) {
    val start = System.nanoTime
    data.rewind()
    logDebug("Try to replicate BlockId " + blockId + " once; The size of the data is " + data.limit() + " Bytes. To node: " + peer)
    if (!BlockManagerWorker.syncPutBlock(PutBlock(blockId, data, tLevel), new ConnectionManagerId(peer.host, peer.port))) {
      logError("Failed to call syncPutBlock to " + peer)
    }
    logDebug("Replicated BlockId " + blockId + " once used " + (System.nanoTime - start) / 1e6 + " s; The size of the data is " + data.limit() + " bytes.")
  }
}

I get the flow of this code, but I don't find any method being called for actually writing the data into the set of peers chosen for replication. Where exactly is the replication happening? Thank you!! -Karthik
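To make that call easier to spot, here is the relevant fragment of the replicate() method quoted above, with comments added; the code itself is taken from the question and has not been re-checked against the Spark sources.

```scala
// The actual write to the remote peer happens inside the if condition:
// BlockManagerWorker.syncPutBlock(...) ships the block to the peer's
// connection manager and returns false on failure, which is why it only
// appears as part of the error check rather than as a standalone call.
if (!BlockManagerWorker.syncPutBlock(PutBlock(blockId, data, tLevel),
    new ConnectionManagerId(peer.host, peer.port))) {
  logError("Failed to call syncPutBlock to " + peer)
}
```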
Re: Task not serializable
I disagree that the generally right change is to try to make the classes serializable. Usually, classes that are not serializable are not supposed to be serialized. You're using them in a way that's causing them to be serialized, and that's probably not desired.

For example, this is wrong:

val foo: SomeUnserializableManagerClass = ...
rdd.map(d => foo.bar(d))

This is right:

rdd.map { d =>
  val foo: SomeUnserializableManagerClass = ...
  foo.bar(d)
}

In the first instance, you create the object on the driver and try to serialize and copy it to workers. In the second, you're creating SomeUnserializableManagerClass in the function and therefore on the worker. mapPartitions is better if this creation is expensive.

On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi, I'm trying to migrate a map-reduce program to work with Spark. I migrated the program from Java to Scala. The map-reduce program basically loads an HDFS file, and for each line in the file it applies several transformation functions available in various external libraries. When I execute this over Spark, it throws Task not serializable exceptions for each and every class used from these external libraries. I added serialization to a few classes which are in my scope, but there are several other classes which are out of my scope, like org.apache.hadoop.io.Text. How can I overcome these exceptions? ~Sarath.
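Since mapPartitions is only mentioned in passing above, here is a hedged sketch of that variant, reusing the stand-in names from the example and assuming the class has a no-arg constructor.

```scala
// Build the expensive, non-serializable helper once per partition, on the
// worker, rather than once per element or once on the driver.
rdd.mapPartitions { iter =>
  val foo = new SomeUnserializableManagerClass()
  iter.map(d => foo.bar(d))
}
```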
Re: Getting the type of an RDD in spark AND pyspark
Pretty easy to do in Scala: rdd.elementClassTag.runtimeClass You can access this method from Python as well by using the internal _jrdd. It would look something like this (warning, I have not tested it): rdd._jrdd.classTag().runtimeClass() (The method name is classTag for JavaRDDLike, and elementClassTag for Scala's RDD.) On Thu, Sep 4, 2014 at 1:32 PM, esamanas evan.sama...@gmail.com wrote: Hi, I'm new to spark and scala, so apologies if this is obvious. Every RDD appears to be typed, which I can see by seeing the output in the spark-shell when I execute 'take': scala val t = sc.parallelize(Array(1,2,3)) t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at console:12 scala t.take(3) res4: Array[Int] = Array(1, 2, 3) scala val u = sc.parallelize(Array(1,Array(2,2,2,2,2),3)) u: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[3] at parallelize at console:12 scala u.take(3) res5: Array[Any] = Array(1, Array(2, 2, 2, 2, 2), 3) Array type stays the same even if only one type returned. scala u.take(1) res6: Array[Any] = Array(1) Is there some way to just get the name of the type of the entire RDD from some function call? Also, I would really like this same functionality in pyspark, so I'm wondering if that exists on that side, since clearly the underlying RDD is typed (I'd be fine with either the Scala or Python type name). Thank you, Evan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-type-of-an-RDD-in-spark-AND-pyspark-tp13498.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
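Following the Scala suggestion above, a small sketch that could be pasted into spark-shell; the values in the comments are what I would expect, not output copied from the thread.

```scala
val t = sc.parallelize(Array(1, 2, 3))                   // RDD[Int]
println(t.elementClassTag.runtimeClass)                  // int

val u = sc.parallelize(Array[Any](1, Array(2, 2), 3))    // RDD[Any]
println(u.elementClassTag.runtimeClass)                  // class java.lang.Object
```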
unsubscribe
Re: unsubscribe
Unsubscribe On Sep 6, 2014, at 7:48 AM, Murali Raju murali.r...@infrastacks.com wrote:
Re: Support R in Spark
Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, on which they can apply R and other familiar data mining/machine learning idioms, without having to know about the distributed representation underneath. Now, you can get to the underlying RDDs if you want to, simply by asking for it. This was launched at the July Spark Summit. See http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us . Sent while mobile. Please excuse typos etc. On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will be integrating this with Spark's MLLib. At a high-level this will allow R users to use a familiar API but make use of MLLib's efficient distributed implementation. This is the same strategy used in Python as well. Also we do hope to merge SparkR with mainline Spark -- we have a few features to complete before that and plan to shoot for integration by Spark 1.3. Thanks Shivaram On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote: Thanks, Shivaram. No specific use case yet. We try to use R in our project as data scientest are all knowing R. We had a concern that how R handles the mass data. Spark does a better work on big data area, and Spark ML is focusing on predictive analysis area. Then we are thinking whether we can merge R and Spark together. We tried SparkR and it is pretty easy to use. But we didn’t see any feedback on this package in industry. It will be better if Spark team has R support just like scala/Java/Python. Another question is that MLlib will re-implement all famous data mining algorithms in Spark, then what is the purpose of using R? There is another technique for us H2O which support R natively. H2O is more friendly to data scientist. I saw H2O can also work on Spark (Sparkling Water). It is better than using SparkR? Thanks and Regards. Kui On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hi Do you have a specific use-case where SparkR doesn't work well ? We'd love to hear more about use-cases and features that can be improved with SparkR. Thanks Shivaram On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote: Does spark ML team have plan to support R script natively? There is a SparkR project, but not from spark team. Spark ML used netlib-java to talk with native fortran routines or use NumPy, why not try to use R in some sense. R had lot of useful packages. If spark ML team can include R support, it will be a very powerful. Any comment? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Task not serializable
Thanks Alok, Sean. As suggested by Sean, I tried a sample program. I wrote a function that referenced a class from a third-party library that is not serializable, and passed that function to my map. On executing it I got the same exception. Then I modified the program, removed the function, and wrote its contents as an anonymous function inside the map. This time the execution succeeded. I understood Sean's explanation, but I would appreciate references to a more detailed explanation, and examples of writing efficient Spark programs that avoid such pitfalls. ~Sarath

On 06-Sep-2014 4:32 pm, Sean Owen so...@cloudera.com wrote: I disagree that the generally right change is to try to make the classes serializable. Usually, classes that are not serializable are not supposed to be serialized. You're using them in a way that's causing them to be serialized, and that's probably not desired. For example, this is wrong: val foo: SomeUnserializableManagerClass = ... rdd.map(d => foo.bar(d)) This is right: rdd.map { d => val foo: SomeUnserializableManagerClass = ... foo.bar(d) } In the first instance, you create the object on the driver and try to serialize and copy it to workers. In the second, you're creating SomeUnserializableManagerClass in the function and therefore on the worker. mapPartitions is better if this creation is expensive. On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi, I'm trying to migrate a map-reduce program to work with Spark. I migrated the program from Java to Scala. The map-reduce program basically loads an HDFS file, and for each line in the file it applies several transformation functions available in various external libraries. When I execute this over Spark, it throws Task not serializable exceptions for each and every class used from these external libraries. I added serialization to a few classes which are in my scope, but there are several other classes which are out of my scope, like org.apache.hadoop.io.Text. How can I overcome these exceptions? ~Sarath.
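As a further illustration of why the method version fails while the anonymous function succeeds, here is a hedged reconstruction of the experiment described above; ThirdPartyParser, Job, and the field and method names are invented for illustration, not taken from the original code.

```scala
import org.apache.spark.rdd.RDD

class ThirdPartyParser { def parse(line: String): String = line.trim }   // not Serializable

class Job {
  val parser = new ThirdPartyParser()             // lives on the driver

  // Fails: parseLine is a method of Job, so passing it to map captures `this`,
  // and Spark then has to serialize the whole Job (including the parser field),
  // which throws Task not serializable.
  def parseLine(line: String): String = parser.parse(line)
  def runFailing(rdd: RDD[String]): RDD[String] = rdd.map(parseLine)

  // Works: the anonymous function builds everything it needs inside the
  // closure, on the worker, so nothing non-serializable is shipped from the driver.
  def runWorking(rdd: RDD[String]): RDD[String] =
    rdd.map(line => new ThirdPartyParser().parse(line))
}
```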
Re: How spark parallelize maps Slices to tasks/executors/workers
On 09/04/2014 09:55 PM, Mozumder, Monir wrote:

I have this 2-node cluster setup, where each node has 4 cores. MASTER (Worker-on-master) (Worker-on-node1) (slaves(master,node1)) SPARK_WORKER_INSTANCES=1

I am trying to understand Spark's parallelize behavior. The SparkPi example has this code:

val slices = 8
val n = 10 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)

As per the documentation: "Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster." I set slices to 8, which means the working set will be divided among 8 tasks on the cluster; in turn each worker node gets 4 tasks (1:1 per core).

Questions:

i) Where can I see task-level details? Inside executors I don't see the task breakdown, so I can't see the effect of slices on the UI.

Under http://localhost:4040/stages/ you can drill into individual stages to see task details.

ii) How do I programmatically find the working-set size for the map function above? I assume it is n/slices (10 above).

It'll be roughly n/slices. You can mapPartitions() and check their length.

iii) Are the multiple tasks run by an executor run sequentially, or in parallel in multiple threads?

In parallel. Have a look at https://spark.apache.org/docs/latest/cluster-overview.html

iv) What is the reasoning behind 2-4 slices per CPU?

Things like 2-4 slices per CPU are general rules of thumb because tasks are often more IO-bound than not. Depending on your workload this might change. It's probably one of the last things you'll want to optimize, the first being the transformation ordering in your DAG.

v) I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to number of

Bests, -Monir

best, matt
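A hedged sketch of the inspection suggestions above (partition count and per-partition size via mapPartitions); the RDD is a stand-in built the same way as the SparkPi example, and the element count is illustrative.

```scala
val slices = 8
val n = 100000 * slices                  // illustrative size, not the thread's value
val data = sc.parallelize(1 to n, slices)

println(data.partitions.length)          // number of slices, i.e. number of tasks

// Size of each partition, roughly the per-task working set (about n / slices):
val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
println(sizes.mkString(", "))
```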
Spark SQL check if query is completed (pyspark)
Hi, I am using Spark SQL to run some administrative queries and joins (e.g. create table, insert overwrite, etc), where the query does not return any data. I noticed if the query fails it prints some error message on the console, but does not actually throw an exception (this is spark 1.0.2). Is there any way to get these errors from the returned object? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: unsubscribe
To unsubscribe send an email to user-unsubscr...@spark.apache.org Links to sub/unsub are here: https://spark.apache.org/community.html On Sat, Sep 6, 2014 at 7:52 AM, Derek Schoettle dscho...@us.ibm.com wrote: Unsubscribe On Sep 6, 2014, at 7:48 AM, Murali Raju murali.r...@infrastacks.com wrote:
Re: Support R in Spark
Thanks, Christopher. I saw it before, it is amazing. Last time I try to download it from adatao, but no response after filling the table. How can I download it or its source code? What is the license? Kui On Sep 6, 2014, at 8:08 PM, Christopher Nguyen c...@adatao.com wrote: Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, on which they can apply R and other familiar data mining/machine learning idioms, without having to know about the distributed representation underneath. Now, you can get to the underlying RDDs if you want to, simply by asking for it. This was launched at the July Spark Summit. See http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us . Sent while mobile. Please excuse typos etc. On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will be integrating this with Spark's MLLib. At a high-level this will allow R users to use a familiar API but make use of MLLib's efficient distributed implementation. This is the same strategy used in Python as well. Also we do hope to merge SparkR with mainline Spark -- we have a few features to complete before that and plan to shoot for integration by Spark 1.3. Thanks Shivaram On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote: Thanks, Shivaram. No specific use case yet. We try to use R in our project as data scientest are all knowing R. We had a concern that how R handles the mass data. Spark does a better work on big data area, and Spark ML is focusing on predictive analysis area. Then we are thinking whether we can merge R and Spark together. We tried SparkR and it is pretty easy to use. But we didn’t see any feedback on this package in industry. It will be better if Spark team has R support just like scala/Java/Python. Another question is that MLlib will re-implement all famous data mining algorithms in Spark, then what is the purpose of using R? There is another technique for us H2O which support R natively. H2O is more friendly to data scientist. I saw H2O can also work on Spark (Sparkling Water). It is better than using SparkR? Thanks and Regards. Kui On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hi Do you have a specific use-case where SparkR doesn't work well ? We'd love to hear more about use-cases and features that can be improved with SparkR. Thanks Shivaram On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote: Does spark ML team have plan to support R script natively? There is a SparkR project, but not from spark team. Spark ML used netlib-java to talk with native fortran routines or use NumPy, why not try to use R in some sense. R had lot of useful packages. If spark ML team can include R support, it will be a very powerful. Any comment? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL check if query is completed (pyspark)
The SQLContext.sql() will return an SchemaRDD, you need to call collect() to pull the data in. On Sat, Sep 6, 2014 at 6:02 AM, jamborta jambo...@gmail.com wrote: Hi, I am using Spark SQL to run some administrative queries and joins (e.g. create table, insert overwrite, etc), where the query does not return any data. I noticed if the query fails it prints some error message on the console, but does not actually throw an exception (this is spark 1.0.2). Is there any way to get these errors from the returned object? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
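In Scala terms (the PySpark calls mirror this), a hedged sketch of the suggestion above: force the query with an action so that a failure surfaces as an exception you can handle. The query string and the sqlContext variable are placeholders, not taken from the original job.

```scala
import scala.util.{Failure, Success, Try}

// sql() only builds a SchemaRDD (the query plan); per the reply above, nothing
// executes until an action such as collect() or count() is called on it.
val plan = sqlContext.sql("INSERT OVERWRITE TABLE target SELECT * FROM source")

Try(plan.collect()) match {
  case Success(rows) => println(s"query completed, ${rows.length} rows returned")
  case Failure(err)  => println(s"query failed: ${err.getMessage}")
}
```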
Re: Getting the type of an RDD in spark AND pyspark
But you can not get what you expected in PySpark, because the RDD in Scala is serialized, so it will always be RDD[Array[Byte]], whatever the type of RDD in Python is. Davies On Sat, Sep 6, 2014 at 4:09 AM, Aaron Davidson ilike...@gmail.com wrote: Pretty easy to do in Scala: rdd.elementClassTag.runtimeClass You can access this method from Python as well by using the internal _jrdd. It would look something like this (warning, I have not tested it): rdd._jrdd.classTag().runtimeClass() (The method name is classTag for JavaRDDLike, and elementClassTag for Scala's RDD.) On Thu, Sep 4, 2014 at 1:32 PM, esamanas evan.sama...@gmail.com wrote: Hi, I'm new to spark and scala, so apologies if this is obvious. Every RDD appears to be typed, which I can see by seeing the output in the spark-shell when I execute 'take': scala val t = sc.parallelize(Array(1,2,3)) t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at console:12 scala t.take(3) res4: Array[Int] = Array(1, 2, 3) scala val u = sc.parallelize(Array(1,Array(2,2,2,2,2),3)) u: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[3] at parallelize at console:12 scala u.take(3) res5: Array[Any] = Array(1, Array(2, 2, 2, 2, 2), 3) Array type stays the same even if only one type returned. scala u.take(1) res6: Array[Any] = Array(1) Is there some way to just get the name of the type of the entire RDD from some function call? Also, I would really like this same functionality in pyspark, so I'm wondering if that exists on that side, since clearly the underlying RDD is typed (I'd be fine with either the Scala or Python type name). Thank you, Evan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-type-of-an-RDD-in-spark-AND-pyspark-tp13498.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Q: About scenarios where driver execution flow may block...
Hello friends: I have a theory question about call blocking in a Spark driver. Consider this (admittedly contrived =:)) snippet to illustrate the question...

x = rdd01.reduceByKey() # or maybe some other 'shuffle-requiring action'.
b = sc.broadcast(x.take(20)) # Or any statement that requires the previous statement to complete, cluster-wide.
y = rdd02.someAction(f(b))

Would the first or second statement above block because the second (or third) statement needs to wait for the previous one to complete, cluster-wide? Maybe this isn't the best example (typed on a phone), but generally I'm trying to understand the scenario(s) where an RDD call in the driver may block because the graph indicates that the next statement depends on the completion of the current one, cluster-wide (not just lazily evaluated). Thank you. :) Sincerely yours, Team Dimension Data
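A hedged, annotated Scala version of the snippet above; rdd01, rdd02, and f are assumed to exist, and the comments describe standard lazy-evaluation semantics rather than anything confirmed later in this thread.

```scala
val x = rdd01.reduceByKey(_ + _)        // transformation: lazy, returns immediately
val sample = x.take(20)                 // action: runs a job and blocks the driver
                                        // until the 20 elements are available locally
val b = sc.broadcast(sample)            // local call on the already-computed value
val y = rdd02.map(d => f(b.value, d))   // transformation: lazy again; nothing runs
                                        // until an action is invoked on y
```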
Re: Is there any way to control the parallelism in LogisticRegression
Yes. But you need to store RDD as *serialized* Java objects. See the session of storage level http://spark.apache.org/docs/latest/programming-guide.html Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Sep 4, 2014 at 8:06 PM, Jiusheng Chen chenjiush...@gmail.com wrote: Thanks DB. Did you mean this? spark.rdd.compress true On Thu, Sep 4, 2014 at 2:48 PM, DB Tsai dbt...@dbtsai.com wrote: For saving the memory, I recommend you compress the cached RDD, and it will be couple times smaller than original data sets. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3, 2014 at 9:28 PM, Jiusheng Chen chenjiush...@gmail.com wrote: Thanks DB and Xiangrui. Glad to know you guys are actively working on it. Another thing, did we evaluate the loss of using Float to store values? currently it is Double. Use fewer bits has the benifit of memory footprint reduction. According to Google, they even uses 16 bits (a special encoding scheme called q2.13) http://jmlr.org/proceedings/papers/v28/golovin13.pdf in their learner without measurable loss, and can save 75% memory. On Thu, Sep 4, 2014 at 11:02 AM, DB Tsai dbt...@dbtsai.com wrote: With David's help today, we were able to implement elastic net glm in Spark. It's surprising easy, and with just some modification in breeze's OWLQN code, it just works without further investigation. We did benchmark, and the coefficients are within 0.5% differences compared with R's glmnet package. I guess this is first truly distributed glmnet implementation. Still require some effort to have it in mllib; mostly api cleanup work. 1) I'll submit a PR to breeze which implements weighted regularization in OWLQN. 2) This also depends on https://issues.apache.org/jira/browse/SPARK-2505 which we have internal version requiring some cleanup for open source project. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3, 2014 at 7:34 PM, Xiangrui Meng men...@gmail.com wrote: +DB David (They implemented QWLQN on Spark today.) On Sep 3, 2014 7:18 PM, Jiusheng Chen chenjiush...@gmail.com wrote: Hi Xiangrui, A side-by question about MLLib. It looks current LBFGS in MLLib (version 1.0.2 and even v1.1) only support L2 regurization, the doc explains it: The L1 regularization by using L1Updater http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater will not work since the soft-thresholding logic in L1Updater is designed for gradient descent. Since the algorithm comes from Breeze and I noticed Breeze actually supports L1 (OWLQN http://www.scalanlp.org/api/breeze/#breeze.optimize.OWLQN). Wondering if there is some special considerations that current MLLib didn't support OWLQN? And any plan to add it? Thanks for your time! On Fri, Aug 22, 2014 at 12:56 PM, ZHENG, Xu-dong dong...@gmail.com wrote: Update. I just find a magic parameter *blanceSlack* in *CoalescedRDD*, which sounds could control the locality. The default value is 0.1 (smaller value means lower locality). I change it to 1.0 (full locality) and use #3 approach, then find a lot improvement (20%~40%). Although the Web UI still shows the type of task as 'ANY' and the input is from shuffle read, but the real performance is much better before change this parameter. [image: Inline image 1] I think the benefit includes: 1. 
This approach keeps the physical partition size small but makes each task handle multiple partitions, so the memory requested for deserialization is reduced, which also reduces GC time. That is exactly what we observed in our job. 2. This approach will not hit the 2G limitation, because it does not change the partition size. And I also think Spark might change this default value, or at least expose this parameter to users (*CoalescedRDD* is a private class, and *RDD*.*coalesce* also doesn't have a parameter to control it). On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng men...@gmail.com wrote: Sorry, I missed #2. My suggestion is the same as #2. You need to set a bigger numPartitions to avoid hitting the integer bound or the 2G limitation, at the cost of increased shuffle size per iteration. If you use a CombineInputFormat and then cache, it will try to give you roughly the same size per partition. There will be some remote fetches from HDFS but still cheaper than calling RDD.repartition(). For coalesce without shuffle, I don't know how to set the right number of partitions either ... -Xiangrui On Tue, Aug 12, 2014 at 6:16 AM, ZHENG, Xu-dong dong...@gmail.com wrote: Hi Xiangrui, Thanks for your reply! Yes,
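A hedged sketch of the storage-level advice at the top of this message: cache the training data as serialized Java objects rather than deserialized ones. The HDFS path and the parsing helper are placeholders.

```scala
import org.apache.spark.storage.StorageLevel

// Serialized caching, optionally combined with spark.rdd.compress=true
// (mentioned earlier in the thread), can shrink the cached footprint considerably.
val training = sc.textFile("hdfs:///path/to/training")   // placeholder path
  .map(line => parseLabeledPoint(line))                  // hypothetical parsing helper
  .persist(StorageLevel.MEMORY_ONLY_SER)
```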
Re: Support R in Spark
Hi Kui, sorry about that. That link you mentioned is probably the one for the products. We don't have one pointing from adatao.com to ddf.io; maybe we'll add it. As for access to the code base itself, I think the team has already created a GitHub repo for it, and should open it up within a few weeks. There's some debate about whether to put out the implementation with Shark dependencies now, or SparkSQL with a bit limited functionality and not as well tested. I'll check and ping when this is opened up. The license is Apache. Sent while mobile. Please excuse typos etc. On Sep 6, 2014 1:39 PM, oppokui oppo...@gmail.com wrote: Thanks, Christopher. I saw it before, it is amazing. Last time I try to download it from adatao, but no response after filling the table. How can I download it or its source code? What is the license? Kui On Sep 6, 2014, at 8:08 PM, Christopher Nguyen c...@adatao.com wrote: Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, on which they can apply R and other familiar data mining/machine learning idioms, without having to know about the distributed representation underneath. Now, you can get to the underlying RDDs if you want to, simply by asking for it. This was launched at the July Spark Summit. See http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us . Sent while mobile. Please excuse typos etc. On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will be integrating this with Spark's MLLib. At a high-level this will allow R users to use a familiar API but make use of MLLib's efficient distributed implementation. This is the same strategy used in Python as well. Also we do hope to merge SparkR with mainline Spark -- we have a few features to complete before that and plan to shoot for integration by Spark 1.3. Thanks Shivaram On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote: Thanks, Shivaram. No specific use case yet. We try to use R in our project as data scientest are all knowing R. We had a concern that how R handles the mass data. Spark does a better work on big data area, and Spark ML is focusing on predictive analysis area. Then we are thinking whether we can merge R and Spark together. We tried SparkR and it is pretty easy to use. But we didn’t see any feedback on this package in industry. It will be better if Spark team has R support just like scala/Java/Python. Another question is that MLlib will re-implement all famous data mining algorithms in Spark, then what is the purpose of using R? There is another technique for us H2O which support R natively. H2O is more friendly to data scientist. I saw H2O can also work on Spark (Sparkling Water). It is better than using SparkR? Thanks and Regards. Kui On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hi Do you have a specific use-case where SparkR doesn't work well ? We'd love to hear more about use-cases and features that can be improved with SparkR. 
Thanks Shivaram On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote: Does spark ML team have plan to support R script natively? There is a SparkR project, but not from spark team. Spark ML used netlib-java to talk with native fortran routines or use NumPy, why not try to use R in some sense. R had lot of useful packages. If spark ML team can include R support, it will be a very powerful. Any comment? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: prepending jars to the driver class path for spark-submit on YARN
I ran into the same issue. What I did was use maven shade plugin to shade my version of httpcomponents libraries into another package. On Fri, Sep 5, 2014 at 4:33 PM, Penny Espinoza pesp...@societyconsulting.com wrote: Hey - I’m struggling with some dependency issues with org.apache.httpcomponents httpcore and httpclient when using spark-submit with YARN running Spark 1.0.2 on a Hadoop 2.2 cluster. I’ve seen several posts about this issue, but no resolution. The error message is this: Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:85) at org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:93) at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26) at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96) at com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:155) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:118) at com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:102) at com.amazonaws.services.s3.AmazonS3Client.init(AmazonS3Client.java:332) at com.oncue.rna.realtime.streaming.config.package$.transferManager(package.scala:76) at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry.init(SchemaRegistry.scala:27) at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry$lzycompute(SchemaRegistry.scala:46) at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry(SchemaRegistry.scala:44) at com.oncue.rna.realtime.streaming.coders.KafkaAvroDecoder.init(KafkaAvroDecoder.scala:20) ... 17 more The apache httpcomponents libraries include the method above as of version 4.2. The Spark 1.0.2 binaries seem to include version 4.1. I can get this to work in my driver program by adding exclusions to force use of 4.1, but then I get the error in tasks even when using the —jars option of the spark-submit command. How can I get both the driver program and the individual tasks in my spark-streaming job to use the same version of this library so my job will run all the way through? thanks p
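For anyone following the same route, a hedged sketch of the shade-plugin relocation described above: bundle your own httpcomponents and move it to a private package so it cannot clash with the older classes on the cluster classpath. The plugin version and the shaded package prefix are illustrative choices, not taken from the thread.

```xml
<!-- Relocate org.apache.http classes into a private package at build time. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>myjob.shaded.org.apache.http</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```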