RE: spark slave cannot execute without admin permission on windows
+ dev mailing list

If this is supposed to work, is there a regression then? The Spark core code sets the permission on the file copied to \work to a+x at line 442 of Utils.scala (https://github.com/apache/spark/blob/b271c265b742fa6947522eda4592e9e6a7fd1f3a/core/src/main/scala/org/apache/spark/util/Utils.scala).

The example jar I used had all permissions, including Read and Execute, prior to spark-submit:

[screenshot: jar permissions before spark-submit]

However, after being copied to the worker node's \work folder, only limited permissions remain on the jar, with no execute right:

[screenshot: jar permissions after copy to \work]

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, February 18, 2015 10:40 PM
To: Judy Nash
Cc: u...@spark.apache.org
Subject: Re: spark slave cannot execute without admin permission on windows

You don't need admin permission; just make sure all those jars have execute (read/write) permission.

Thanks
Best Regards

On Thu, Feb 19, 2015 at 11:30 AM, Judy Nash judyn...@exchange.microsoft.com wrote:

Hi,

Is it possible to configure Spark to run without admin permission on Windows?

My current setup runs master and slave successfully with admin permission. However, if I downgrade the permission level from admin to user, SparkPi fails with the following exception on the slave node:

    Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, workernode0.jnashsparkcurr2.d10.internal.cloudapp.net):
    java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi$$anonfun$1
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)

Upon investigation, it appears that the SparkPi jar under spark_home\worker\appname\*.jar does not have the execute permission set, causing Spark to be unable to find the class.

Advice would be very much appreciated.

Thanks,
Judy
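For anyone hitting this in the meantime, a minimal workaround sketch: a hypothetical helper, not part of Spark, that grants read+execute on every jar under the worker's work directory using the stock Windows icacls tool (the C:\spark\work path is an assumption; java.io.File.setExecutable may not affect NTFS ACLs, hence the shell-out):

    import java.io.File

    // Hypothetical workaround: recursively walk the work directory and grant
    // read+execute on any jar, since the execute right appears to be lost
    // during the copy on Windows.
    def grantExecute(dir: File): Unit = {
      val children = Option(dir.listFiles).getOrElse(Array.empty[File])
      children.foreach { f =>
        if (f.isDirectory) grantExecute(f)
        else if (f.getName.endsWith(".jar")) {
          // *S-1-1-0 is the well-known SID for Everyone (locale-independent)
          val cmd = Seq("icacls", f.getAbsolutePath, "/grant", "*S-1-1-0:RX")
          val exit = new ProcessBuilder(cmd: _*).inheritIO().start().waitFor()
          if (exit != 0) println(s"icacls failed ($exit) on ${f.getPath}")
        }
      }
    }

    grantExecute(new File("""C:\spark\work"""))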
Re: Replacing Jetty with TomCat
Hi Sean,

The issue we have here is that all our products are based on a single platform, and we try to keep all our products coherent with that platform as much as possible. Having two web servers in one instance would not be a very elegant solution. That is why we were seeking a way to switch it to Tomcat. As I understand it, that is not readily supported, so we will have to accept it as it is.

If we are not using the Spark UIs, is it possible to disable them and prevent the Jetty server from starting, yet still use the core Spark functionality?

Hi Corey, thank you for your ideas. Our biggest concern here was that it starts a new web server inside Spark; opening up new ports etc. might be seen as a security threat when it comes to commercial distributions.

cheers

On Wed, Feb 18, 2015 at 3:25 PM, Sean Owen so...@cloudera.com wrote:

I do not think it makes sense to make the web server configurable, mostly because there's no real problem in running an HTTP service internally based on Netty while you run your own HTTP service based on something else like Tomcat. What's the problem?

On Wed, Feb 18, 2015 at 3:14 AM, Niranda Perera niranda.per...@gmail.com wrote:

Hi Sean,

The main issue we have is running two web servers in a single product; we think it would not be an elegant solution. Could you please point me to the main areas where the Jetty server is tightly coupled, or extension points where I could plug in Tomcat instead of Jetty? If successful, I could contribute it to the Spark project. :-)

cheers

On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen so...@cloudera.com wrote:

There's no particular reason you have to remove the embedded Jetty server, right? It doesn't prevent you from using Spark inside another app that happens to run in Tomcat. You won't be able to switch it out without rewriting a fair bit of code, no, but you don't need to.

On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera niranda.per...@gmail.com wrote:

Hi,

We are thinking of integrating the Spark server inside a product. Our current product uses Tomcat as its web server. Is it possible to switch the Jetty web server in Spark to Tomcat off-the-shelf?

Cheers

--
Niranda
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
> P.S.: For some reason, replacing import sqlContext.createSchemaRDD with import sqlContext.implicits._ doesn't do the implicit conversions. registerTempTable gives a syntax error. I will dig deeper tomorrow. Has anyone seen this?

We will write up a whole migration guide before the final release, but I can quickly explain this one. We made the implicit conversion significantly less broad to avoid the chance of confusing conflicts. However, you now have to call .toDF in order to turn RDDs into DataFrames.
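To illustrate, a minimal sketch of the 1.3 pattern described above (assuming an existing SparkContext sc; the Person case class and data are made up for the example):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    // In 1.3, this brings in the implicit that adds toDF() to RDDs of
    // case classes, rather than converting them to DataFrames implicitly.
    import sqlContext.implicits._

    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

    // The explicit call is now required:
    val df = people.toDF()
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26").show()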
Re: Hive SKEWED feature supported in Spark SQL ?
> 1) Is SKEWED BY honored? If so, has anyone run into directories not being created?

It is not.

> 2) If it is not honored, does it matter? Hive introduced this feature to better handle joins where tables had a skewed distribution on the keys joined on, so that the single mapper handling one of the keys didn't hold up the whole process. Could that happen in Spark / Spark SQL?

It could matter for very skewed data, though I have not heard many complaints. We could consider adding it in the future if people are having problems with skewed data.
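In the meantime, a common manual mitigation for exactly that hot-key problem is to salt the skewed join key by hand. A minimal sketch (assuming a SparkContext sc; the datasets and salt factor are made up for the example):

    import scala.util.Random
    import org.apache.spark.SparkContext._ // pair-RDD ops (needed on 1.2 and earlier)

    // large side is skewed: almost everything lands on one key
    val large = sc.parallelize(Seq.fill(1000)(("hotKey", 1)) ++ Seq(("rareKey", 2)))
    // small side can be cheaply replicated
    val small = sc.parallelize(Seq(("hotKey", "a"), ("rareKey", "b")))

    val saltFactor = 10

    // Spread each key on the large side across saltFactor sub-keys...
    val saltedLarge = large.map { case (k, v) => ((k, Random.nextInt(saltFactor)), v) }
    // ...and replicate the small side once per sub-key so every pair still meets.
    val saltedSmall = small.flatMap { case (k, v) =>
      (0 until saltFactor).map(i => ((k, i), v))
    }

    // Join on the salted key, then strip the salt.
    val joined = saltedLarge.join(saltedSmall).map { case ((k, _), pair) => (k, pair) }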
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Excellent. Explicit toDF() works.

a) employees.toDF().registerTempTable("Employees") - works
b) Also affects saveAsParquetFile - orders.toDF().saveAsParquetFile

Adding to my earlier tests:

4.0 SQL from Scala and Python
4.1 result = sqlContext.sql("SELECT * from Employees WHERE State = 'WA'") OK
4.2 result = sqlContext.sql("SELECT OrderDetails.OrderID, ShipCountry, UnitPrice, Qty, Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.3 result = sqlContext.sql("SELECT ShipCountry, Sum(OrderDetails.UnitPrice * Qty * Discount) AS ProductSales FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID GROUP BY ShipCountry") OK
4.4 saveAsParquetFile OK
4.5 Read and verify the 4.4 save - sqlContext.parquetFile, registerTempTable, sql OK

Cheers, thanks Michael
k/

On Thu, Feb 19, 2015 at 12:02 PM, Michael Armbrust mich...@databricks.com wrote:

> P.S.: For some reason, replacing import sqlContext.createSchemaRDD with import sqlContext.implicits._ doesn't do the implicit conversions. registerTempTable gives a syntax error. I will dig deeper tomorrow. Has anyone seen this?

We will write up a whole migration guide before the final release, but I can quickly explain this one. We made the implicit conversion significantly less broad to avoid the chance of confusing conflicts. However, you now have to call .toDF in order to turn RDDs into DataFrames.
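For reference, a minimal sketch of the 4.4/4.5 round trip (assuming a SparkContext sc; the Order case class and path are made up for the example):

    import org.apache.spark.sql.SQLContext

    case class Order(orderID: Int, shipCountry: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val orders = sc.parallelize(Seq(Order(1, "USA"), Order(2, "Norway")))

    // 4.4: write out as Parquet (1.3-era API)
    orders.toDF().saveAsParquetFile("/tmp/orders.parquet")

    // 4.5: read it back and verify via registerTempTable + sql
    val restored = sqlContext.parquetFile("/tmp/orders.parquet")
    restored.registerTempTable("Orders")
    sqlContext.sql("SELECT COUNT(*) FROM Orders").show()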
Re: Replacing Jetty with TomCat
To add to Sean's and Reynold's point:

Please correct me if I'm wrong, but Spark depends on hadoop-common, which also uses Jetty in its HttpServer2 code. So even if you removed Jetty from Spark by making it an optional dependency, it would still be pulled in by Hadoop: a program depending on a hypothetical Spark-on-Tomcat would still pull in the Jetty jars.

-Ewan
Hive SKEWED feature supported in Spark SQL ?
I have done some testing of inserting into tables defined in Hive using 1.2, and I can see that the PARTITION clause is honored: data files get created in multiple subdirectories correctly. I also tried the SKEWED BY ... ON ... STORED AS DIRECTORIES clause on CREATE TABLE, but I didn't see subdirectories being created in that case.

1) Is SKEWED BY honored? If so, has anyone run into directories not being created?

2) If it is not honored, does it matter? Hive introduced this feature to better handle joins where tables had a skewed distribution on the keys joined on, so that the single mapper handling one of the keys didn't hold up the whole process. Could that happen in Spark / Spark SQL?

Thanks
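For concreteness, a minimal sketch of the kind of DDL being tested (the table and column names are made up; this assumes a HiveContext, and per the reply earlier in this digest the clause is accepted but not honored by Spark SQL):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Hive DDL: mark a hot key so Hive would split its rows into a
    // dedicated subdirectory instead of mixing them with other keys.
    hiveContext.sql("""
      CREATE TABLE clicks (user_id STRING, url STRING)
      SKEWED BY (user_id) ON ('hot_user')
      STORED AS DIRECTORIES
    """)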
Re: Replacing Jetty with TomCat
Sure, but you are not using Netty at all; it's invisible to you. It's not as if you have to set up and maintain a Jetty container. I don't think your single platform for your apps is relevant.

You can turn off the UI, but as Reynold said, the HTTP servers are also part of the core data transport functionality, and you can't turn that off.

It's not merely unsupported to swap this out for an arbitrary container; it's not clear it would work with Tomcat without re-integrating with its behavior and tuning. But it also shouldn't matter to anyone.
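For the narrower question of disabling the UI while keeping core functionality, a minimal sketch (spark.ui.enabled is the standard switch for the web UI; as noted above, the internal data-transport HTTP servers stay on regardless):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("no-ui-example")
      // Stops the Jetty-based web UI from starting; core execution is unaffected.
      .set("spark.ui.enabled", "false")

    val sc = new SparkContext(conf)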
Have Friedman's glmnet algo running in Spark
Dev List,

A couple of colleagues and I have gotten several versions of the glmnet algorithm coded and running on Spark RDDs. glmnet (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for generating coefficient paths that solve penalized regression with elastic-net penalties. It runs fast by taking an approach that generates solutions for a wide range of penalty parameters.

We're able to integrate it into the MLlib class structure in a couple of different ways. The algorithm may fit better into the new pipeline structure, since it naturally returns a multitude of models (corresponding to different values of the penalty parameters); that appears to fit pipelines better than, for example, MLlib linear regression. We've got regression running with the speed optimizations that Friedman recommends, and we'll start working on the logistic regression version next.

We're eager to make the code available as open source and would like some feedback about how best to do that. Any thoughts?

Mike Bowles.
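For readers unfamiliar with the approach, the speed comes from a coordinate-descent update built on a soft-thresholding operator. A toy sketch of that core update (assuming standardized features and the plain Gaussian case from Friedman et al. 2010; this is an illustration, not the authors' actual code):

    // Soft-threshold operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)
    def softThreshold(z: Double, gamma: Double): Double =
      math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

    // One coordinate-descent update for feature j under the elastic-net
    // penalty (lambda = overall strength, alpha = L1/L2 mix). With
    // standardized x(_, j), the denominator simplifies to 1 + lambda*(1-alpha).
    def updateCoefficient(
        partialResidualDotXj: Double, // (1/n) * sum_i x_ij * r_i^(j)
        lambda: Double,
        alpha: Double): Double =
      softThreshold(partialResidualDotXj, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))

Sweeping this update over the features, for a decreasing sequence of lambda values warm-started from the previous solution, is what produces the coefficient path cheaply.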
Spark SQL, Hive Parquet data types
Still trying to get my head around Spark SQL and Hive.

1) Let's assume I *only* use Spark SQL to create and insert data into Hive tables, declared in a Hive metastore. Does it matter at all whether Hive supports the data types I need with Parquet, or is all that matters what Catalyst / Spark's Parquet relation supports?

Case in point: timestamps in Parquet.
* Parquet now supports them, per https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them in 0.14

So would I be able to read/write timestamps natively in Spark 1.2? Spark 1.3?

I have found this thread, http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html, which seems to indicate that the data types supported by Hive do matter to Spark SQL. If so, why is that? Doesn't the read path go through Spark SQL to read the Parquet file?

2) Is there planned support for Hive 0.14?

Thanks
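A quick way to probe this empirically, as a minimal sketch (assuming the 1.2-era createSchemaRDD implicit and a SparkContext sc; the path and case class are made up, and whether the write succeeds is exactly the question being asked):

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    case class Event(id: Int, ts: Timestamp)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // 1.2: implicit RDD -> SchemaRDD conversion

    val events = sc.parallelize(Seq(Event(1, new Timestamp(System.currentTimeMillis))))

    // If TimestampType isn't supported on the Parquet write path,
    // this is where it should fail.
    events.saveAsParquetFile("/tmp/events.parquet")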
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
+1 (non-binding)

Tested Mesos coarse- and fine-grained mode on a 4-node Mesos cluster with a simple shuffle/map task. Will be testing with a more complete suite (i.e. spark-perf) once the infrastructure is set up to do so.

Tim
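For reference, a minimal sketch of switching between the two Mesos modes being tested (the master URL is made up; spark.mesos.coarse is the 1.x-era switch):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mesos-smoke-test")
      .setMaster("mesos://zk://host1:2181/mesos") // made-up ZooKeeper URL
      // true = coarse-grained (one long-lived Mesos task per node);
      // false = fine-grained (one Mesos task per Spark task).
      .set("spark.mesos.coarse", "true")

    val sc = new SparkContext(conf)
    // simple shuffle/map smoke test, as in the vote
    sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _).count()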
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
+1 (non-binding)

- Verified signatures using [1]
- Built on Mac OS X Yosemite
- Built on Fedora 21

Each build was run against the Hadoop 2.4 version with the yarn, hive, and hive-thriftserver profiles. I am having trouble getting all the tests passing in a single run on both machines, but we have this same problem on other projects as well.

[1] https://github.com/cjnolet/nexus-staging-gpg-verify

On Wed, Feb 18, 2015 at 6:25 PM, Sean Owen so...@cloudera.com wrote:

> On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell pwend...@gmail.com wrote:
>> Patrick, this link gives a 404: https://people.apache.org/keys/committer/pwendell.asc
> Works for me. Maybe it's some ephemeral issue?

Yes, it works now; I swear it didn't before! That's all set now. The signing key is in that file.
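For anyone reproducing the build, a sketch of the 1.3-era Maven invocation for that profile combination (per the Spark building docs; exact flags may vary by environment):

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package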