Building Spark failed with Maven
Hi all,

I got an ERROR when building the Spark master branch with Maven (commit: 2d1e916730492f5d61b97da6c483d3223ca44315):

    [INFO]
    [INFO]
    [INFO] Building Spark Project Catalyst 1.3.0-SNAPSHOT
    [INFO]
    [INFO]
    [INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @ spark-catalyst_2.10 ---
    [INFO]
    [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ spark-catalyst_2.10 ---
    [INFO] Source directory: /Users/tianyi/github/community/apache-spark/sql/catalyst/src/main/scala added.
    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-catalyst_2.10 ---
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ spark-catalyst_2.10 ---
    [INFO] Using 'UTF-8' encoding to copy filtered resources.
    [INFO] skip non existing resourceDirectory /Users/tianyi/github/community/apache-spark/sql/catalyst/src/main/resources
    [INFO] Copying 3 resources
    [INFO]
    [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-catalyst_2.10 ---
    [INFO] Using zinc server for incremental compilation
    [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
    [info] Compiling 69 Scala sources and 3 Java sources to /Users/tianyi/github/community/apache-spark/sql/catalyst/target/scala-2.10/classes...
    [error] /Users/tianyi/github/community/apache-spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:314: polymorphic expression cannot be instantiated to expected type;
    [error]  found   : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
    [error]  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
    [error]   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)

Any suggestions?
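In case it helps others who hit the same message: errors of the form "polymorphic expression cannot be instantiated to expected type" usually mean scalac could not pin the type parameter of a polymorphic apply call. Restarting the zinc server and doing a clean build is worth trying first, since stale incremental-compilation state can also produce this. If the inference failure is real, one possible workaround (an assumption on my part, not a confirmed fix) is to instantiate the type parameter explicitly:

    // Hypothetical workaround sketch: pass T to apply explicitly so the
    // polymorphic ScalaUdfBuilder.apply is instantiated before the implicit
    // conversion has to produce a ScalaUdfBuilder[T].
    implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] =
      ScalaUdfBuilder[T](func)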
Is there any way to support multiple users executing SQL on one thrift server?
Is there any way to support multiple users executing SQL on one thrift server? I think there are some problems in Spark 1.2.0. For example:

1. Start the thrift server as user A.
2. Connect to the thrift server via beeline as user B.
3. Execute "insert into table dest select ... from table src".

Then we found these items on HDFS:

    drwxr-xr-x   - B supergroup          0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
    drwxr-xr-x   - B supergroup          0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
    drwxr-xr-x   - B supergroup          0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
    drwxr-xr-x   - A supergroup          0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
    drwxr-xr-x   - A supergroup          0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
    -rw-r--r--   3 A supergroup       2671 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0

You can see that all the temporary paths created on the driver side (the thrift server side) are owned by user B, which is what we expected. But all the output data created on the executor side is owned by user A, which is NOT what we expected. The wrong owner on the output data causes an org.apache.hadoop.security.AccessControlException while the driver side moves the output data into the dest table.

Does anyone know how to resolve this problem?
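For what it's worth, the asymmetry looks consistent with impersonation happening only on the driver: the session wraps its filesystem calls in a proxy user, while executor tasks write as whichever OS user launched the executor processes. Below is a minimal sketch of the Hadoop proxy-user API involved, only to illustrate the mechanism; the user names are the A/B from the example, the path is hypothetical, and it assumes the cluster's hadoop.proxyuser.* settings allow the impersonation:

    import java.security.PrivilegedExceptionAction

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.security.UserGroupInformation

    // Driver side: operations wrapped like this create paths owned by B.
    val ugi = UserGroupInformation.createProxyUser("B", UserGroupInformation.getLoginUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(new Configuration())
        fs.mkdirs(new Path("/tmp/hadoop/example"))  // hypothetical path, created as B
      }
    })

    // Executor side: task output is written with no such wrapper, so the
    // files end up owned by the user who started the executors (A).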
[SPARK-5100][SQL] Spark Thrift server monitor page
Hi all,

I have created a JIRA ticket about adding a monitor page for the Thrift server: https://issues.apache.org/jira/browse/SPARK-5100

Could anyone review the design doc and give some advice?
Is there any document to explain how to build the hive jars for spark?
Hi all,

We found some bugs in hive-0.12, but we could not wait for the Hive community to fix them. We want to fix these bugs in our lab and build a new release that can be recognized by Spark. As we know, Spark depends on a special release of Hive, like:

    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
    </dependency>

The difference between org.spark-project.hive and org.apache.hive was described by Patrick:

    There are two differences:
    1. We publish hive with a shaded protobuf dependency to avoid conflicts
       with some Hadoop versions.
    2. We publish a proper hive-exec jar that only includes hive packages.
       The upstream version of hive-exec bundles a bunch of other random
       dependencies in it which makes it really hard for third-party
       projects to use it.

Is there any document to guide us in building the Hive jars for Spark? Any help would be greatly appreciated.
Re: How to use multi thread in RDD map function ?
Hi myasuka,

Have you checked the JVM GC time of each executor? I think you should increase SPARK_EXECUTOR_CORES or SPARK_EXECUTOR_INSTANCES until you get enough concurrency. Here is my recommended config:

    SPARK_EXECUTOR_CORES=8
    SPARK_EXECUTOR_INSTANCES=4
    SPARK_WORKER_MEMORY=8G

Note: make sure you have enough memory on each node, more than SPARK_EXECUTOR_INSTANCES * SPARK_WORKER_MEMORY.

Best Regards,
Yi Tian
tianyi.asiai...@gmail.com

On Sep 29, 2014, at 21:06, myasuka <myas...@live.com> wrote:

> Our cluster is a standalone cluster with 16 computing nodes; each node has 16 cores. I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to 32, and we give 512 tasks all together; this setup helps increase the concurrency. But if I set SPARK_WORKER_INSTANCES to 2 and SPARK_WORKER_CORES to 16, it doesn't work well.
>
> Thank you for your reply.
>
> Yi Tian wrote:
>> for yarn-client mode:
>> SPARK_EXECUTOR_CORES * SPARK_EXECUTOR_INSTANCES = 2 (or 3) * TotalCoresOnYourCluster
>>
>> for standalone mode:
>> SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES = 2 (or 3) * TotalCoresOnYourCluster
>>
>> Best Regards,
>> Yi Tian
>> tianyi.asiainfo@
>>
>> On Sep 28, 2014, at 17:59, myasuka <myasuka@> wrote:
>>
>>> Hi everyone,
>>>
>>> I have come across a problem with increasing the concurrency. In the program, after the shuffle write, each node should fetch 16 pairs of matrices to do matrix multiplication, such as:
>>>
>>>     import breeze.linalg.{DenseMatrix => BDM}
>>>
>>>     pairs.map(t => {
>>>       val b1 = t._2._1.asInstanceOf[BDM[Double]]
>>>       val b2 = t._2._2.asInstanceOf[BDM[Double]]
>>>       val c = (b1 * b2).asInstanceOf[BDM[Double]]
>>>       (new BlockID(t._1.row, t._1.column), c)
>>>     })
>>>
>>> Each node has 16 cores. However, no matter whether I set 16 tasks or more on each node, CPU utilization cannot get higher than 60%, which means not every core on the node is computing. Then I checked the running log on the WebUI: according to the amount of shuffle read and write in each task, some tasks do one matrix multiplication, some do two, and some do none.
>>>
>>> Thus, I thought of using Java multi-threading to increase the concurrency. I wrote a program in Scala that calls Java multi-threading without Spark on a single node; watching the 'top' monitor, I found this program can drive CPU usage up to 1500% (meaning nearly every core is computing). But I have no idea how to use Java multi-threading inside an RDD transformation.
>>>
>>> Can anyone provide some example code for using Java multi-threading in an RDD transformation, or give any idea for increasing the concurrency? Thanks for all.
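P.S. To make the "multiple threads inside one task" idea concrete, here is a minimal sketch using a fixed thread pool inside mapPartitions, so a single Spark task can run several matrix multiplications concurrently. It assumes the breeze matrices and the BlockID key from the code quoted above, and the pool size (4) is an arbitrary value to tune:

    import java.util.concurrent.{Callable, Executors}
    import breeze.linalg.{DenseMatrix => BDM}

    pairs.mapPartitions { iter =>
      val pool = Executors.newFixedThreadPool(4)  // threads per task: tune this
      // Submit every multiplication in the partition to the pool. The toList
      // forces all submissions up front, instead of lazily one at a time.
      val futures = iter.map { t =>
        pool.submit(new Callable[(BlockID, BDM[Double])] {
          override def call(): (BlockID, BDM[Double]) = {
            val b1 = t._2._1.asInstanceOf[BDM[Double]]
            val b2 = t._2._2.asInstanceOf[BDM[Double]]
            (new BlockID(t._1.row, t._1.column), b1 * b2)
          }
        })
      }.toList
      val results = futures.map(_.get())  // wait for all multiplications
      pool.shutdown()
      results.iterator
    }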
Re: Question about SparkSQL and Hive-on-Spark
Hi Reynold!

Will SparkSQL strictly obey the HQL syntax, for example the cube function? In other words, should the hiveContext of SparkSQL implement only a subset of the HQL features?

Best Regards,
Yi Tian
tianyi.asiai...@gmail.com

On Sep 23, 2014, at 15:49, Reynold Xin <r...@databricks.com> wrote:

On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian <tianyi.asiai...@gmail.com> wrote:

> Hi all, I have some questions about SparkSQL and Hive-on-Spark. Will SparkSQL support all the Hive features in the future? Or just make Hive a data source for Spark?

Most likely not *ALL* Hive features, but almost all common features.

> From Spark 1.1.0 we have thrift-server support for running HQL on Spark. Will this feature be replaced by Hive on Spark?

No.

> The reason for asking these questions is that we found some Hive functions do not run well on SparkSQL (like the window, cube, and rollup functions). Is it worth making the effort to implement these functions in SparkSQL? Could you give some advice?

Yes, absolutely.
Question about SparkSQL and Hive-on-Spark
Hi all,

I have some questions about SparkSQL and Hive-on-Spark.

Will SparkSQL support all the Hive features in the future? Or just make Hive a data source for Spark?

From Spark 1.1.0 we have thrift-server support for running HQL on Spark. Will this feature be replaced by Hive on Spark?

The reason for asking these questions is that we found some Hive functions do not run well on SparkSQL (like the window, cube, and rollup functions). Is it worth making the effort to implement these functions in SparkSQL? Could you give some advice?

Thank you.

Best Regards,
Yi Tian
tianyi.asiai...@gmail.com
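P.S. For concreteness, a sketch of the kind of queries that do not run for us today (the table and column names are made up for illustration; hql is the Spark 1.1-era HiveContext entry point):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

    // GROUP BY ... WITH CUBE / WITH ROLLUP and window functions are the
    // HQL features in question.
    hiveContext.hql("SELECT dept, role, COUNT(*) FROM emp GROUP BY dept, role WITH CUBE")
    hiveContext.hql("SELECT dept, SUM(salary) OVER (PARTITION BY dept) FROM emp")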
Re: Question about SparkSQL and Hive-on-Spark
Hi Will,

We are planning to start implementing these functions, and we hope to have a general design ready in the following week.

Best Regards,
Yi Tian
tianyi.asiai...@gmail.com

On Sep 23, 2014, at 23:39, Will Benton <wi...@redhat.com> wrote:

Hi Yi,

I've had some interest in implementing windowing and rollup in particular for some of my applications but haven't had them on the front of my plate yet. If you need them as well, I'm happy to start taking a look this week.

best,
wb
Re: [SPARK-3324] make yarn module as a unified maven jar project
Hi Sean,

Before compile time, Maven can dynamically add either the stable or the alpha sources to the yarn project, so there is no incompatibility at compile time. Here is an example from yarn/pom.xml:

    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>build-helper-maven-plugin</artifactId>
      <executions>
        <execution>
          <id>add-scala-sources</id>
          <phase>generate-sources</phase>
          <goals>
            <goal>add-source</goal>
          </goals>
          <configuration>
            <sources>
              <source>common/src/main/scala</source>
              <source>${yarn.api}/src/main/scala</source>
            </sources>
          </configuration>
        </execution>
      </executions>
    </plugin>

On Aug 31, 2014, at 16:19, Sean Owen <so...@cloudera.com> wrote:

This isn't possible since the two versions of YARN are mutually incompatible at compile time. However, see my comments about how this could be restructured to be a little more standard, so that IntelliJ would parse it out of the box. Still, I imagine it is not worth it if YARN alpha will go away at some point and IntelliJ can easily be told where the extra src/ is.
[SPARK-3324] make yarn module as a unified maven jar project
Hi everyone!

I found the YARN module has a nonstandard path structure, like:

    ${SPARK_HOME}
    |--yarn
       |--alpha   (contains yarn api support for 0.23 and 2.0.x)
       |--stable  (contains yarn api support for 2.2 and later)
       |  |--pom.xml (spark-yarn)
       |--common  (common code not depending on a specific version of Hadoop)
       |--pom.xml (yarn-parent)

When we use Maven to compile the yarn module, Maven imports the 'alpha' or 'stable' module according to the profile setting, and a submodule like 'stable' uses the build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ can't directly recognize the sources in the common directory as a source path.

I think we should change the yarn module into a unified Maven jar project and select the different versions of the yarn api via Maven profile settings. I have created a JIRA ticket: https://issues.apache.org/jira/browse/SPARK-3324

Any advice will be appreciated.
Re: Compile error with XML elements
Hi Devl!

I got the same problem. You can try upgrading your Scala plugin to 0.41.2; that works on my Mac.

On Aug 12, 2014, at 15:19, Devl Devel <devl.developm...@gmail.com> wrote:

When compiling the master checkout of Spark, the IntelliJ compile fails with:

    Error:(45, 8) not found: value $scope
        <div class="row-fluid">
           ^

which is caused by HTML elements in classes like HistoryPage.scala:

    val content =
      <div class="row-fluid">
        <div class="span12">
        ...

How can I compile these classes that have HTML node elements in them? Thanks in advance.
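P.S. For anyone wondering where $scope comes from: it is a synthetic value that scalac references when desugaring an XML literal (it names the enclosing namespace binding), so an IDE plugin that doesn't model the desugaring reports it as not found. Roughly, and much simplified, the literal corresponds to something like:

    import scala.xml.{Elem, Null, TopScope, UnprefixedAttribute}

    // Approximately what scalac generates for <div class="row-fluid"/>.
    // TopScope is what the synthetic $scope resolves to at the top level.
    val desugared: Elem = Elem(
      null, "div",
      new UnprefixedAttribute("class", "row-fluid", Null),
      TopScope,
      minimizeEmpty = true)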