Build Spark failed with Maven

2015-02-10 Thread Yi Tian

Hi, all

I got an error when building the Spark master branch with Maven (commit:
2d1e916730492f5d61b97da6c483d3223ca44315):


[INFO]
[INFO] 
[INFO] Building Spark Project Catalyst 1.3.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @ 
spark-catalyst_2.10 ---
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ 
spark-catalyst_2.10 ---
[INFO] Source directory: 
/Users/tianyi/github/community/apache-spark/sql/catalyst/src/main/scala added.
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
spark-catalyst_2.10 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
spark-catalyst_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
/Users/tianyi/github/community/apache-spark/sql/catalyst/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ 
spark-catalyst_2.10 ---
[INFO] Using zinc server for incremental compilation
[INFO] compiler plugin: 
BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[info] Compiling 69 Scala sources and 3 Java sources to 
/Users/tianyi/github/community/apache-spark/sql/catalyst/target/scala-2.10/classes...
[error] 
/Users/tianyi/github/community/apache-spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:314:
 polymorphic expression cannot be instantiated to expected type;
[error]  found   : [T(in method 
apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
[error]  required: 
org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method 
functionToUdfBuilder)]
[error]   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): 
ScalaUdfBuilder[T] = ScalaUdfBuilder(func)

Any suggestions?


Is there any way to support multiple users executing SQL on thrift server?

2015-01-19 Thread Yi Tian
Is there any way to support multiple users executing SQL on a single Thrift
server?


I think there are some problems with Spark 1.2.0. For example:

1. Start thrift server with user A
2. Connect to thrift server via beeline with user B
3. Execute “insert into table dest select … from table src”

Then we found these items on HDFS:

drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
drwxr-xr-x   - A supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
drwxr-xr-x   - A supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
-rw-r--r--   3 A supergroup   2671 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0

You can see that all the temporary paths created on the driver side (the Thrift 
server side) are owned by user B, which is what we expected.


But all the output data created on the executor side is owned by user A, 
which is NOT what we expected.
The wrong owner of the output data causes an 
org.apache.hadoop.security.AccessControlException when the driver 
side moves the output data into the dest table.
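
What we expected is that the executor-side writes also happen on behalf of the 
connected user. Below is a minimal sketch of Hadoop's proxy-user mechanism that 
would produce that ownership. It is only an illustration (the path is made up, 
and this is apparently not what the Thrift server does today):

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserWriteSketch {
  def main(args: Array[String]): Unit = {
    // The service user that launched the process (user A in the listing above).
    val serviceUser = UserGroupInformation.getCurrentUser
    // Impersonate the connected beeline user (user B); this requires the
    // hadoop.proxyuser.* settings for user A on the cluster.
    val proxyUser = UserGroupInformation.createProxyUser("B", serviceUser)

    proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(new Configuration())
        // Anything created inside doAs is owned by user B on HDFS.
        fs.create(new Path("/tmp/example/output/data.txt")).close()
      }
    })
  }
}

The executor-side tasks apparently write as the user who launched the Thrift 
server instead, which is why the ownership differs between the driver side and 
the executor side.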


Does anyone know how to resolve this problem?


[SPARK-5100][SQL] Spark Thrift server monitor page

2015-01-06 Thread Yi Tian

Hi, all

I have created a JIRA ticket about adding a monitor page for the Thrift server.

https://issues.apache.org/jira/browse/SPARK-5100

Could anyone review the design doc and give some advice?




Is there any document to explain how to build the hive jars for spark?

2014-12-11 Thread Yi Tian

Hi, all

We found some bugs in Hive 0.12, but we could not wait for the Hive 
community to fix them.


We want to fix these bugs in our lab and build a new release which can 
be recognized by Spark.


As we know, Spark depends on a special release of Hive, like:

<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
</dependency>

The difference between org.spark-project.hive and org.apache.hive was 
described by Patrick:


There are two differences:

1. We publish hive with a shaded protobuf dependency to avoid
conflicts with some Hadoop versions.
2. We publish a proper hive-exec jar that only includes hive packages.
The upstream version of hive-exec bundles a bunch of other random
dependencies in it which makes it really hard for third-party projects
to use it.

Is there any document to guide us on how to build the Hive jars for Spark?

Any help would be greatly appreciated.


Re: How to use multi thread in RDD map function ?

2014-09-29 Thread Yi Tian
Hi, myasuka

Have you checked the JVM GC time of each executor?

I think you should increase SPARK_EXECUTOR_CORES or 
SPARK_EXECUTOR_INSTANCES until you get enough concurrency.

Here is my recommended config:

SPARK_EXECUTOR_CORES=8
SPARK_EXECUTOR_INSTANCES=4
SPARK_WORKER_MEMORY=8G

Note: make sure you have enough memory on each node, more than 
SPARK_EXECUTOR_INSTANCES * SPARK_WORKER_MEMORY.
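
If you really need thread-level parallelism inside a single task, one option is 
to run your own thread pool inside mapPartitions. This is only a rough sketch of 
the pattern (the key type is simplified and none of the names are from your 
program):

import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.rdd.RDD

object LocalThreadsSketch {
  // Multiply each pair of matrices in a partition using a local thread pool,
  // so that a single Spark task can keep several cores busy at once.
  def multiply(pairs: RDD[(Int, (BDM[Double], BDM[Double]))],
               threadsPerTask: Int): RDD[(Int, BDM[Double])] = {
    pairs.mapPartitions { iter =>
      val pool = Executors.newFixedThreadPool(threadsPerTask)
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      // Launch one future per matrix pair, then wait for all of them.
      val futures = iter.map { case (key, (b1, b2)) =>
        Future { (key, b1 * b2) }
      }.toList
      val results = futures.map(f => Await.result(f, Duration.Inf))
      pool.shutdown()
      results.iterator
    }
  }
}

In most cases, though, simply running more tasks per node with the executor 
settings above is simpler and gives the same overall core utilization.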

Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Sep 29, 2014, at 21:06, myasuka myas...@live.com wrote:

 Our cluster is a standalone cluster with 16 computing nodes; each node has 16
 cores. I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to 32, and
 we give 512 tasks all together; this situation helps increase the
 concurrency. But if I set SPARK_WORKER_INSTANCES to 2 and SPARK_WORKER_CORES
 to 16, this doesn't work well.
 
 Thank you for your reply.
 
 
 Yi Tian wrote
 for yarn-client mode:
 
 SPARK_EXECUTOR_CORES * SPARK_EXECUTOR_INSTANCES = 2(or 3) *
 TotalCoresOnYourCluster
 
 for standalone mode:
 
 SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES = 2(or 3) *
 TotalCoresOnYourCluster
 
 
 
 Best Regards,
 
 Yi Tian
 
 tianyi.asiainfo@
 
 
 
 
 
 On Sep 28, 2014, at 17:59, myasuka myasuka@ wrote:
 
 Hi, everyone
   I came across a problem with increasing the concurrency. In a
 program, after the shuffle write, each node should fetch 16 pairs of matrices
 to do matrix multiplication, such as:
 
 import breeze.linalg.{DenseMatrix => BDM}
 
 pairs.map(t => {
   val b1 = t._2._1.asInstanceOf[BDM[Double]]
   val b2 = t._2._2.asInstanceOf[BDM[Double]]
 
   val c = (b1 * b2).asInstanceOf[BDM[Double]]
 
   (new BlockID(t._1.row, t._1.column), c)
 })
 
   Each node has 16 cores. However, no matter whether I set 16 tasks or more
 on each node, the concurrency cannot get higher than 60%, which means not
 every core on the node is computing. Then I checked the running log on the
 WebUI; according to the amount of shuffle read and write in every task, I see
 some tasks do one matrix multiplication, some do two, while some do none.
 
   Thus, I thought of using Java multi-threading to increase the concurrency.
 I wrote a program in Scala which uses Java multi-threading without Spark on a
 single node; by watching the 'top' monitor, I found this program can use up
 to 1500% CPU (meaning nearly every core is computing). But I have no idea how
 to use Java multi-threading in an RDD transformation.
 
   Can anyone provide some example code for using Java multi-threading in an
 RDD transformation, or give any ideas to increase the concurrency?
 
 Thanks for all
 
 
 
 
 
 



Re: Question about SparkSQL and Hive-on-Spark

2014-09-24 Thread Yi Tian
Hi Reynold!

Will SparkSQL strictly obey the HQL syntax?

For example, the cube function.

In other words, should the hiveContext of SparkSQL only implement a subset of 
HQL features?


Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Sep 23, 2014, at 15:49, Reynold Xin r...@databricks.com wrote:

 
 On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian tianyi.asiai...@gmail.com wrote:
 Hi all,
 
 I have some questions about the SparkSQL and Hive-on-Spark
 
 Will SparkSQL support all the Hive features in the future, or just make Hive
 a data source of Spark?
 
 Most likely not *ALL* Hive features, but almost all common features.
  
 
 Since Spark 1.1.0, we have Thrift server support for running HQL on Spark.
 Will this feature be replaced by Hive on Spark?
 
 No.
  
 
 The reason for asking these questions is that we found some Hive functions
 are not running well on SparkSQL (like the window, cube, and rollup
 functions).
 
 Is it worth making the effort to implement these functions in SparkSQL?
 Could you guys give some advice?
 
 Yes absolutely.
  
 
 thank you.
 
 
 Best Regards,
 
 Yi Tian
 tianyi.asiai...@gmail.com
 
 
 
 
 
 
 



Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Yi Tian
Hi all,

I have some questions about the SparkSQL and Hive-on-Spark

Will SparkSQL support all the Hive features in the future, or just make Hive 
a data source of Spark?

Since Spark 1.1.0, we have Thrift server support for running HQL on Spark. Will 
this feature be replaced by Hive on Spark?

The reason for asking these questions is that we found some Hive functions are 
not running well on SparkSQL (like the window, cube, and rollup functions).
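
For example, here is a minimal sketch of the kind of HiveQL we would like to run 
through the hiveContext (the table and column names are made up for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object RollupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "RollupSketch")
    val hiveContext = new HiveContext(sc)

    // Standard HiveQL rollup/cube aggregations; queries in this style are
    // what we found do not run well on SparkSQL today.
    val rollup = hiveContext.sql(
      "SELECT region, product, SUM(amount) FROM sales GROUP BY region, product WITH ROLLUP")
    val cube = hiveContext.sql(
      "SELECT region, product, SUM(amount) FROM sales GROUP BY region, product WITH CUBE")

    rollup.collect().foreach(println)
    cube.collect().foreach(println)
  }
}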

Is it worth making the effort to implement these functions in SparkSQL? Could 
you guys give some advice?

thank you.


Best Regards,

Yi Tian
tianyi.asiai...@gmail.com








Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Yi Tian
Hi, Will

We are planning to start implementing these functions.

We hope we can produce a general design in the following week.



Best Regards,

Yi Tian
tianyi.asiai...@gmail.com




On Sep 23, 2014, at 23:39, Will Benton wi...@redhat.com wrote:

 Hi Yi,
 
 I've had some interest in implementing windowing and rollup in particular for 
 some of my applications but haven't had them on the front of my plate yet.  
 If you need them as well, I'm happy to start taking a look this week.
 
 
 best,
 wb
 
 
 - Original Message -
 From: Yi Tian tianyi.asiai...@gmail.com
 To: dev@spark.apache.org
 Sent: Tuesday, September 23, 2014 2:47:17 AM
 Subject: Question about SparkSQL and Hive-on-Spark
 
 Hi all,
 
 I have some questions about the SparkSQL and Hive-on-Spark
 
 Will SparkSQL support all the Hive features in the future, or just make Hive
 a data source of Spark?
 
 Since Spark 1.1.0, we have Thrift server support for running HQL on Spark. Will
 this feature be replaced by Hive on Spark?
 
 The reason for asking these questions is that we found some Hive functions
 are not running well on SparkSQL (like the window, cube, and rollup
 functions).
 
 Is it worth making the effort to implement these functions in SparkSQL?
 Could you guys give some advice?
 
 thank you.
 
 
 Best Regards,
 
 Yi Tian
 tianyi.asiai...@gmail.com
 
 
 
 
 
 
 





Re: [SPARK-3324] make yarn module as a unified maven jar project

2014-08-31 Thread Yi Tian
Hi Sean

Before compile time, Maven can dynamically add either the stable or the alpha 
sources to the yarn/ project.

So there is no incompatibility at compile time.

Here is an example:

yarn/pom.xml

  <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>build-helper-maven-plugin</artifactId>
    <executions>
      <execution>
        <id>add-scala-sources</id>
        <phase>generate-sources</phase>
        <goals>
          <goal>add-source</goal>
        </goals>
        <configuration>
          <sources>
            <source>common/src/main/scala</source>
            <source>${yarn.api}/src/main/scala</source>
          </sources>
        </configuration>
      </execution>
    </executions>
  </plugin>


On Aug 31, 2014, at 16:19, Sean Owen so...@cloudera.com wrote:

 This isn't possible since the two versions of YARN are mutually
 incompatible at compile-time. However see my comments about how this
 could be restructured to be a little more standard, and so that
 IntelliJ would parse it out of the box.
 
 Still I imagine it is not worth it if YARN alpha will go away at some
 point and IntelliJ can easily be told where the extra src/ is.
 
 On Sun, Aug 31, 2014 at 3:38 AM, Yi Tian tianyi.asiai...@gmail.com wrote:
 Hi everyone!
 
 I found the YARN module has a nonstandard path structure, like:
 
 ${SPARK_HOME}
  |--yarn
 |--alpha (contains yarn api support for 0.23 and 2.0.x)
 |--stable (contains yarn api support for 2.2 and later)
 | |--pom.xml (spark-yarn)
 |--common (Common codes not depending on specific version of Hadoop)
 |--pom.xml (yarn-parent)
 
 When we use Maven to compile the yarn module, Maven will import the 'alpha' or 
 'stable' module according to the profile setting.
 And a submodule like 'stable' uses the build property defined in 
 yarn/pom.xml to import the common code into its source path.
 This makes IntelliJ unable to directly recognize the sources in the common 
 directory as a source path.
 
 I think we should change the yarn module into a unified Maven jar project,
 and specify different versions of the YARN API via Maven profile settings.
 
 I created a JIRA ticket: https://issues.apache.org/jira/browse/SPARK-3324
 
 Any advice will be appreciated.
 
 
 
 





[SPARK-3324] make yarn module as a unified maven jar project

2014-08-30 Thread Yi Tian
Hi everyone!

I found the YARN module has a nonstandard path structure, like:

${SPARK_HOME}
  |--yarn
 |--alpha (contains yarn api support for 0.23 and 2.0.x)
 |--stable (contains yarn api support for 2.2 and later)
 | |--pom.xml (spark-yarn)
 |--common (Common codes not depending on specific version of Hadoop)
 |--pom.xml (yarn-parent)

When we use Maven to compile the yarn module, Maven will import the 'alpha' or 
'stable' module according to the profile setting.
And a submodule like 'stable' uses the build property defined in yarn/pom.xml 
to import the common code into its source path.
This makes IntelliJ unable to directly recognize the sources in the common 
directory as a source path.

I think we should change the yarn module into a unified Maven jar project, 
and specify different versions of the YARN API via Maven profile settings.

I created a JIRA ticket: https://issues.apache.org/jira/browse/SPARK-3324

Any advice will be appreciated.






Re: Compile error with XML elements

2014-08-29 Thread Yi Tian
Hi, Devl!

I got the same problem.

You can try upgrading your Scala plugin to 0.41.2.

It works on my Mac.
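
For reference, here is a small self-contained sketch of the kind of XML literal 
those Spark UI classes use (the object and values are made up). The $scope in the 
error message is a name that comes from the compiler's desugaring of XML literals, 
which is presumably why an outdated IDE plugin fails to resolve it even though 
scalac itself compiles the code:

import scala.xml.Node

object XmlLiteralSketch {
  // Build an HTML fragment with a Scala XML literal, in the same style as
  // HistoryPage.scala.
  def content(title: String): Node =
    <div class="row-fluid">
      <div class="span12">
        <h3>{title}</h3>
      </div>
    </div>

  def main(args: Array[String]): Unit =
    println(content("History"))
}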

On Aug 12, 2014, at 15:19, Devl Devel devl.developm...@gmail.com wrote:

 When compiling the master checkout of Spark, the IntelliJ compile fails
 with:
 
 Error:(45, 8) not found: value $scope
   <div class="row-fluid">
    ^
 which is caused by HTML elements in classes like HistoryPage.scala:
 
 val content =
   <div class="row-fluid">
     <div class="span12">...
 
 How can I compile these classes that have HTML node elements in them?
 
 Thanks in advance.

