Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks all for the quick response. Thanks. Zhan Zhang On Mar 26, 2015, at 3:14 PM, Patrick Wendell pwend...@gmail.com wrote: I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main
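
For reference, a minimal Scala sketch of that API (assuming a pair RDD that is already hash-partitioned; names and values are illustrative):

  import org.apache.spark.HashPartitioner
  val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
  // Transform only the values, then tell Spark the partitioning still holds,
  // so downstream joins/reduces on the same key avoid a re-shuffle.
  val mapped = pairs.mapPartitions(
    iter => iter.map { case (k, v) => (k, v.toUpperCase) },
    preservesPartitioning = true)
  // mapped.partitioner is still Some(HashPartitioner)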

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver: -XX:MaxPermSize=512m. Thanks. Zhan Zhang On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am facing the same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
You can do it in $SPARK_HOME/conf/spark-defaults.conf: spark.driver.extraJavaOptions -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Where and how do I pass this or other JVM arguments? -XX:MaxPermSize
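
The same option can also be passed at submit time instead of through the conf file (a sketch; the property name is standard Spark, the value is illustrative):

  ./bin/spark-submit --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" ...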

Re: Spark-thriftserver Issue

2015-03-24 Thread Zhan Zhang
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal anubha...@gmail.com
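
A minimal sketch of those settings (the paths are illustrative):

  # in $SPARK_HOME/conf/spark-env.sh
  export SPARK_LOG_DIR=/var/log/spark
  export SPARK_PID_DIR=/var/run/spark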

[jira] [Commented] (SPARK-3720) support ORC in spark sql

2015-03-23 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376361#comment-14376361 ] Zhan Zhang commented on SPARK-3720: --- [~iward] Since this jira is duplicated to Spark

Re: Spark-thriftserver Issue

2015-03-23 Thread Zhan Zhang
Probably the port is already used by another service, e.g., Hive. You can change the port as shown below: ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev neilk

[jira] [Updated] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Attachment: SparkOffheapsupportbyHDFS.pdf Design doc for hdfs offheap support Provide OffHeap support

[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-23 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6479: -- Attachment: SparkOffheapsupportbyHDFS.pdf The design doc also includes stuff from SPARK-6112 Create

Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Zhan Zhang
Thanks Reynold. I agree with you on opening another JIRA to unify the block storage API. I have uploaded the design doc to SPARK-6479 as well. Thanks. Zhan Zhang On Mar 23, 2015, at 4:03 PM, Reynold Xin r...@databricks.com wrote: I created a ticket to separate the API

[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-23 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376978#comment-14376978 ] Zhan Zhang commented on SPARK-6479: --- The current API may not be good enough as it has

Re: Spark Job History Server

2015-03-20 Thread Zhan Zhang
Hi Patcharee, It is an alpha feature in the HDP distribution, integrating ATS with the Spark history server. If you are using upstream Spark, you can configure it as usual without these configurations. But other related configurations are still mandatory, such as the hdp.version-related ones. Thanks. Zhan Zhang

[jira] [Updated] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK

2015-03-19 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Summary: Provide OffHeap support through HDFS RAM_DISK (was: Leverage HDFS RAM_DISK capacity

Re: Saving Dstream into a single file

2015-03-16 Thread Zhan Zhang
Each RDD has multiple partitions, each of which will produce one HDFS file when saving output. I don’t think you are allowed to have multiple file handles writing to the same HDFS file. You can still load multiple files into Hive tables, right? Thanks. Zhan Zhang On Mar 15, 2015, at 7:31 AM
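
A common workaround is to coalesce each RDD to a single partition before saving, so every batch produces exactly one file (a sketch, assuming a DStream of strings; the output path is illustrative):

  dstream.foreachRDD { (rdd, time) =>
    // one partition in, one HDFS file out per batch
    rdd.coalesce(1).saveAsTextFile("hdfs:///output/batch-" + time.milliseconds)
  }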

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread Zhan Zhang
It happens during function evaluation in the line search: the value is either infinite or NaN, which may be caused by too large a step size. In the code, the step size is reduced to half. Thanks. Zhan Zhang On Mar 13, 2015, at 2:41 PM, cjwang c...@cjwang.us wrote: I am running LogisticRegressionWithLBFGS

Re: Process time series RDD after sortByKey

2015-03-09 Thread Zhan Zhang
one partition. iterPartition += 1 } You can refer to RDD.take for an example. Thanks. Zhan Zhang On Mar 9, 2015, at 3:41 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I am processing some time series data. For one day, it might have 500GB, then for each hour
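
A rough reconstruction of that per-partition pattern, which submits one job per partition the way RDD.take does internally (rdd and process are illustrative; the allowLocal parameter existed in the Spark 1.x runJob API):

  var iterPartition = 0
  while (iterPartition < rdd.partitions.length) {
    // fetch exactly one partition to the driver per job
    val part = sc.runJob(rdd, (it: Iterator[String]) => it.toArray,
      Seq(iterPartition), allowLocal = false).head
    process(part)  // hypothetical handler for one partition's data
    iterPartition += 1
  }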

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread Zhan Zhang
Do you mean “--hiveconf” (two dashes) instead of “-hiveconf” (one dash)? Thanks. Zhan Zhang On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote: Hello, I want to execute a hql script through the `spark-sql` command; my script contains: ``` ALTER TABLE xxx DROP PARTITION

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
the link to see why the shell failed in the first place. Thanks. Zhan Zhang On Mar 6, 2015, at 9:59 AM, Todd Nist tsind...@gmail.com wrote: First, thanks to everyone for their assistance and recommendations. @Marcelo I applied the patch that you recommended and am now able

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Sorry, I misunderstood. Looks like it already worked. If you still meet some hdp.version problem, you can try it :) Thanks. Zhan Zhang On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote: You are using 1.2.1 right? If so, please add java-opts

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
You are using 1.2.1, right? If so, please add a java-opts file in the conf directory and give it a try. [root@c6401 conf]# more java-opts -Dhdp.version=2.2.2.0-2041 Thanks. Zhan Zhang On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote: -Dhdp.version=2.2.0.0

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Zhan Zhang
/ Thanks. Zhan Zhang On Mar 5, 2015, at 11:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It seems from the excerpt below that your cluster is set up to use the Yarn ATS, and the code is failing in that path. I think you'll need to apply the following patch to your

[jira] [Created] (AMBARI-9952) Populate keytab to all spark components

2015-03-05 Thread Zhan Zhang (JIRA)
Zhan Zhang created AMBARI-9952: -- Summary: Populate keytab to all spark components Key: AMBARI-9952 URL: https://issues.apache.org/jira/browse/AMBARI-9952 Project: Ambari Issue Type: Bug

Re: RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Zhan Zhang
It uses a HashPartitioner to distribute the records to different partitions, but the key is just an integer spread evenly across the output partitions. From the code, each resulting partition will get a very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li l...@yahoo
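
The even spread can be checked directly (a sketch):

  val rdd = sc.parallelize(1 to 100000, 3).repartition(8)
  // count the records landing in each of the 8 output partitions
  rdd.mapPartitions(it => Iterator(it.size)).collect()
  // expect eight counts all very close to 12500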

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Zhan Zhang
: org.apache.spark.sql.SchemaRDD = SchemaRDD[3] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == Filter Contains(value#5, Restaurant) HiveTableScan [key#4,value#5], (MetastoreRelation default, testtable, None), None scala Thanks. Zhan Zhang On Mar 4, 2015, at 9:09 AM, Anusha Shamanur anushas

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Zhan Zhang
Do you have enough resources in your cluster? You can check your resource manager to see the usage. Thanks. Zhan Zhang On Mar 3, 2015, at 8:51 AM, abhi abhishek...@gmail.com wrote: I am trying to run the below Java class with yarn cluster, but it hangs in accepted

Re: Resource manager UI for Spark applications

2015-03-03 Thread Zhan Zhang
In YARN (cluster or client mode), you can access the Spark UI while the app is running. After the app is done, you can still access it, but that needs some extra setup for the history server. Thanks. Zhan Zhang On Mar 3, 2015, at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote: bq
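
The extra setup is roughly the following (a sketch; the log directory is illustrative):

  # spark-defaults.conf on the submitting side
  spark.eventLog.enabled          true
  spark.eventLog.dir              hdfs:///spark-history
  # on the history server host
  spark.history.fs.logDirectory   hdfs:///spark-history
  # then launch it with ./sbin/start-history-server.sh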

[jira] [Commented] (SPARK-6112) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon

2015-03-02 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343808#comment-14343808 ] Zhan Zhang commented on SPARK-6112: --- Will start scoping it. Leverage HDFS RAM_DISK

[jira] [Updated] (SPARK-6112) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon

2015-03-02 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Component/s: (was: Spark Core) Block Manager Leverage HDFS RAM_DISK capacity

[jira] [Created] (SPARK-6112) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon

2015-03-02 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-6112: - Summary: Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon Key: SPARK-6112 URL: https://issues.apache.org/jira/browse/SPARK-6112 Project

[jira] [Updated] (SPARK-6112) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon

2015-03-02 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Component/s: Spark Core Leverage HDFS RAM_DISK capacity to provide off_heap feature similar

[jira] [Updated] (SPARK-6112) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon

2015-03-02 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Component/s: (was: YARN) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar

[jira] [Updated] (SPARK-6112) Leverage HDFS RAM_DISK capacity to provide off_heap feature similar to Tachyon

2015-03-02 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Description: HDFS Lazy_Persist policy provide possibility to cache the RDD off_heap in hdfs. We may

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
You don’t need to know the RDD dependencies to maximize concurrency. Internally the scheduler will construct the DAG and trigger the execution if there are no shuffle dependencies between the RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
What confused me is the statement “The final result is that rdd1 is calculated twice.” Is that the expected behavior? Thanks. Zhan Zhang On Feb 26, 2015, at 3:03 PM, Sean Owen so...@cloudera.com wrote: To distill this a bit further, I don't think you actually want

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
.saveAsHadoopFile(…)] In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls will happen concurrently. Thanks. Zhan Zhang On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote: What confused me is the statement of The final result is that rdd1
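
One way to get that behavior is to cache the shared RDD and submit the two save jobs from separate threads, which Spark's scheduler allows (a sketch; parse, f, g and the paths are hypothetical, and saveAsTextFile stands in for saveAsHadoopFile):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  val rdd1 = sc.textFile("hdfs:///input").map(parse).cache()
  val s1 = Future { rdd1.map(f).saveAsTextFile("hdfs:///out1") }
  val s2 = Future { rdd1.map(g).saveAsTextFile("hdfs:///out2") }
  // both jobs run concurrently; rdd1 is computed once thanks to cache()
  Await.result(Future.sequence(Seq(s1, s2)), Duration.Inf)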

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
cores sitting idle. OOM: increasing the memory size and the JVM memory overhead may help here. Thanks. Zhan Zhang On Feb 26, 2015, at 2:03 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Imran, I have also observed the phenomenon of reducing the cores helping

Re: Running spark function on parquet without sql

2015-02-26 Thread Zhan Zhang
When you use SQL (or the API from SchemaRDD/DataFrame) to read data from Parquet, the optimizer will do column pruning, predicate pushdown, etc. Thus you get the benefits of Parquet's columnar format. After that, you can operate on the SchemaRDD (DF) like a regular RDD. Thanks. Zhan Zhang On Feb 26
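
For example, with the 1.2-era API (the path and column names are illustrative):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val people = sqlContext.parquetFile("hdfs:///people.parquet")
  people.registerTempTable("people")
  // column pruning and predicate pushdown happen inside the SQL layer
  val names = sqlContext.sql("SELECT name FROM people WHERE age > 21")
  // the result is an RDD of Rows, usable like any other RDD
  names.map(_.getString(0)).collect()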

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
Currently in Spark, it looks like there is no easy way to know the dependencies; they are resolved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: Ted. That one I know. It was the dependency part I was curious about On Feb

Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
context initiate YarnClusterSchedulerBackend instead of YarnClientSchedulerBackend, which I think is the root cause. Thanks. Zhan Zhang On Feb 25, 2015, at 1:53 PM, Zhan Zhang zzh...@hortonworks.com wrote: Hi Mate, When you initialize the JavaSparkContext, you don’t

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329664#comment-14329664 ] Zhan Zhang commented on SPARK-1537: --- [~vanzin] If you don't have bandwidth, or don't

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329649#comment-14329649 ] Zhan Zhang commented on SPARK-1537: --- [~vanzin] Thanks for the comments. I don't

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329678#comment-14329678 ] Zhan Zhang commented on SPARK-1537: --- [~vanzin] I declare integrate your code from

[jira] [Comment Edited] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329649#comment-14329649 ] Zhan Zhang edited comment on SPARK-1537 at 2/20/15 10:14 PM

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329700#comment-14329700 ] Zhan Zhang commented on SPARK-1537: --- [~sowen] From the whole context, I believe you

[jira] [Issue Comment Deleted] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Comment: was deleted (was: [~sowen] By the way, I am not waiting for someone to give me the patch

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329704#comment-14329704 ] Zhan Zhang commented on SPARK-1537: --- [~sowen] By the way, I am not waiting for someone

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329828#comment-14329828 ] Zhan Zhang commented on SPARK-1537: --- [~sowen] In JIRA, we share the code so that other

[jira] [Updated] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-19 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Attachment: SPARK-1537.txt High level design doc for spark ATS integration. Add integration

[jira] [Updated] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-19 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Attachment: spark-1573.patch Patch against v1.2.1 Add integration with Yarn's Application Timeline

[jira] [Commented] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.

2015-02-18 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326443#comment-14326443 ] Zhan Zhang commented on SPARK-5889: --- https://github.com/apache/spark/pull/4676 remove

[jira] [Created] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.

2015-02-18 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-5889: - Summary: remove pid file in spark-daemon.sh after killing the process. Key: SPARK-5889 URL: https://issues.apache.org/jira/browse/SPARK-5889 Project: Spark Issue

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-18 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326778#comment-14326778 ] Zhan Zhang commented on SPARK-1537: --- I have sent a PR with WIP for people who

Re: Can't access remote Hive table from spark

2015-02-12 Thread Zhan Zhang
When you log in, you have root access. Then you can do “su hdfs” or switch to any other account. Then you can create HDFS directories and change permissions, etc. Thanks Zhan Zhang On Feb 11, 2015, at 11:28 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi Zhan, Yes, I found

Re: Can't access remote Hive table from spark

2015-02-11 Thread Zhan Zhang
You need to have the right HDFS account, e.g., hdfs, to create directories and assign permissions. Thanks. Zhan Zhang On Feb 11, 2015, at 4:34 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi Zhan, My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried

[jira] [Created] (AMBARI-9583) Add kerberos support for spark

2015-02-11 Thread Zhan Zhang (JIRA)
Zhan Zhang created AMBARI-9583: -- Summary: Add kerberos support for spark Key: AMBARI-9583 URL: https://issues.apache.org/jira/browse/AMBARI-9583 Project: Ambari Issue Type: Bug

[jira] [Updated] (AMBARI-9583) Add kerberos support for spark

2015-02-11 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/AMBARI-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated AMBARI-9583: --- Attachment: Ambari-9583.patch patch for kerberos support Add kerberos support for spark

[jira] [Commented] (AMBARI-9583) Add kerberos support for spark

2015-02-11 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/AMBARI-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317191#comment-14317191 ] Zhan Zhang commented on AMBARI-9583: ReviewBoard: https://reviews.apache.org/r/30896

[jira] [Updated] (AMBARI-9583) Add kerberos support for spark

2015-02-11 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/AMBARI-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated AMBARI-9583: --- Attachment: 0001-spark-kerberos-support.patch Changed the patch to the right format Add kerberos

Re: Can't access remote Hive table from spark

2015-02-07 Thread Zhan Zhang
Yes. You need to create xiaobogu under /user and give the right permissions to xiaobogu. Thanks. Zhan Zhang On Feb 7, 2015, at 8:15 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi Zhan Zhang, With the pre-built version 1.2.0 of Spark against the yarn cluster installed
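
Concretely, something like the following (a sketch; run with HDFS superuser rights):

  su hdfs
  hdfs dfs -mkdir -p /user/xiaobogu
  hdfs dfs -chown xiaobogu:xiaobogu /user/xiaobogu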

Re: Can't access remote Hive table from spark

2015-02-05 Thread Zhan Zhang
Not sure about Spark standalone mode, but on Spark-on-YARN it should work. You can check the following link: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ Thanks. Zhan Zhang On Feb 5, 2015, at 5:02 PM, Cheng Lian lian.cs@gmail.com wrote: Please note

Re: Welcoming three new committers

2015-02-03 Thread Zhan Zhang
Congratulations! On Feb 3, 2015, at 2:34 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley and Sean Owen. All three have been major contributors to Spark in the past year: Cheng on Spark SQL, Joseph on

Re: Spark impersonation

2015-02-02 Thread Zhan Zhang
I think you can configure Hadoop/Hive to do impersonation. There is no difference between a secure and an insecure Hadoop cluster when using kinit. Thanks. Zhan Zhang On Feb 2, 2015, at 9:32 PM, Koert Kuipers ko...@tresata.com wrote: yes jobs run as the user that launched
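
On the Hadoop side, impersonation is enabled with proxy-user settings in core-site.xml (a sketch; the superuser name "hive" and the wildcard values are illustrative):

  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>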

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Zhan Zhang
You are running in yarn-client mode. How about increasing --driver-memory and giving it a try? Thanks. Zhan Zhang On Jan 29, 2015, at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote: Dear all, I have no idea why it raises an error when I run the following code. def

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-29 Thread Zhan Zhang
I think it is expected. Refer to the comments in saveAsTable: “Note that this currently only works with SchemaRDDs that are created from a HiveContext”. If I understand correctly, here SchemaRDD means those generated by HiveContext.sql, instead of by applySchema. Thanks. Zhan Zhang On Jan 29

Re: Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread Zhan Zhang
You can put hive-site.xml in your conf/ directory. It will connect to Hive when HiveContext is initialized. Thanks. Zhan Zhang On Jan 21, 2015, at 12:35 PM, YaoPau jonrgr...@gmail.com wrote: Is this possible, and if so what steps do I need to take to make this happen? -- View
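
A minimal sketch of that flow in spark-shell, assuming hive-site.xml is already in conf/:

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  // the metastore connection from hive-site.xml is picked up on first use
  hiveContext.sql("SHOW TABLES").collect().foreach(println)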

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Zhan Zhang
You can try to add it in conf/spark-defaults.conf: # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" Thanks. Zhan Zhang On Jan 16, 2015, at 9:56 AM, Michel Dufresne sparkhealthanalyt...@gmail.com wrote: Hi All, I'm trying to set some JVM

OOM for HiveFromSpark example

2015-01-13 Thread Zhan Zhang
Hi Folks, I am trying to run a Hive context in yarn-cluster mode, but met some errors. Does anybody know what causes the issue? I used the following command to build the distribution: ./make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 15/01/13 17:59:42 INFO

[jira] [Created] (SPARK-5110) Spark-on-Yarn does not work on windows platform

2015-01-06 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-5110: - Summary: Spark-on-Yarn does not work on windows platform Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug

[jira] [Created] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-01-06 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-5111: - Summary: HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark

[jira] [Updated] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.

2015-01-06 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-5108: -- Summary: Need to make jackson dependency version consistent with hadoop-2.6.0. (was: Need to add more

[jira] [Commented] (SPARK-5110) Spark-on-Yarn does not work on windows platform

2015-01-06 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266744#comment-14266744 ] Zhan Zhang commented on SPARK-5110: --- You are right. I will mark this as a duplicate. Spark

[jira] [Closed] (SPARK-5110) Spark-on-Yarn does not work on windows platform

2015-01-06 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang closed SPARK-5110. - Resolution: Duplicate Spark-on-Yarn does not work on windows platform

[jira] [Created] (SPARK-5108) Need to add more jackson dependency for hadoop-2.6.0 support.

2015-01-06 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-5108: - Summary: Need to add more jackson dependency for hadoop-2.6.0 support. Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue

[jira] [Commented] (SPARK-5108) Need to add more jackson dependency for hadoop-2.6.0 support.

2015-01-06 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266529#comment-14266529 ] Zhan Zhang commented on SPARK-5108: --- [~sowen] You are right. Need to add more

[jira] [Commented] (SPARK-5108) Need to add more jackson dependency for hadoop-2.6.0 support.

2015-01-06 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266530#comment-14266530 ] Zhan Zhang commented on SPARK-5108: --- java.lang.NoSuchMethodError

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Zhan Zhang
I think it is an overflow; the training data is quite big. The algorithm's scalability highly depends on the vocabSize. Even without overflow, there are still other bottlenecks, for example, syn0Global and syn1Global: each of them has vocabSize * vectorSize elements. Thanks. Zhan Zhang On Jan

Re: Spark 1.2 + Avro file does not work in HDP2.2

2014-12-16 Thread Zhan Zhang
Hi Manas, There is a small patch needed for HDP 2.2. You can refer to this PR: https://github.com/apache/spark/pull/3409 There are some other issues compiling against hadoop2.6, but we will fully support it very soon. You can ping me if you want. Thanks. Zhan Zhang On Dec 12, 2014, at 11:38

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Zhan Zhang
Please check whether https://github.com/apache/spark/pull/3409#issuecomment-64045677 solves the problem for launching the AM. Thanks. Zhan Zhang On Dec 1, 2014, at 4:49 PM, Mohammad Islam misla...@yahoo.com.INVALID wrote: Hi, How do I pass Java options (such as -XX:MaxMetaspaceSize=100M) when

Re: How spark and hive integrate in long term?

2014-11-22 Thread Zhan Zhang
some basic functions using hive-0.13 connect to hive-0.14 metastore, and it looks like they are compatible. Thanks. Zhan Zhang On Nov 22, 2014, at 7:14 AM, Cheng Lian lian.cs@gmail.com wrote: Should emphasize that this is still a quick and rough conclusion, will investigate

[jira] [Commented] (SPARK-4461) Pass java options to yarn master to handle system properties correctly.

2014-11-21 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221474#comment-14221474 ] Zhan Zhang commented on SPARK-4461: --- I changed the title so that it reflects the issue

[jira] [Commented] (SPARK-4461) Pass java options to yarn master to handle system properties correctly.

2014-11-21 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221502#comment-14221502 ] Zhan Zhang commented on SPARK-4461: --- Thanks for the information Marcelo. I changed

[jira] [Comment Edited] (SPARK-4461) Pass java options to yarn master to handle system properties correctly.

2014-11-21 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221502#comment-14221502 ] Zhan Zhang edited comment on SPARK-4461 at 11/21/14 10:23 PM

How spark and hive integrate in long term?

2014-11-21 Thread Zhan Zhang
on hive, e.g., metastore, thriftserver, hcatalog may not be able to help much. Does anyone have any insight or ideas in mind? Thanks. Zhan Zhang -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html Sent

Re: How spark and hive integrate in long term?

2014-11-21 Thread Zhan Zhang
and more features added, it would be great if users can take advantage of both. Currently, Spark SQL gives us such benefits partially, but I am wondering how to keep such integration in the long term. Thanks. Zhan Zhang On Nov 21, 2014, at 3:12 PM, Dean Wampler deanwamp...@gmail.com wrote: I can't comment

[jira] [Created] (SPARK-4461) Spark should not relies on mapred-site.xml for classpath

2014-11-17 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-4461: - Summary: Spark should not relies on mapred-site.xml for classpath Key: SPARK-4461 URL: https://issues.apache.org/jira/browse/SPARK-4461 Project: Spark Issue Type

[jira] [Updated] (SPARK-4461) Spark should not relies on mapred-site.xml for classpath

2014-11-17 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-4461: -- Description: Currently Spark reads mapred-site.xml to get the classpath. From Hadoop 2.6, the library

Re: Spark SQL Hive Version

2014-11-05 Thread Zhan Zhang
. You can refer to https://github.com/apache/spark/pull/2685 for the whole story. Thanks. Zhan Zhang On Nov 5, 2014, at 4:47 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, all, I noticed that when compiling SparkSQL with the profile “hive-0.13.1”, it will fetch the Hive

[jira] [Commented] (SPARK-3720) support ORC in spark sql

2014-11-03 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195402#comment-14195402 ] Zhan Zhang commented on SPARK-3720: --- [~neoword] wangfei and I are working together

[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-11-03 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195427#comment-14195427 ] Zhan Zhang commented on SPARK-2883: --- [~neoword] , As [~marmbrus] mentioned, the PR need

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2014-10-30 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191052#comment-14191052 ] Zhan Zhang commented on SPARK-1537: --- YARN-2521 can make the client easier to use

Re: Use RDD like a Iterator

2014-10-30 Thread Zhan Zhang
] = { sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p), allowLocal = false).head } (0 until partitions.length).iterator.flatMap(i => collectPartition(i)) } Thanks. Zhan Zhang On Oct 29, 2014, at 3:43 AM, Yanbo Liang yanboha...@gmail.com wrote: RDD.toLocalIterator
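
That reconstructed snippet is essentially RDD.toLocalIterator, which is used like this (a sketch):

  // pulls one partition at a time to the driver, instead of collect()'s all-at-once
  val it = sc.parallelize(1 to 1000000, 10).toLocalIterator
  it.take(5).foreach(println)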

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2014-10-29 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189651#comment-14189651 ] Zhan Zhang commented on SPARK-1537: --- Hi Marcelo, Do you have update on this? If you

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Zhan Zhang
“-Phive” enables hive-0.13.1, and “-Phive -Phive-0.12.0” enables hive-0.12.0. Note that the thrift-server is not yet supported with hive-0.13, but it is expected to go upstream soon (SPARK-3720). Thanks. Zhan Zhang On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com wrote

Re: run multiple spark applications in parallel

2014-10-28 Thread Zhan Zhang
You can set your executor number with --num-executors. Also, changing to yarn-client mode saves you one container for the driver. Then check your YARN resource manager to make sure there are more containers available to serve your extra apps. Thanks. Zhan Zhang On Oct 28, 2014, at 5:31 PM, Soumya Simanta
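
For instance (a sketch; the class and jar names are illustrative):

  ./bin/spark-submit --master yarn-client --num-executors 4 \
    --executor-memory 2g --class com.example.App app.jar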

Re: Use RDD like a Iterator

2014-10-28 Thread Zhan Zhang
I think it is already lazily computed, or do you mean something else? The following is the signature of compute in RDD: def compute(split: Partition, context: TaskContext): Iterator[T] Thanks. Zhan Zhang On Oct 28, 2014, at 8:15 PM, Dai, Kevin yun...@ebay.com wrote: Hi, ALL I have a RDD[T

Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Zhan Zhang
Can you use row(i).asInstanceOf[]? Thanks. Zhan Zhang On Oct 28, 2014, at 5:03 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – The Spark SQL Row class has methods such as getInt, getLong, getBoolean, getFloat, getDouble, etc. However, I don’t see a getDate method. So how can
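
A sketch of that cast (row and i come from the question's context; the timestamp type is an assumption about the column):

  // for a timestamp column, Spark SQL returns java.sql.Timestamp
  val ts = row(i).asInstanceOf[java.sql.Timestamp]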

[jira] [Created] (SPARK-4103) Clean up SessionState in HiveContext

2014-10-27 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-4103: - Summary: Clean up SessionState in HiveContext Key: SPARK-4103 URL: https://issues.apache.org/jira/browse/SPARK-4103 Project: Spark Issue Type: Bug

[jira] [Commented] (SPARK-4103) Clean up SessionState in HiveContext

2014-10-27 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185889#comment-14185889 ] Zhan Zhang commented on SPARK-4103: --- There are already some efforts (Spark-4037

[jira] [Comment Edited] (SPARK-4103) Clean up SessionState in HiveContext

2014-10-27 Thread Zhan Zhang (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185889#comment-14185889 ] Zhan Zhang edited comment on SPARK-4103 at 10/27/14 10:56 PM
