Re: How this unit test passed on master trunk?

2016-04-23 Thread Zhan Zhang
struct(1, 2). Please check how the Ordering is implemented in InterpretedOrdering. The output itself does not have any ordering. I am not sure why the unit test and the real environment behave differently. Xiao, I do see the difference between the unit test and the local cluster run. Do you know the reaso

Re: Save DataFrame to HBase

2016-04-22 Thread Zhan Zhang
You can try this https://github.com/hortonworks/shc.git or here http://spark-packages.org/package/zhzhan/shc Currently it is in the process of being merged into HBase. Thanks. Zhan Zhang On Apr 21, 2016, at 8:44 AM, Benjamin Kim <bbuil...@gmail.com> wr
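For reference, a minimal write with shc might look like the following. This is a sketch based on the shc README of that era; the catalog string, the HBaseTableCatalog option names, and the data source name are all taken from that project and may differ across versions.

    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    // Hypothetical catalog mapping DataFrame columns to an HBase table.
    val catalog = """{
      "table":  {"namespace": "default", "name": "table1"},
      "rowkey": "key",
      "columns": {
        "col0": {"cf": "rowkey", "col": "key",  "type": "string"},
        "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
      }
    }"""

    // df is an existing DataFrame whose columns match the catalog above.
    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
                   HBaseTableCatalog.newTable -> "5"))  // create table with 5 regions if absent
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()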

Re: Spark SQL insert overwrite table not showing all the partition.

2016-04-22 Thread Zhan Zhang
INSERT OVERWRITE will overwrite any existing data in the table or partition, unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0, https://issues.apache.org/jira/browse/HIVE-2612). Thanks. Zhan Zhang On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak <bkpat..
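To illustrate the two behaviors in Spark SQL (a sketch; the events/staging_events tables and the partition values are made up):

    // Replaces whatever is currently in the dt='2016-04-21' partition.
    sqlContext.sql(
      """INSERT OVERWRITE TABLE events PARTITION (dt = '2016-04-21')
        |SELECT id, name FROM staging_events WHERE dt = '2016-04-21'""".stripMargin)

    // With IF NOT EXISTS (Hive 0.9.0+), the partition is written only if it
    // does not exist yet; an existing partition is left untouched.
    sqlContext.sql(
      """INSERT OVERWRITE TABLE events PARTITION (dt = '2016-04-21') IF NOT EXISTS
        |SELECT id, name FROM staging_events WHERE dt = '2016-04-21'""".stripMargin)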

Re: Spark DataFrame sum of multiple columns

2016-04-22 Thread Zhan Zhang
You can define your own udf; the following is one example. Thanks Zhan Zhang val foo = udf((a: Int, b: String) => a.toString + b) checkAnswer( // SELECT *, foo(key, value) FROM testData testData.select($"*", foo('key, 'value)).limit(3), On Apr 21, 2016, at 8:51 PM, Naveen
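Stripped of the Spark test harness (checkAnswer), a self-contained version of that example might look like this. It is a sketch that assumes a testData DataFrame with key: Int and value: String columns:

    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    // UDF that concatenates an Int column and a String column.
    val foo = udf((a: Int, b: String) => a.toString + b)

    // SELECT *, foo(key, value) FROM testData
    testData.select($"*", foo($"key", $"value")).show(3)

For the original question (summing several numeric columns), a plain column expression avoids the UDF entirely, e.g. testData.select(($"a" + $"b" + $"c").as("total")), assuming a, b, and c are the columns to sum.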

Re: Why Spark having OutOfMemory Exception?

2016-04-21 Thread Zhan Zhang
. Thanks. Zhan Zhang On Apr 20, 2016, at 1:38 AM, 李明伟 <kramer2...@126.com> wrote: Hi, the input data size is less than 10M. The task result size should be less, I think, because I am doing aggregation on the data. At 2016-04-20 16:18:31, "Jeff Zhang"

Re: Read Parquet in Java Spark

2016-04-18 Thread Zhan Zhang
You can try something like below, if you only have one column. val rdd = parquetFile.javaRDD().map(row => row.getAs[String](0)) Thanks. Zhan Zhang On Apr 18, 2016, at 3:44 AM, Ramkumar V <ramkumar.c...@gmail.com> wrote: HI, Any idea on this ?
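A slightly fuller version of the same idea in Scala (a sketch; the path is made up, and the closing parenthesis missing from the original snippet is restored above):

    val parquetFile = sqlContext.read.parquet("/path/to/data.parquet")

    // Pull the first column of each Row out as a String.
    val firstColumn = parquetFile.rdd.map(row => row.getString(0))

    firstColumn.take(5).foreach(println)

From Java, parquetFile.javaRDD() gives the same rows as a JavaRDD<Row>.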

Re: Why Spark having OutOfMemory Exception?

2016-04-18 Thread Zhan Zhang
What kind of OOM? Driver or executor side? You can use the coredump to find what caused the OOM. Thanks. Zhan Zhang On Apr 18, 2016, at 9:44 PM, 李明伟 <kramer2...@126.com> wrote: Hi Samaga, Thanks very much for your reply and sorry for the delayed reply. Cassa

Re: Problem using limit clause in spark sql

2015-12-23 Thread Zhan Zhang
to be materialized in each partition, because some partitions may not have enough records; sometimes a partition is even empty. I didn't see any straightforward workaround for this. Thanks. Zhan Zhang On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com>

Re: Unable to create hive table using HiveContext

2015-12-23 Thread Zhan Zhang
You are using embedded mode, which will create the db locally (in your case, maybe the db has been created, but you do not have the right permission?). To connect to a remote metastore, hive-site.xml has to be correctly configured. Thanks. Zhan Zhang On Dec 23, 2015, at 7:24 AM, Soni spark

Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Zhan Zhang
Now json, parquet, orc (in hivecontext), and text are natively supported. If you use avro or others, you have to include the package, which is not built into the spark jar. Thanks. Zhan Zhang On Dec 23, 2015, at 8:57 AM, Christopher Brady <christopher.br...@oracle.com>
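For example (a sketch; the paths and the spark-avro coordinates are only illustrative):

    // Built-in sources can be named directly.
    df.write.format("json").save("/tmp/out_json")
    df.write.format("parquet").save("/tmp/out_parquet")
    df.write.format("orc").save("/tmp/out_orc")      // ORC needs HiveContext in 1.x

    // External sources such as Avro need their package on the classpath, e.g.
    //   spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 ...
    // and are referenced by their full data source name:
    df.write.format("com.databricks.spark.avro").save("/tmp/out_avro")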

Re: Can SqlContext be used inside mapPartitions

2015-12-22 Thread Zhan Zhang
SQLContext is on the driver side, and I don't think you can use it in executors. How to provide lookup functionality in executors really depends on how you would use them. Thanks. Zhan Zhang On Dec 22, 2015, at 4:44 PM, SRK <swethakasire...@gmail.com> wrote: > Hi, > > Can SQL

Re: number limit of map for spark

2015-12-21 Thread Zhan Zhang
In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver. Thanks. Zhan Zhang On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>

Re: number limit of map for spark

2015-12-21 Thread Zhan Zhang
application. Thanks. Zhan Zhang On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote: What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition? Thank

Re: Spark with log4j

2015-12-21 Thread Zhan Zhang
it, at application run time, you can log into the container's box and check the container's local cache to find whether the log file exists or not (after the app terminates, these local cache files will be deleted as well). Thanks. Zhan Zhang On Dec 18, 2015, at 7:23 AM, Kalpesh Jadhav <kalpesh.

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Zhan Zhang
BTW: It is not only a Yarn-webui issue. In the capacity scheduler, vcore is ignored. If you want Yarn to honor vcore requests, you have to use DominantResourceCalculator as Saisai suggested. Thanks. Zhan Zhang On Dec 21, 2015, at 5:30 PM, Saisai Shao <sai.sai.s...@gmail.com>
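For reference, the switch Saisai referred to lives in capacity-scheduler.xml:

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>

The default DefaultResourceCalculator schedules on memory alone, which is why vcore requests appear to be ignored.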

Re: About Spark On Hbase

2015-12-15 Thread Zhan Zhang
If you want dataframe support, you can refer to https://github.com/zhzhan/shc, which I am working on integrating into HBase upstream with existing support. Thanks. Zhan Zhang On Dec 15, 2015, at 4:34 AM, censj <ce...@lotuseed.com> wrote: hi, fight fa

Re: Spark big rdd problem

2015-12-15 Thread Zhan Zhang
You should be able to get the logs from yarn with "yarn logs -applicationId xxx", where you can possibly find the cause. Thanks. Zhan Zhang On Dec 15, 2015, at 11:50 AM, Eran Witkon <eranwit...@gmail.com> wrote: > When running > val data = sc.wholeTextFile("someDir/*") d

Re: Spark big rdd problem

2015-12-15 Thread Zhan Zhang
There are two cases here. If the container is killed by yarn, you can increase the JVM memory overhead. Otherwise, you have to increase the executor memory if there is no memory leak happening. Thanks. Zhan Zhang On Dec 15, 2015, at 9:58 PM, Eran Witkon <eranwit...@gmail.com>

Re: Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
I noticed that it is configurable at the job level via spark.task.cpus. Is there any way to support it at the task level? Thanks. Zhan Zhang On Dec 11, 2015, at 10:46 AM, Zhan Zhang <zzh...@hortonworks.com> wrote: > Hi Folks, > > Is it possible to assign multiple cores per task, and how? Suppo

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Zhan Zhang
I think you are fetching too many results to the driver. Typically, it is not recommended to collect much data to the driver. But if you have to, you can increase the driver memory when submitting jobs. Thanks. Zhan Zhang On Dec 11, 2015, at 6:14 AM, Tom Seddon <mr.tom.sed...@gmail.
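For example, at submission time (the sizes are arbitrary and the class/jar are placeholders; spark.driver.maxResultSize caps the total size of serialized results fetched back to the driver):

    spark-submit \
      --driver-memory 8g \
      --conf spark.driver.maxResultSize=4g \
      --class com.example.MyJob myjob.jar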

Re: Performance does not increase as the number of workers increasing in cluster mode

2015-12-11 Thread Zhan Zhang
set if you want to do some performance benchmark. Thanks. Zhan Zhang On Dec 11, 2015, at 9:34 AM, Wei Da <xwd0...@qq.com> wrote: Hi, all. I have done a test on different HW configurations of Spark 1.5.0. A KMeans algorithm has been run in four dif

Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
it makes sense to add this feature. It may seem to make users worry about more configuration, but by default we can still do 1 core per task, and only advanced users need to be aware of this feature. Thanks. Zhan Zhang

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Zhan Zhang
As Sean mentioned, you cannot refer to a local file on your remote machines (executors). One workaround is to copy the file to all machines in the same directory. Thanks. Zhan Zhang On Dec 11, 2015, at 10:26 AM, Lin, Hao <hao@finra.org>

Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Zhan Zhang
When you have the following query, 'account === "acct1" will be pushed down to generate a new query with "where account = acct1". Thanks. Zhan Zhang On Nov 18, 2015, at 11:36 AM, Eran Medan <eran.me...@gmail.com> wrote: I understand that the following ar
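A sketch of that flow with the Spark 1.4+ read API (the JDBC URL and table name are made up):

    import sqlContext.implicits._

    val props = new java.util.Properties()
    val accounts = sqlContext.read.jdbc("jdbc:postgresql://dbhost/mydb", "accounts", props)

    // The equality filter is pushed down into the generated JDBC query,
    // roughly: SELECT ... FROM accounts WHERE account = 'acct1'
    accounts.filter($"account" === "acct1").show()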

Re: Spark Thrift doesn't start

2015-11-11 Thread Zhan Zhang
In the hive-site.xml, you can remove all configuration related to tez and give it a try again. Thanks. Zhan Zhang On Nov 10, 2015, at 10:47 PM, DaeHyun Ryu <ry...@kr.ibm.com> wrote: Hi folks, I configured tez as the execution engine of Hive. After doing that

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Zhan Zhang
Thanks Ted. I am using the latest master branch. I will try your build command and give it a try. Thanks. Zhan Zhang On Nov 9, 2015, at 10:46 AM, Ted Yu <yuzhih...@gmail.com> wrote: Which branch did you perform the build with? I used the following comma

Anybody hit this issue in spark shell?

2015-11-09 Thread Zhan Zhang
Hi Folks, has anybody met the following issue? I use "mvn package -Phive -DskipTests" to build the package. Thanks. Zhan Zhang bin/spark-shell ... Spark context available as sc. error: error while loading QueryExecution, Missing dependency 'bad symbolic reference. A sign

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
1:9083 HW11188:spark zzhang$ By the way, I don't know whether there is any caveat for this workaround. Thanks. Zhan Zhang On Nov 6, 2015, at 2:40 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi Zhan, I don't use HiveContext features at

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
I agree with the minor change: adding a config to provide the option to init SQLContext or HiveContext, with HiveContext as the default, instead of bypassing it when hitting the Exception. Thanks. Zhan Zhang On Nov 6, 2015, at 2:53 PM, Ted Yu <yuzhih...@gmail.com>

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
If your assembly jar has the hive jar included, the HiveContext will be used. Typically, HiveContext has more functionality than SQLContext. In what case do you have to use SQLContext for something that cannot be done by HiveContext? Thanks. Zhan Zhang On Nov 6, 2015, at 10:43 AM, Jerry Lam <chiling...@gmail.

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
Hi Jerry, https://issues.apache.org/jira/browse/SPARK-11562 is created for the issue. Thanks. Zhan Zhang On Nov 6, 2015, at 3:01 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi Zhan, Thank you for providing a workaround! I will try this out but I ag

Re: Vague Spark SQL error message with saveAsParquetFile

2015-11-03 Thread Zhan Zhang
Looks like some JVM got killed or hit OOM. You can check the log to see the real cause. Thanks. Zhan Zhang On Nov 3, 2015, at 9:23 AM, YaoPau <jonrgr...@gmail.com> wrote: java.io.FileNotFoun

Re: Upgrade spark cluster to latest version

2015-11-03 Thread Zhan Zhang
Spark is a client library. You can just download the latest release or build it on your own, and replace your existing one without changing your existing cluster. Thanks. Zhan Zhang On Nov 3, 2015, at 3:58 PM, roni <roni.epi...@gmail.com> wrote: Hi S

Re: sql query orc slow

2015-10-13 Thread Zhan Zhang
the JIRA number? Thanks. Zhan Zhang On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra <patcharee.thong...@uni.no> wrote: Hi Zhan Zhang, Is my problem (which is that the ORC predicate is not generated from the WHERE clause even though spark.sql.orc.filterPushdo

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
versions of OrcInputFormat. The hive path may use NewOrcInputFormat, but the spark path uses OrcInputFormat. Thanks. Zhan Zhang On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> wrote: > Yes, the predicate pushdown is enabled, but still takes longer time than the >

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
In your case, you manually set an AND pushdown, and the predicate is right based on your setting: leaf-0 = (EQUALS x 320). The right way is to enable the predicate pushdown as follows. sqlContext.setConf("spark.sql.orc.filterPushdown", "true") Thanks. Zhan Zhang On Oct 9

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
That is weird. Unfortunately, there is no debug info available on this part. Can you please open a JIRA to add some debug information on the driver side? Thanks. Zhan Zhang On Oct 9, 2015, at 10:22 AM, patcharee <patcharee.thong...@uni.no> w

Re: sql query orc slow

2015-10-08 Thread Zhan Zhang
Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote: > Hi, > > I am using spark sql 1.5 to query a hive table stored as partitioned orc > file. We have the to

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
It should be similar to other hadoop jobs. You need the hadoop configuration on your client machine, and to point HADOOP_CONF_DIR in spark to that configuration. Thanks Zhan Zhang On Sep 22, 2015, at 6:37 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
, the former is used to access hdfs, and the latter is used to launch applications on top of yarn. Then in spark-env.sh, you add export HADOOP_CONF_DIR=/etc/hadoop/conf. Thanks. Zhan Zhang On Sep 22, 2015, at 8:14 PM, Zhiliang Zhu <zchl.j...@yahoo.com> wro

Re: how to submit the spark job outside the cluster

2015-09-22 Thread Zhan Zhang
. Zhan Zhang On Sep 22, 2015, at 7:49 PM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote: Hi Zhan, Thanks very much for your helpful comment. I also viewed it as similar to hadoop job submit; however, I was not sure whether it is like that when it comes to spar

Re: HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Zhan Zhang
Hi Krishna, for the time being, you can download from upstream, and it should run OK on HDP 2.3. For HDP-specific problems, you can ask in the Hortonworks forum. Thanks. Zhan Zhang On Sep 22, 2015, at 3:42 PM, Krishna Sankar <ksanka...@gmail.com>

Re: PrunedFilteredScan does not work for UDTs and Struct fields

2015-09-19 Thread Zhan Zhang
It looks complicated, but I think it would work. Thanks. Zhan Zhang From: Richard Eggert <richard.egg...@gmail.com> Sent: Saturday, September 19, 2015 3:59 PM To: User Subject: PrunedFilteredScan does not work for UDTs and Struct fields I defined my own rela

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Zhan Zhang
If you are using spark-1.4.0, probably it is caused by SPARK-8458 (https://issues.apache.org/jira/browse/SPARK-8458). Thanks. Zhan Zhang On Aug 23, 2015, at 12:49 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote: Ted, Thanks for the suggestions. Actually

Re: Authentication Support with spark-submit cluster mode

2015-07-29 Thread Zhan Zhang
If you run it on yarn with a kerberos setup, you authenticate yourself by kinit before launching the job. Thanks. Zhan Zhang On Jul 28, 2015, at 8:51 PM, Anh Hong <hongnhat...@yahoo.com.INVALID> wrote: Hi, I'd like to remotely run spark-submit from a local

Re: [SPAM] Customized Aggregation Query on Spark SQL

2015-04-30 Thread Zhan Zhang
One optimization is to reduce the shuffle by first aggregating locally (only keep the max for each name), and then reduceByKey. Thanks. Zhan Zhang On Apr 24, 2015, at 10:03 PM, ayan guha <guha.a...@gmail.com> wrote: Here you go t = [[A,10,A10],[A,20,A20],[A,30,A30
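A sketch of the idea (names and values are made up). Note that reduceByKey already performs map-side combining, so expressing the max this way keeps only one running value per name in each partition before anything is shuffled:

    // (name, value) pairs
    val pairs = sc.parallelize(Seq(("A", 10), ("A", 20), ("A", 30), ("B", 5)))

    // Map-side combine keeps the running max per name locally,
    // then the shuffle moves at most one record per name per partition.
    val maxPerName = pairs.reduceByKey(math.max(_, _))

    maxPerName.collect().foreach(println)  // (A,30), (B,5) — order may vary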

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
Hi Udit, by the way, do you mind sharing the whole log trace? Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit Mehta <ume...@groupon.com> wrote: I am just trying to launch a spark shell and not do anything fancy. I got the binary distribution from apache and put

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
: For spark-1.3, you can use the binary distribution from apache. Thanks. Zhan Zhang On Apr 17, 2015, at 2:01 PM, Udit Mehta <ume...@groupon.com> wrote: I followed the steps described above and I still get this error: Error: Could not find or load main class

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
You probably want to first try the basic configuration to see whether it works, instead of setting SPARK_JAR pointing to the hdfs location. This error is caused by not finding ExecutorLauncher in the class path, and is not HDP-specific, I think. Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
[root@c6402 conf]# Thanks. Zhan Zhang On Apr 17, 2015, at 3:09 PM, Udit Mehta <ume...@groupon.com> wrote: Hi, This is the log trace: https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47 On the yarn RM UI, I see: Error: Could not find or load main class

Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zhan Zhang
-2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 This is an HDP-specific question, and you can move the topic to the HDP forum. Thanks. Zhan Zhang On Apr 13, 2015, at 3:00 AM, Zork Sail <zorks...@gmail.com> wrote: Hi Zhan, Alas setting: -Dhdp.version=2.2.0.0

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-30 Thread Zhan Zhang
/spark-defaults.conf, adding the following settings: spark.driver.extraJavaOptions -Dhdp.version=x spark.yarn.am.extraJavaOptions -Dhdp.version=x 3. In $SPARK_HOME/java-opts, add the following option: -Dhdp.version=x Thanks. Zhan Zhang On Mar 30, 2015, at 6:56 AM, Doug Balog

Re: 2 input paths generate 3 partitions

2015-03-27 Thread Zhan Zhang
Hi Rares, the number of partitions is controlled by the HDFS input format, and one file may have multiple partitions if it consists of multiple blocks. In your case, I think there is one file with 2 splits. Thanks. Zhan Zhang On Mar 27, 2015, at 3:12 PM, Rares Vernica <rvern...@gmail.com>

Re: Can't access file in spark, but can in hadoop

2015-03-27 Thread Zhan Zhang
Probably a guava version conflict issue. What spark version did you use, and which hadoop version was it compiled against? Thanks. Zhan Zhang On Mar 27, 2015, at 12:13 PM, Johnson, Dale <daljohn...@ebay.com> wrote: Yes, I could recompile the hdfs client with more logging

RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
[] | ShuffledRDD[2] at reduceByKey at console:25 [] +-(8) MapPartitionsRDD[1] at map at console:23 [] | ParallelCollectionRDD[0] at parallelize at console:21 [] Thanks. Zhan Zhang

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
with keeping the key part untouched; mapValues may not be able to do this. Changing the code to allow this is trivial, but I don't know whether there is some special reason behind it. Thanks. Zhan Zhang On Mar 26, 2015, at 2:49 PM, Jonathan Coveney <jcove...@gmail.com>

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks all for the quick response. Thanks. Zhan Zhang On Mar 26, 2015, at 3:14 PM, Patrick Wendell pwend...@gmail.com wrote: I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver: -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote: I am facing the same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
You can do it in $SPARK_HOME/conf/spark-defaults.conf: spark.driver.extraJavaOptions -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote: Where and how do I pass this or other JVM arguments? -XX:MaxPermSize

Re: Spark-thriftserver Issue

2015-03-24 Thread Zhan Zhang
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal <anubha...@gmail.com>

Re: Spark-thriftserver Issue

2015-03-23 Thread Zhan Zhang
Probably the port is already used by others, e.g., hive. You can change the port similarly to below: ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev <neilk

Re: Spark Job History Server

2015-03-20 Thread Zhan Zhang
Hi Patcharee, it is an alpha feature in the HDP distribution, integrating ATS with the Spark history server. If you are using upstream, you can configure spark as usual without these configurations. But other related configurations are still mandatory, such as the hdp.version-related ones. Thanks. Zhan Zhang

Re: Saving Dstream into a single file

2015-03-16 Thread Zhan Zhang
Each RDD has multiple partitions, and each of them will produce one hdfs file when saving output. I don't think you are allowed to have multiple file handlers writing to the same hdfs file. You can still load multiple files into hive tables, right? Thanks. Zhan Zhang On Mar 15, 2015, at 7:31 AM

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread Zhan Zhang
It is during function evaluation in the line search; the value is either infinite or NaN, which may be caused by a too-large step size. In the code, the step is reduced by half. Thanks. Zhan Zhang On Mar 13, 2015, at 2:41 PM, cjwang <c...@cjwang.us> wrote: I am running LogisticRegressionWithLBFGS

Re: Process time series RDD after sortByKey

2015-03-09 Thread Zhan Zhang
one partition. iterPartition += 1 } You can refer to RDD.take for an example. Thanks. Zhan Zhang On Mar 9, 2015, at 3:41 PM, Shuai Zheng <szheng.c...@gmail.com> wrote: Hi All, I am processing some time series data. For one day, it might have 500GB; then for each hour

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread Zhan Zhang
Do you mean "--hiveConf" (two dashes) instead of "-hiveconf" (one dash)? Thanks. Zhan Zhang On Mar 6, 2015, at 4:20 AM, James <alcaid1...@gmail.com> wrote: Hello, I want to execute a hql script through the `spark-sql` command; my script contains: ``` ALTER TABLE xxx DROP PARTITION

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
the link to see why the shell failed in the first place. Thanks. Zhan Zhang On Mar 6, 2015, at 9:59 AM, Todd Nist <tsind...@gmail.com> wrote: First, thanks to everyone for their assistance and recommendations. @Marcelo I applied the patch that you recommended and am now able

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Sorry, my misunderstanding. Looks like it already worked. If you still meet some hdp.version problem, you can try it :) Thanks. Zhan Zhang On Mar 6, 2015, at 11:40 AM, Zhan Zhang <zzh...@hortonworks.com> wrote: You are using 1.2.1, right? If so, please add java-opts

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
You are using 1.2.1, right? If so, please add java-opts in the conf directory and give it a try. [root@c6401 conf]# more java-opts -Dhdp.version=2.2.2.0-2041 Thanks. Zhan Zhang On Mar 6, 2015, at 11:35 AM, Todd Nist <tsind...@gmail.com> wrote: -Dhdp.version=2.2.0.0

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Zhan Zhang
/ Thanks. Zhan Zhang On Mar 5, 2015, at 11:09 AM, Marcelo Vanzin <van...@cloudera.com> wrote: It seems from the excerpt below that your cluster is set up to use the Yarn ATS, and the code is failing in that path. I think you'll need to apply the following patch to your

Re: RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Zhan Zhang
It uses HashPartitioner to distribute the records to different partitions, and the keys are just integers spread evenly across the output partitions. From the code, each resulting partition will get a very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li <l...@yahoo

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Zhan Zhang
: org.apache.spark.sql.SchemaRDD = SchemaRDD[3] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == Filter Contains(value#5, Restaurant) HiveTableScan [key#4,value#5], (MetastoreRelation default, testtable, None), None scala> Thanks. Zhan Zhang On Mar 4, 2015, at 9:09 AM, Anusha Shamanur <anushas

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Zhan Zhang
Do you have enough resources in your cluster? You can check your resource manager to see the usage. Thanks. Zhan Zhang On Mar 3, 2015, at 8:51 AM, abhi <abhishek...@gmail.com> wrote: I am trying to run the below java class with yarn cluster, but it hangs in accepted

Re: Resource manager UI for Spark applications

2015-03-03 Thread Zhan Zhang
In Yarn (cluster or client), you can access the spark ui while the app is running. After the app is done, you can still access it, but it needs some extra setup for the history server. Thanks. Zhan Zhang On Mar 3, 2015, at 10:08 AM, Ted Yu <yuzhih...@gmail.com> wrote: bq

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
You don't need to know the rdd dependencies to maximize the dependencies. Internally, the scheduler will construct the DAG and trigger the execution if there are no shuffle dependencies between the RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet <cjno...@gmail.com> wrote: Let's say I'm

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
What confused me is the statement "The final result is that rdd1 is calculated twice." Is that the expected behavior? Thanks. Zhan Zhang On Feb 26, 2015, at 3:03 PM, Sean Owen <so...@cloudera.com> wrote: To distill this a bit further, I don't think you actually want

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
.saveAsHadoopFile(…)] In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls will happen concurrently. Thanks. Zhan Zhang On Feb 26, 2015, at 3:28 PM, Corey Nolet <cjno...@gmail.com> wrote: What confused me is the statement "The final result is that rdd1
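A sketch of the threading part (paths and transformations are made up; the cache() is what ensures rdd1 is computed only once):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val rdd1 = sc.textFile("/in/data").map(_.toUpperCase)
    rdd1.cache()  // computed once, reused by both jobs below

    // Each save is a separate job; submitting them from separate threads
    // lets the scheduler run them concurrently.
    val saves = Seq(
      Future { rdd1.filter(_.startsWith("A")).saveAsTextFile("/out/a") },
      Future { rdd1.filter(_.startsWith("B")).saveAsTextFile("/out/b") }
    )
    Await.result(Future.sequence(saves), Duration.Inf)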

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
cores sitting idle. OOM: increasing the memory size and the JVM memory overhead may help here. Thanks. Zhan Zhang On Feb 26, 2015, at 2:03 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote: Imran, I have also observed the phenomenon of reducing the cores helping

Re: Running spark function on parquet without sql

2015-02-26 Thread Zhan Zhang
When you use sql (or the API from SchemaRDD/DataFrame) to read data from parquet, the optimizer will do column pruning, predicate pushdown, etc. Thus you get the benefits of the parquet columnar format. After that, you can operate on the SchemaRDD (DF) like a regular RDD. Thanks. Zhan Zhang On Feb 26

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
Currently in spark, it looks like there is no easy way to know the dependencies; they are resolved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet <cjno...@gmail.com> wrote: Ted, that one I know. It was the dependency part I was curious about. On Feb

Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
context initiates YarnClusterSchedulerBackend instead of YarnClientSchedulerBackend, which I think is the root cause. Thanks. Zhan Zhang On Feb 25, 2015, at 1:53 PM, Zhan Zhang <zzh...@hortonworks.com> wrote: Hi Mate, when you initialize the JavaSparkContext, you don't

Re: Can't access remote Hive table from spark

2015-02-12 Thread Zhan Zhang
When you log in, you have root access. Then you can do "su hdfs" or switch to any other account, and then create hdfs directories, change permissions, etc. Thanks Zhan Zhang On Feb 11, 2015, at 11:28 PM, guxiaobo1982 <guxiaobo1...@qq.com> wrote: Hi Zhan, Yes, I found

Re: Can't access remote Hive table from spark

2015-02-11 Thread Zhan Zhang
You need to have the right hdfs account, e.g., hdfs, to create directories and assign permissions. Thanks. Zhan Zhang On Feb 11, 2015, at 4:34 AM, guxiaobo1982 <guxiaobo1...@qq.com> wrote: Hi Zhan, My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried

Re: Can't access remote Hive table from spark

2015-02-07 Thread Zhan Zhang
Yes. You need to create xiaobogu under /user and give the right permission to xiaobogu. Thanks. Zhan Zhang On Feb 7, 2015, at 8:15 AM, guxiaobo1982 <guxiaobo1...@qq.com> wrote: Hi Zhan Zhang, With the pre-built version 1.2.0 of spark against the yarn cluster installed

Re: Can't access remote Hive table from spark

2015-02-05 Thread Zhan Zhang
Not sure about spark standalone mode, but on spark-on-yarn it should work. You can check the following link: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ Thanks. Zhan Zhang On Feb 5, 2015, at 5:02 PM, Cheng Lian <lian.cs@gmail.com> wrote: Please note

Re: Spark impersonation

2015-02-02 Thread Zhan Zhang
I think you can configure hadoop/hive to do impersonation. There is no difference between a secure and an insecure hadoop cluster when using kinit. Thanks. Zhan Zhang On Feb 2, 2015, at 9:32 PM, Koert Kuipers <ko...@tresata.com> wrote: yes, jobs run as the user that launched

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Zhan Zhang
You are running yarn-client mode. How about increasing the --driver-memory and giving it a try? Thanks. Zhan Zhang On Jan 29, 2015, at 6:36 PM, QiuxuanZhu <ilsh1...@gmail.com> wrote: Dear all, I have no idea why it raises an error when I run the following code. def

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-29 Thread Zhan Zhang
I think it is expected. Refer to the comments in saveAsTable: "Note that this currently only works with SchemaRDDs that are created from a HiveContext". If I understand correctly, here the SchemaRDD means those generated by HiveContext.sql, instead of applySchema. Thanks. Zhan Zhang On Jan 29

Re: Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread Zhan Zhang
You can put hive-site.xml in your conf/ directory. It will connect to Hive when HiveContext is initialized. Thanks. Zhan Zhang On Jan 21, 2015, at 12:35 PM, YaoPau <jonrgr...@gmail.com> wrote: Is this possible, and if so what steps do I need to take to make this happen?
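Once hive-site.xml is in conf/, the usual Spark 1.x flow is simply:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Connects to the metastore configured in conf/hive-site.xml.
    hiveContext.sql("SHOW TABLES").collect().foreach(println)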

OOM for HiveFromSpark example

2015-01-13 Thread Zhan Zhang
Hi Folks, I am trying to run hive context in yarn-cluster mode, but met some error. Does anybody know what causes the issue? I use the following cmd to build the distribution: ./make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 15/01/13 17:59:42 INFO

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Zhan Zhang
I think it is overflow. The training data is quite big. The algorithm's scalability highly depends on the vocabSize. Even without overflow, there are still other bottlenecks, for example, syn0Global and syn1Global; each of them has vocabSize * vectorSize elements. Thanks. Zhan Zhang On Jan

Re: Spark 1.2 + Avro file does not work in HDP2.2

2014-12-16 Thread Zhan Zhang
Hi Manas, there is a small patch needed for HDP 2.2. You can refer to this PR: https://github.com/apache/spark/pull/3409 There are some other issues compiling against hadoop 2.6, but we will fully support it very soon. You can ping me if you want. Thanks. Zhan Zhang On Dec 12, 2014, at 11:38

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Zhan Zhang
Please check whether https://github.com/apache/spark/pull/3409#issuecomment-64045677 solves the problem for launching the AM. Thanks. Zhan Zhang On Dec 1, 2014, at 4:49 PM, Mohammad Islam <misla...@yahoo.com.INVALID> wrote: Hi, How do I pass Java options (such as -XX:MaxMetaspaceSize=100M) when

Re: Spark SQL Hive Version

2014-11-05 Thread Zhan Zhang
. You can refer to https://github.com/apache/spark/pull/2685 for the whole story. Thanks. Zhan Zhang On Nov 5, 2014, at 4:47 PM, Cheng, Hao <hao.ch...@intel.com> wrote: Hi, all, I noticed that when compiling SparkSQL with the profile "hive-0.13.1", it will fetch the Hive

Re: Use RDD like a Iterator

2014-10-30 Thread Zhan Zhang
] = { sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p), allowLocal = false).head } (0 until partitions.length).iterator.flatMap(i => collectPartition(i)) } Thanks. Zhan Zhang On Oct 29, 2014, at 3:43 AM, Yanbo Liang <yanboha...@gmail.com> wrote: RDD.toLocalIterator

Re: run multiple spark applications in parallel

2014-10-28 Thread Zhan Zhang
You can set your executor number with --num-executors. Also, changing to yarn-client saves you one container for the driver. Then check your yarn resource manager to make sure there are more containers available to serve your extra apps. Thanks. Zhan Zhang On Oct 28, 2014, at 5:31 PM, Soumya Simanta

Re: Use RDD like a Iterator

2014-10-28 Thread Zhan Zhang
I think it is already lazily computed, or do you mean something else? Following is the signature of compute in RDD: def compute(split: Partition, context: TaskContext): Iterator[T] Thanks. Zhan Zhang On Oct 28, 2014, at 8:15 PM, Dai, Kevin <yun...@ebay.com> wrote: Hi, ALL, I have a RDD[T

Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Zhan Zhang
Can you use row(i).asInstanceOf[]? Thanks. Zhan Zhang On Oct 28, 2014, at 5:03 PM, Mohammed Guller <moham...@glassbeam.com> wrote: Hi – The Spark SQL Row class has methods such as getInt, getLong, getBoolean, getFloat, getDouble, etc. However, I don't see a getDate method. So how can
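For example (a sketch; the column index is made up, assuming that column holds a timestamp):

    import java.sql.Timestamp

    // Generic accessor plus a cast:
    val ts = row(2).asInstanceOf[Timestamp]

    // Later 1.x releases added a typed getter that does the same:
    val ts2 = row.getAs[Timestamp](2)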

Re: sortByKey trouble

2014-09-24 Thread Zhan Zhang
Try this: import org.apache.spark.SparkContext._ Thanks. Zhan Zhang On Sep 24, 2014, at 6:13 AM, david <david...@free.fr> wrote: thanks, I've already tried this solution but it does not compile (in Eclipse). I'm surprised to see that in Spark-shell, sortByKey works fine on 2 solutions
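A minimal reproduction of the fix. The import brings the implicit conversion to PairRDDFunctions into scope, which is what the Eclipse-compiled code was missing (in Spark 1.3+ the implicit is found automatically and the import is no longer needed):

    import org.apache.spark.SparkContext._  // enables sortByKey on RDDs of pairs

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    pairs.sortByKey().collect().foreach(println)  // (1,a), (2,b), (3,c)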

Re: Converting one RDD to another

2014-09-23 Thread Zhan Zhang
Here is my understanding: def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = { if (num == 0) { // if 0, return an empty array Array.empty } else { mapPartitions { items => // map each partition to a new one whose iterator consists of a single queue,
