struct(1, 2). Please check how the Ordering is
implemented in InterpretedOrdering.
The output itself does not have any ordering. I am not sure why the unit test
and the real environment behave differently.
Xiao,
I do see the difference between the unit test and the local cluster run. Do you
know the reaso
You can try this
https://github.com/hortonworks/shc.git
or here
http://spark-packages.org/package/zhzhan/shc
Currently it is in the process of merging into HBase.
Thanks.
Zhan Zhang
On Apr 21, 2016, at 8:44 AM, Benjamin Kim
<bbuil...@gmail.com> wr
INSERT OVERWRITE will overwrite any existing data in the table or partition
* unless IF NOT EXISTS is provided for a partition (as of Hive
0.9.0<https://issues.apache.org/jira/browse/HIVE-2612>).
Thanks.
Zhan Zhang
On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak
<bkpat..
You can define your own udf, following is one example
Thanks
Zhan Zhang
val foo = udf((a: Int, b: String) => a.toString + b)
checkAnswer(
// SELECT *, foo(key, value) FROM testData
testData.select($"*", foo('key, 'value)).limit(3),
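The closure handed to udf above is an ordinary Scala function, so its behavior can be checked without a SparkSession. A minimal Spark-free sketch (checkAnswer and testData belong to Spark's test harness and are not reproduced here):

```scala
// The function wrapped by udf() is a plain Scala closure; applying it
// directly shows what the derived column will contain.
val foo = (a: Int, b: String) => a.toString + b

println(foo(1, "one")) // prints "1one"
```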
On Apr 21, 2016, at 8:51 PM, Naveen
.
Thanks.
Zhan Zhang
On Apr 20, 2016, at 1:38 AM, 李明伟
<kramer2...@126.com> wrote:
Hi
the input data size is less than 10M. The task result size should be smaller, I
think, because I am doing aggregation on the data.
At 2016-04-20 16:18:31, "Jeff Zhang&qu
You can try something like below, if you only have one column.
val rdd = parquetFile.javaRDD().map(row => row.getAs[String](0))
Thanks.
Zhan Zhang
On Apr 18, 2016, at 3:44 AM, Ramkumar V
<ramkumar.c...@gmail.com> wrote:
HI,
Any idea on this ?
What kind of OOM? Driver or executor side? You can use a core dump to find what
caused the OOM.
Thanks.
Zhan Zhang
On Apr 18, 2016, at 9:44 PM, 李明伟
<kramer2...@126.com> wrote:
Hi Samaga
Thanks very much for your reply, and sorry for the delayed reply.
Cassa
to be materialized in each partition, because
some partitions may not have enough records; sometimes a partition is even
empty.
I didn’t see any straightforward workaround for this.
Thanks.
Zhan Zhang
On Dec 23, 2015, at 5:32 PM, 汪洋
<tiandiwo...@icloud.com>
You are using embedded mode, which will create the db locally (in your case,
maybe the db has been created, but you do not have the right permission?).
To connect to remote metastore, hive-site.xml has to be correctly configured.
Thanks.
Zhan Zhang
On Dec 23, 2015, at 7:24 AM, Soni spark
Now json, parquet, orc (in HiveContext), and text are natively supported. If you
use avro or others, you have to include the package, which is not built into the
spark jar.
Thanks.
Zhan Zhang
On Dec 23, 2015, at 8:57 AM, Christopher Brady
<christopher.br...@oracle.com>
SQLContext is on the driver side, and I don’t think you can use it in executors.
How to provide lookup functionality in executors really depends on how you
would use them.
Thanks.
Zhan Zhang
On Dec 22, 2015, at 4:44 PM, SRK <swethakasire...@gmail.com> wrote:
> Hi,
>
> Can SQL
In what situation do you have such cases? If there is no shuffle, you can
collapse all these functions into one, right? In the meantime, it is not
recommended to collect all data to the driver.
Thanks.
Zhan Zhang
On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu
<zchl.j...@yahoo.com.INVALID>
application.
Thanks.
Zhan Zhang
On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu
<zchl.j...@yahoo.com.INVALID> wrote:
What is difference between repartition / collect and collapse ...
Is collapse the same costly as collect or repartition ?
Thank
it, at application run time, you can log into the container’s box and
check the container’s local cache to find whether the log file exists or not
(after the app terminates, these local cache files will be deleted as well).
Thanks.
Zhan Zhang
On Dec 18, 2015, at 7:23 AM, Kalpesh Jadhav
<kalpesh.
BTW: It is not only a Yarn-webui issue. In capacity scheduler, vcore is
ignored. If you want Yarn to honor vcore requests, you have to use
DominantResourceCalculator as Saisai suggested.
Thanks.
Zhan Zhang
On Dec 21, 2015, at 5:30 PM, Saisai Shao
<sai.sai.s...@gmail.com>
If you want dataframe support, you can refer to https://github.com/zhzhan/shc,
which I am working on to integrate to HBase upstream with existing support.
Thanks.
Zhan Zhang
On Dec 15, 2015, at 4:34 AM, censj
<ce...@lotuseed.com> wrote:
hi,fight fa
You should be able to get the logs from yarn via “yarn logs -applicationId xxx”,
where you can possibly find the cause.
Thanks.
Zhan Zhang
On Dec 15, 2015, at 11:50 AM, Eran Witkon <eranwit...@gmail.com> wrote:
> When running
> val data = sc.wholeTextFile("someDir/*") d
There are two cases here. If the container is killed by yarn, you can increase
jvm overhead. Otherwise, you have to increase the executor-memory if there is
no memory leak happening.
Thanks.
Zhan Zhang
On Dec 15, 2015, at 9:58 PM, Eran Witkon
<eranwit...@gmail.com>
I noticed that it is configurable at the job level via spark.task.cpus. Is there
any way to support it at the task level?
Thanks.
Zhan Zhang
On Dec 11, 2015, at 10:46 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
> Hi Folks,
>
> Is it possible to assign multiple core per task and how? Suppo
I think you are fetching too many results to the driver. Typically, it is not
recommended to collect much data to driver. But if you have to, you can
increase the driver memory, when submitting jobs.
Thanks.
Zhan Zhang
On Dec 11, 2015, at 6:14 AM, Tom Seddon
<mr.tom.sed...@gmail.
set if you want to
do some performance benchmark.
Thanks.
Zhan Zhang
On Dec 11, 2015, at 9:34 AM, Wei Da <xwd0...@qq.com> wrote:
Hi, all
I have done a test on different HW configurations of Spark 1.5.0. A KMeans
algorithm has been run in four dif
it makes sense to add this feature. It may seem to
make users worry about more configuration, but by default we can still do 1 core
per task, and only advanced users need to be aware of this feature.
Thanks.
Zhan Zhang
-
To unsubscribe
As Sean mentioned, you cannot refer to a local file on your remote
machines (executors). One workaround is to copy the file to all machines under
the same directory.
Thanks.
Zhan Zhang
On Dec 11, 2015, at 10:26 AM, Lin, Hao
<hao@finra.org>
When you have the following query, 'account === “acct1” will be pushed down to
generate a new query with “where account = acct1”.
Thanks.
Zhan Zhang
On Nov 18, 2015, at 11:36 AM, Eran Medan
<eran.me...@gmail.com> wrote:
I understand that the following ar
In the hive-site.xml, you can remove all configuration related to tez and give
it a try again.
Thanks.
Zhan Zhang
On Nov 10, 2015, at 10:47 PM, DaeHyun Ryu
<ry...@kr.ibm.com> wrote:
Hi folks,
I configured tez as execution engine of Hive. After done that
Thanks Ted. I am using the latest master branch. I will try your build command.
Thanks.
Zhan Zhang
On Nov 9, 2015, at 10:46 AM, Ted Yu
<yuzhih...@gmail.com> wrote:
Which branch did you perform the build with ?
I used the following comma
Hi Folks,
Does anybody meet the following issue? I use "mvn package -Phive -DskipTests”
to build the package.
Thanks.
Zhan Zhang
bin/spark-shell
...
Spark context available as sc.
error: error while loading QueryExecution, Missing dependency 'bad symbolic
reference. A sign
1:9083
HW11188:spark zzhang$
By the way, I don’t know whether there is any caveat for this walk around.
Thanks.
Zhan Zhang
On Nov 6, 2015, at 2:40 PM, Jerry Lam
<chiling...@gmail.com> wrote:
Hi Zhan,
I don’t use HiveContext features at
I agree with the minor change: adding a config to provide the option to init
SQLContext or HiveContext, with HiveContext as the default, instead of
bypassing it when hitting the Exception.
Thanks.
Zhan Zhang
On Nov 6, 2015, at 2:53 PM, Ted Yu
<yuzhih...@gmail.com>
If your assembly jar has the hive jar included, the HiveContext will be used.
Typically, HiveContext has more functionality than SQLContext. In what case do
you have to use SQLContext for something that cannot be done by HiveContext?
Thanks.
Zhan Zhang
On Nov 6, 2015, at 10:43 AM, Jerry Lam
<chiling...@gmail.
Hi Jerry,
https://issues.apache.org/jira/browse/SPARK-11562 is created for the issue.
Thanks.
Zhan Zhang
On Nov 6, 2015, at 3:01 PM, Jerry Lam
<chiling...@gmail.com> wrote:
Hi Zhan,
Thank you for providing a workaround!
I will try this out but I ag
Looks like some JVM got killed or OOM. You can check the log to see the real
causes.
Thanks.
Zhan Zhang
On Nov 3, 2015, at 9:23 AM, YaoPau
<jonrgr...@gmail.com> wrote:
java.io.FileNotFoun
Spark is a client library. You can just download the latest release or build
your own, and replace your existing one without changing your existing cluster.
Thanks.
Zhan Zhang
On Nov 3, 2015, at 3:58 PM, roni
<roni.epi...@gmail.com> wrote:
Hi S
the
JIRA number?
Thanks.
Zhan Zhang
On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra
<patcharee.thong...@uni.no> wrote:
Hi Zhan Zhang,
Is my problem (which is ORC predicate is not generated from WHERE clause even
though spark.sql.orc.filterPushdo
versions of OrcInputFormat. The hive path may use NewOrcInputFormat,
but the spark path uses OrcInputFormat.
Thanks.
Zhan Zhang
On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> wrote:
> Yes, the predicate pushdown is enabled, but still take longer time than the
>
In your case, you manually set an AND pushdown, and the predicate is right
based on your setting: leaf-0 = (EQUALS x 320)
The right way is to enable predicate pushdown as follows:
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
Thanks.
Zhan Zhang
On Oct 9
That is weird. Unfortunately, there is no debug info available on this part.
Can you please open a JIRA to add some debug information on the driver side?
Thanks.
Zhan Zhang
On Oct 9, 2015, at 10:22 AM, patcharee
<patcharee.thong...@uni.no> w
Hi Patcharee,
Did you enable the predicate pushdown in the second method?
Thanks.
Zhan Zhang
On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote:
> Hi,
>
> I am using spark sql 1.5 to query a hive table stored as partitioned orc
> file. We have the to
It should be similar to other hadoop jobs. You need hadoop configuration in
your client machine, and point the HADOOP_CONF_DIR in spark to the
configuration.
Thanks
Zhan Zhang
On Sep 22, 2015, at 6:37 PM, Zhiliang Zhu
<zchl.j...@yahoo.com.INVALID>
, the former is used
to access hdfs, and the latter is used to launch application on top of yarn.
Then in the spark-env.sh, you add export HADOOP_CONF_DIR=/etc/hadoop/conf.
Thanks.
Zhan Zhang
On Sep 22, 2015, at 8:14 PM, Zhiliang Zhu
<zchl.j...@yahoo.com> wro
.
Zhan Zhang
On Sep 22, 2015, at 7:49 PM, Zhiliang Zhu
<zchl.j...@yahoo.com> wrote:
Hi Zhan,
Thanks very much for your help comment.
I also thought it would be similar to a hadoop job submit; however, I was not
sure whether it is like that when
it comes to spar
Hi Krishna,
For the time being, you can download from upstream, and it should run OK on
HDP 2.3. For HDP-specific problems, you can ask in the Hortonworks forum.
Thanks.
Zhan Zhang
On Sep 22, 2015, at 3:42 PM, Krishna Sankar
<ksanka...@gmail.com>
It looks complicated, but I think it would work.
Thanks.
Zhan Zhang
From: Richard Eggert <richard.egg...@gmail.com>
Sent: Saturday, September 19, 2015 3:59 PM
To: User
Subject: PrunedFilteredScan does not work for UDTs and Struct fields
I defined my own rela
If you are using spark-1.4.0, probably it is caused by SPARK-8458
(https://issues.apache.org/jira/browse/SPARK-8458).
Thanks.
Zhan Zhang
On Aug 23, 2015, at 12:49 PM, lostrain A
donotlikeworkingh...@gmail.com wrote:
Ted,
Thanks for the suggestions. Actually
If you run it on yarn with a kerberos setup, you authenticate yourself via
kinit before launching the job.
Thanks.
Zhan Zhang
On Jul 28, 2015, at 8:51 PM, Anh Hong
hongnhat...@yahoo.com.INVALID wrote:
Hi,
I'd like to remotely run spark-submit from a local
One optimization is to reduce the shuffle by first aggregating locally (only
keeping the max for each name), and then doing reduceByKey.
Thanks.
Zhan Zhang
On Apr 24, 2015, at 10:03 PM, ayan guha
guha.a...@gmail.com wrote:
Here you go
t =
[[A,10,A10],[A,20,A20],[A,30,A30
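The local-aggregation advice above can be sketched with plain Scala collections standing in for partitions (a Spark-free analog; in Spark itself, rdd.reduceByKey(math.max(_, _)) performs this map-side combine automatically):

```scala
// Two "partitions" of (name, value) records.
val partitions = Seq(
  Seq(("A", 10), ("A", 20), ("B", 5)),
  Seq(("A", 30), ("B", 40))
)

// Step 1: keep only the max per name within each partition (local aggregation).
val localMax =
  partitions.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).max })

// Step 2: combine the per-partition maxima (the shuffle-side reduce).
val globalMax =
  localMax.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).max }

println(globalMax) // A -> 30, B -> 40
```

Only one (name, max) pair per partition crosses the "shuffle" boundary, regardless of how many raw records each partition held.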
Hi Udit,
By the way, do you mind sharing the whole log trace?
Thanks.
Zhan Zhang
On Apr 17, 2015, at 2:26 PM, Udit Mehta
ume...@groupon.com wrote:
I am just trying to launch a spark shell and not do anything fancy. I got the
binary distribution from apache and put
: For spark-1.3, you can use the binary distribution from apache.
Thanks.
Zhan Zhang
On Apr 17, 2015, at 2:01 PM, Udit Mehta
ume...@groupon.com wrote:
I followed the steps described above and I still get this error:
Error: Could not find or load main class
You probably want to first try the basic configuration to see whether it works,
instead of setting SPARK_JAR to point to the hdfs location. This error is
caused by ExecutorLauncher not being found on the class path, and is not HDP
specific, I think.
Thanks.
Zhan Zhang
On Apr 17, 2015, at 2:26 PM, Udit
[root@c6402 conf]#
Thanks.
Zhan Zhang
On Apr 17, 2015, at 3:09 PM, Udit Mehta
ume...@groupon.com wrote:
Hi,
This is the log trace:
https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47
On the yarn RM UI, I see :
Error: Could not find or load main class
-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
This is HDP specific question, and you can move the topic to HDP forum.
Thanks.
Zhan Zhang
On Apr 13, 2015, at 3:00 AM, Zork Sail
zorks...@gmail.com wrote:
Hi Zhan,
Alas setting:
-Dhdp.version=2.2.0.0
/spark-defaults.conf, adding the following settings.
spark.driver.extraJavaOptions -Dhdp.version=x
spark.yarn.am.extraJavaOptions -Dhdp.version=x
3. In $SPARK_HOME/java-opts, add following options.
-Dhdp.version=x
Thanks.
Zhan Zhang
On Mar 30, 2015, at 6:56 AM, Doug Balog
Hi Rares,
The number of partitions is controlled by the HDFS input format, and one file
may have multiple partitions if it consists of multiple blocks. In your case, I
think there is one file with 2 splits.
Thanks.
Zhan Zhang
On Mar 27, 2015, at 3:12 PM, Rares Vernica
rvern...@gmail.com
Probably a guava version conflict. Which spark version did you use, and which
hadoop version was it compiled against?
Thanks.
Zhan Zhang
On Mar 27, 2015, at 12:13 PM, Johnson, Dale
daljohn...@ebay.com wrote:
Yes, I could recompile the hdfs client with more logging
[]
| ShuffledRDD[2] at reduceByKey at console:25 []
+-(8) MapPartitionsRDD[1] at map at console:23 []
| ParallelCollectionRDD[0] at parallelize at console:21 []
Thanks.
Zhan Zhang
while keeping the key part untouched. Then mapValues may not
be able to do this.
Changing the code to allow this is trivial, but I don’t know whether there is
some special reason behind this.
Thanks.
Zhan Zhang
On Mar 26, 2015, at 2:49 PM, Jonathan Coveney
jcove...@gmail.com
Thanks all for the quick response.
Thanks.
Zhan Zhang
On Mar 26, 2015, at 3:14 PM, Patrick Wendell pwend...@gmail.com wrote:
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:
https://github.com/apache/spark/blob/master/core/src/main
I solved this by increasing the PermGen memory size in the driver:
-XX:MaxPermSize=512m
Thanks.
Zhan Zhang
On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)
deepuj...@gmail.com wrote:
I am facing same issue, posted a new thread. Please respond.
On Wed, Jan 14, 2015 at 4:38 AM
You can do it in $SPARK_HOME/conf/spark-defaults.con
spark.driver.extraJavaOptions -XX:MaxPermSize=512m
Thanks.
Zhan Zhang
On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
deepuj...@gmail.com wrote:
Where and how do i pass this or other JVM argument ?
-XX:MaxPermSize
You can try to set it in spark-env.sh.
# - SPARK_LOG_DIR Where log files are stored. (Default:
${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
Thanks.
Zhan Zhang
On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal
anubha...@gmail.com
Probably the port is already used by another process, e.g., hive. You can
change the port similarly to below:
./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf
hive.server2.thrift.port=10001
Thanks.
Zhan Zhang
On Mar 23, 2015, at 12:01 PM, Neil Dev
neilk
Hi Patcharee,
It is an alpha feature in the HDP distribution, integrating ATS with the Spark
history server. If you are using upstream, you can configure spark as usual
without these configurations. But other related configurations are still
mandatory, such as the hdp.version related ones.
Thanks.
Zhan Zhang
Each RDD has multiple partitions, and each of them will produce one hdfs file
when saving output. I don’t think you are allowed to have multiple file
handlers writing to the same hdfs file. You can still load multiple files into
hive tables, right?
Thanks.
Zhan Zhang
On Mar 15, 2015, at 7:31 AM
It is during function evaluation in the line search that the value becomes
either infinite or NaN, which may be caused by too large a step size. In the
code, the step is reduced by half.
Thanks.
Zhan Zhang
On Mar 13, 2015, at 2:41 PM, cjwang c...@cjwang.us wrote:
I am running LogisticRegressionWithLBFGS
one partition.
iterPartition += 1
}
You can refer to RDD.take for an example.
Thanks.
Zhan Zhang
On Mar 9, 2015, at 3:41 PM, Shuai Zheng
szheng.c...@gmail.com wrote:
Hi All,
I am processing some time series data. For one day, it might have 500GB, then
for each hour
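The partition-at-a-time pattern that RDD.take uses can be sketched Spark-free, with list chunks standing in for partitions (in Spark, each loop step would be an sc.runJob over just one partition index):

```scala
// Chunks of 3 stand in for partitions of the RDD.
val partitions = (1 to 10).toList.grouped(3).toList

val num = 5 // how many elements we want
val buf = scala.collection.mutable.ArrayBuffer[Int]()
var iterPartition = 0
// Pull in one partition at a time until we have enough elements,
// instead of materializing the whole dataset.
while (buf.size < num && iterPartition < partitions.length) {
  buf ++= partitions(iterPartition)
  iterPartition += 1
}

println(buf.take(num).toList) // List(1, 2, 3, 4, 5)
```

Only the first two "partitions" are touched here; the rest are never computed, which is the point of the incremental loop.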
Do you mean “--hiveConf” (two dashes), instead of “-hiveconf” (one dash)?
Thanks.
Zhan Zhang
On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote:
Hello,
I want to execute a hql script through `spark-sql` command, my script
contains:
```
ALTER TABLE xxx
DROP PARTITION
the link to see why the shell
failed in the first place.
Thanks.
Zhan Zhang
On Mar 6, 2015, at 9:59 AM, Todd Nist
tsind...@gmail.com wrote:
First, thanks to everyone for their assistance and recommendations.
@Marcelo
I applied the patch that you recommended and am now able
Sorry, I misunderstood. Looks like it already worked. If you still hit the
hdp.version problem, you can try it :)
Thanks.
Zhan Zhang
On Mar 6, 2015, at 11:40 AM, Zhan Zhang
zzh...@hortonworks.com wrote:
You are using 1.2.1 right? If so, please add java-opts
You are using 1.2.1 right? If so, please add java-opts in conf directory and
give it a try.
[root@c6401 conf]# more java-opts
-Dhdp.version=2.2.2.0-2041
Thanks.
Zhan Zhang
On Mar 6, 2015, at 11:35 AM, Todd Nist
tsind...@gmail.com wrote:
-Dhdp.version=2.2.0.0
/
Thanks.
Zhan Zhang
On Mar 5, 2015, at 11:09 AM, Marcelo Vanzin
van...@cloudera.com wrote:
It seems from the excerpt below that your cluster is set up to use the
Yarn ATS, and the code is failing in that path. I think you'll need to
apply the following patch to your
It uses HashPartitioner to distribute the records to different partitions, and
the integer keys are spread evenly across the output partitions.
From the code, each resulting partition will get a very similar number of
records.
Thanks.
Zhan Zhang
On Mar 4, 2015, at 3:47 PM, Du Li
l...@yahoo
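The even spread claimed above follows from HashPartitioner's partition function, which is essentially a non-negative modulo of the key's hashCode (for Int keys, hashCode is the value itself). A small sketch of that arithmetic:

```scala
// Spark's HashPartitioner computes nonNegativeMod(key.hashCode, numPartitions).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw // shift negative remainders into [0, mod)
}

val numPartitions = 8
// 1000 sequential integer keys land perfectly evenly: 125 per partition.
val counts = (0 until 1000)
  .groupBy(k => nonNegativeMod(k, numPartitions))
  .map { case (p, ks) => p -> ks.size }

println(counts.values.toList.distinct) // List(125)
```

Skewed or clustered key hashCodes would break this balance, but for dense integer keys the modulo distributes them uniformly.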
: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Filter Contains(value#5, Restaurant)
HiveTableScan [key#4,value#5], (MetastoreRelation default, testtable, None),
None
scala
Thanks.
Zhan Zhang
On Mar 4, 2015, at 9:09 AM, Anusha Shamanur
anushas
Do you have enough resources in your cluster? You can check your resource
manager to see the usage.
Thanks.
Zhan Zhang
On Mar 3, 2015, at 8:51 AM, abhi
abhishek...@gmail.com wrote:
I am trying to run below java class with yarn cluster, but it hangs in accepted
In Yarn (cluster or client mode), you can access the spark ui while the app is
running. After the app is done, you can still access it, but that needs some
extra setup for the history server.
Thanks.
Zhan Zhang
On Mar 3, 2015, at 10:08 AM, Ted Yu
yuzhih...@gmail.com wrote:
bq
You don’t need to know the rdd dependencies to maximize concurrency. Internally
the scheduler will construct the DAG and trigger the execution if there are no
shuffle dependencies between the RDDs.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm
What confused me is the statement “The final result is that rdd1 is
calculated twice.” Is it the expected behavior?
Thanks.
Zhan Zhang
On Feb 26, 2015, at 3:03 PM, Sean Owen
so...@cloudera.com wrote:
To distill this a bit further, I don't think you actually want
.saveAsHadoopFile(…)]
In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls
will happen concurrently.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 3:28 PM, Corey Nolet
cjno...@gmail.com wrote:
What confused me is the statement “The final result is that rdd1
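The compute-once, save-twice pattern can be sketched with Scala futures and a plain collection standing in for rdd1 (in Spark you would cache rdd1 first; the hypothetical sum/max actions stand in for the two saveAsHadoopFile calls):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val computeCount = new AtomicInteger(0)

// lazy val plays the role of rdd1.cache(): computed once, shared by both jobs.
lazy val rdd1 = { computeCount.incrementAndGet(); (1 to 100).toList }

// Launch the two "save" jobs concurrently from separate threads.
val jobA = Future { rdd1.sum }
val jobB = Future { rdd1.max }

val (a, b) = (Await.result(jobA, 10.seconds), Await.result(jobB, 10.seconds))
println((a, b, computeCount.get)) // (5050,100,1): the input was computed once
```

Without the caching step, each concurrent action would recompute the lineage from scratch, which is exactly the "rdd1 is calculated twice" behavior being discussed.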
cores sitting idle.
OOM: increasing the memory size and the JVM memory overhead may help here.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 2:03 PM, Yana Kadiyska
yana.kadiy...@gmail.com wrote:
Imran, I have also observed the phenomenon of reducing the cores helping
When you use sql (or the API from SchemaRDD/DataFrame) to read data from
parquet, the optimizer will do column pruning, predicate pushdown, etc. Thus
you get the benefits of parquet’s columnar format. After that, you can operate
on the SchemaRDD (DF) like a regular RDD.
Thanks.
Zhan Zhang
On Feb 26
Currently in spark, it looks like there is no easy way to know the
dependencies; they are resolved at run time.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 4:20 PM, Corey Nolet
cjno...@gmail.com wrote:
Ted. That one I know. It was the dependency part I was curious about
On Feb
context initiates
YarnClusterSchedulerBackend instead of YarnClientSchedulerBackend, which I
think is the root cause.
Thanks.
Zhan Zhang
On Feb 25, 2015, at 1:53 PM, Zhan Zhang
zzh...@hortonworks.com wrote:
Hi Mate,
When you initialize the JavaSparkContext, you don’t
When you log in, you have root access. Then you can do “su hdfs” or su to any
other account. Then you can create hdfs directories, change permissions, etc.
Thanks
Zhan Zhang
On Feb 11, 2015, at 11:28 PM, guxiaobo1982
guxiaobo1...@qq.com wrote:
Hi Zhan,
Yes, I found
You need the right hdfs account, e.g., hdfs, to create directories and
assign permissions.
Thanks.
Zhan Zhang
On Feb 11, 2015, at 4:34 AM, guxiaobo1982
guxiaobo1...@qq.com wrote:
Hi Zhan,
My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried
Yes. You need to create xiaobogu under /user and give the right permissions to
xiaobogu.
Thanks.
Zhan Zhang
On Feb 7, 2015, at 8:15 AM, guxiaobo1982
guxiaobo1...@qq.com wrote:
Hi Zhan Zhang,
With the pre-built version 1.2.0 of spark against the yarn cluster installed
Not sure about spark standalone mode, but with spark-on-yarn it should work.
You can check the following link:
http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
Thanks.
Zhan Zhang
On Feb 5, 2015, at 5:02 PM, Cheng Lian
lian.cs@gmail.com wrote:
Please note
I think you can configure hadoop/hive to do impersonation. There is no
difference between a secure and an insecure hadoop cluster when using kinit.
Thanks.
Zhan Zhang
On Feb 2, 2015, at 9:32 PM, Koert Kuipers
ko...@tresata.com wrote:
yes jobs run as the user that launched
You are running yarn-client mode. How about increasing the --driver-memory and
giving it a try?
Thanks.
Zhan Zhang
On Jan 29, 2015, at 6:36 PM, QiuxuanZhu
ilsh1...@gmail.com wrote:
Dear all,
I have no idea why it raises an error when I run the following code.
def
I think it is expected. Refer to the comment in saveAsTable: “Note that this
currently only works with SchemaRDDs that are created from a HiveContext”. If I
understand correctly, here SchemaRDD means those generated by HiveContext.sql,
instead of applySchema.
Thanks.
Zhan Zhang
On Jan 29
You can put hive-site.xml in your conf/ directory. It will connect to Hive when
HiveContext is initialized.
Thanks.
Zhan Zhang
On Jan 21, 2015, at 12:35 PM, YaoPau jonrgr...@gmail.com wrote:
Is this possible, and if so what steps do I need to take to make this happen?
Hi Folks,
I am trying to run hive context in yarn-cluster mode, but met some errors. Does
anybody know what causes the issue?
I use following cmd to build the distribution:
./make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4
15/01/13 17:59:42 INFO
I think it is overflow. The training data is quite big. The algorithm’s
scalability highly depends on vocabSize. Even without overflow, there are still
other bottlenecks; for example, syn0Global and syn1Global each have
vocabSize * vectorSize elements.
Thanks.
Zhan Zhang
On Jan
Hi Manas,
There is a small patch needed for HDP2.2. You can refer to this PR
https://github.com/apache/spark/pull/3409
There are some other issues compiling against hadoop2.6. But we will fully
support it very soon. You can ping me, if you want.
Thanks.
Zhan Zhang
On Dec 12, 2014, at 11:38
Please check whether
https://github.com/apache/spark/pull/3409#issuecomment-64045677 solves the
problem for launching the AM.
Thanks.
Zhan Zhang
On Dec 1, 2014, at 4:49 PM, Mohammad Islam misla...@yahoo.com.INVALID wrote:
Hi,
How to pass the Java options (such as -XX:MaxMetaspaceSize=100M) when
. You can refer
to https://github.com/apache/spark/pull/2685 for the whole story.
Thanks.
Zhan Zhang
Thanks.
Zhan Zhang
On Nov 5, 2014, at 4:47 PM, Cheng, Hao hao.ch...@intel.com wrote:
Hi, all, I noticed that when compiling the SparkSQL with profile
“hive-0.13.1”, it will fetch the Hive
] = {
sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p), allowLocal =
false).head
}
(0 until partitions.length).iterator.flatMap(i => collectPartition(i))
}
Thanks.
Zhan Zhang
On Oct 29, 2014, at 3:43 AM, Yanbo Liang yanboha...@gmail.com wrote:
RDD.toLocalIterator
You can set your executor number with --num-executors. Also, changing to
yarn-client saves you one container for the driver. Then check your yarn
resource manager to make sure there are more containers available to serve your
extra apps.
Thanks.
Zhan Zhang
On Oct 28, 2014, at 5:31 PM, Soumya Simanta
I think it is already lazily computed, or do you mean something else? Following
is the signature of compute in RDD:
def compute(split: Partition, context: TaskContext): Iterator[T]
Thanks.
Zhan Zhang
On Oct 28, 2014, at 8:15 PM, Dai, Kevin yun...@ebay.com wrote:
Hi, ALL
I have a RDD[T
Can you use row(i).asInstanceOf[]
Thanks.
Zhan Zhang
On Oct 28, 2014, at 5:03 PM, Mohammed Guller moham...@glassbeam.com wrote:
Hi –
The Spark SQL Row class has methods such as getInt, getLong, getBoolean,
getFloat, getDouble, etc. However, I don’t see a getDate method. So how can
Try this:
import org.apache.spark.SparkContext._
Thanks.
Zhan Zhang
On Sep 24, 2014, at 6:13 AM, david david...@free.fr wrote:
thanks
I’ve already tried this solution but it does not compile (in Eclipse).
I’m surprised to see that in spark-shell, sortByKey works fine on 2
solutions
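The Eclipse failure is consistent with the import being missing: sortByKey is not defined on RDD itself but is added to RDD[(K, V)] by an implicit conversion that the import brings into scope (spark-shell imports it automatically). The same mechanism, sketched on a plain Seq:

```scala
// An implicit class adds sortByKey to any Seq of pairs, mirroring how
// importing SparkContext._ adds the pair-RDD functions to RDD[(K, V)].
implicit class PairSeqOps[K: Ordering, V](xs: Seq[(K, V)]) {
  def sortByKey(): Seq[(K, V)] = xs.sortBy(_._1)
}

val sorted = Seq((3, "c"), (1, "a"), (2, "b")).sortByKey()
println(sorted) // List((1,a), (2,b), (3,c))
```

Remove the implicit class and the call no longer compiles, which is exactly the error seen when the SparkContext._ import is absent.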
Here is my understanding
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = {
if (num == 0) { // if 0, return an empty array
Array.empty
} else {
mapPartitions { items => // map each partition to a new one
with the iterator consisting of a single queue,
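The per-partition-queue idea the fragment describes can be sketched Spark-free: each partition keeps only its num smallest elements (the bounded queue), and a merge step takes the global num smallest (a simplification using sorted.take in place of Spark's bounded priority queue):

```scala
// Spark-free sketch of takeOrdered's two phases.
def takeOrderedSketch[T: Ordering](partitions: Seq[Seq[T]], num: Int): Seq[T] =
  if (num == 0) Seq.empty
  else {
    // Phase 1 (mapPartitions): each partition keeps at most num smallest.
    val perPartition = partitions.map(_.sorted.take(num))
    // Phase 2 (reduce): merge the partial results, keep the global num smallest.
    perPartition.flatten.sorted.take(num)
  }

val parts = Seq(Seq(9, 1, 7), Seq(4, 8), Seq(2, 6, 3))
println(takeOrderedSketch(parts, 3)) // List(1, 2, 3)
```

The bound matters: at most num elements per partition survive phase 1, so the driver-side merge never sees more than num * numPartitions elements.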