Hi all,
We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we
find that a simple COUNT(*) query is much slower (about 100x) than on Spark
1.2.
I found that most of the time is spent on the driver fetching HDFS blocks,
with a large number of log lines like the one below printed:
15/03/30 23:03:43 DEBUG
You can add an internal IP to public hostname mapping in your /etc/hosts
file; if your forwarding is set up properly then it won't be a problem
thereafter.
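For example, one entry per node of this form (the address and hostname below are just placeholders):
10.0.0.12   ec2-54-xx-xx-xx.compute-1.amazonaws.com
10.0.0.13   ec2-54-yy-yy-yy.compute-1.amazonaws.com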
Thanks
Best Regards
On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com wrote:
Hi,
For security reasons, we added a server between
I have updated my Spark source code to 1.3.1.
The checkpoint works well.
BUT the shuffle data still cannot be deleted automatically… the disk usage is
still 30TB…
I have set the spark.cleaner.referenceTracking.blocking.shuffle to true.
Do you know how to solve my problem?
Sendong Li
Yes, this private is checked at compile time and my class is in a subpackage
of org.apache.spark.ui, so the visibility is not the issue, or at least not
as far as I can tell.
Which Spark and Hive release are you using?
Thanks
On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote:
Hi.
In HiveContext, when I run the statement DROP TABLE IF EXISTS TestTable
and TestTable doesn't exist, Spark returns an error:
ERROR Hive:
Hello Udit,
Yes, what you ask is possible. If you follow the Spark documentation and
tutorial about how to build stand-alone applications, you can see that it
is possible to build a stand-alone, über-JAR file that includes everything.
For example, if you want to suppress some messages by
My guess is that the `createDataFrame` call is failing here. Can you check
if the schema being passed to it includes the column name and type for the
newly zipped `features` column?
Joseph probably knows this better, but AFAIK the DenseVector here will need
to be marked as a VectorUDT while
Hi Ted.
Spark 1.2.0 and Hive 0.13.1
Regards.
Miguel Angel.
On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu yuzhih...@gmail.com wrote:
Which Spark and Hive release are you using?
Thanks
On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote:
Hi.
In HiveContext, when I put this
Thank you, @GuoQiang
I will try to add runGC() to the ALS.scala, and if it works for deleting the
shuffle data, I will tell you :-)
On Mar 31, 2015, at 4:47, GuoQiang Li wi...@qq.com wrote:
You can try to enforce garbage collection:
/** Run GC and make sure it actually has run */
def
Hi, I set up a standalone cluster of 5 machines (tmaster, tslave1,2,3,4) with
spark-1.3.0-cdh5.4.0-snapshot. When I execute sbin/start-all.sh, the
master is OK, but I can't see the web UI. Moreover, the worker logs look
something like this:
Spark assembly has been built with Hive, including
Following your suggestion, I ended up with the following implementation:
override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  val schema = transformSchema(dataSet.schema, paramMap, logging = true)
  val map = this.paramMap ++ paramMap
  val features =
In my transformSchema I do specify that the output column type is a VectorUDT:
override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false))
Hi all,
A DataFrame with a user defined type (here mllib.Vector) created with
sqlContext.createDataFrame can't be saved to a parquet file and raises a
ClassCastException:
org.apache.spark.mllib.linalg.DenseVector cannot be cast to
org.apache.spark.sql.Row
Here is an example of code to reproduce
Yes, we have just modified the configuration, and everything works fine.
Thanks very much for the help.
On Thu, Mar 19, 2015 at 5:24 PM, Ted Yu yuzhih...@gmail.com wrote:
For YARN, possibly this one ?
<property>
  <name>yarn.nodemanager.local-dirs</name>
Well, those are only the logs of the slaves at the Mesos level. I'm not sure
from your reply whether you can ssh into a specific slave or not; if you can,
you should look at the actual output of the application (Spark in this case)
on a slave in e.g.
Hello,
@Akhil Das I'm trying to use the experimental API
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
How can we implement a BroadcastHashJoin for Spark with Python?
My SparkSQL inner joins are taking a lot of time since they are performing a
ShuffledHashJoin.
The tables on which the join is performed are stored as parquet files.
Please help.
Thanks and regards,
Jitesh
You can use a broadcast variable.
See also this thread:
http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+
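The same approach works from PySpark via sc.broadcast. A rough Scala sketch of the idea (untested; the names are placeholders, and the small side must fit in memory):
val smallMap = smallRdd.collect().toMap      // (key, value) pairs of the small RDD
val bcast = sc.broadcast(smallMap)
val joined = bigRdd.flatMap { case (k, v) =>
  bcast.value.get(k).map(w => (k, (v, w)))   // emit a pair only when the key exists on both sides
}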
On Mar 31, 2015, at 4:43 AM, Peng Xia sparkpeng...@gmail.com wrote:
Hi,
I have an RDD (rdd1) where each line is split into an array [a,
I had always understood the formulation to be the first option you
describe. Lambda is scaled by the number of items the user has rated /
interacted with. I think the goal is to avoid fitting the tastes of
prolific users disproportionately just because they have many ratings
to fit. This is what's
You can use HiveContext instead of SQLContext; it should support all of
HiveQL, including LATERAL VIEW explode.
SQLContext does not support that yet.
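For example, a minimal sketch (untested; the events table and its items array column are made up):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.sql("""SELECT id, item
          FROM events
          LATERAL VIEW explode(items) itemTable AS item""").show()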
BTW, nice coding format in the email.
Yong
Date: Tue, 31 Mar 2015 18:18:19 -0400
Subject: Re: SparkSql -
Thanks hbogert. There it is, plain as day; it can't find my Spark binaries.
I thought it was enough to set SPARK_EXECUTOR_URI in my spark-env.sh, since
that is all that's necessary to run spark-shell against a Mesos master,
but I also had to set spark.executor.uri in my spark-defaults.conf (or
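For reference, the kind of spark-defaults.conf entry meant here (the URI below is just a placeholder):
spark.executor.uri   hdfs://namenode:9000/dist/spark-1.3.0-bin-hadoop2.4.tgz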
Hi,
I am using the spark-1.3 prebuilt release with hadoop2.4 support and Hadoop 2.4.0.
I wrote a Spark application (LoadApp) to generate data in each task and load the
data into HDFS as parquet files (using “saveAsParquet()” in Spark SQL).
When few waves (1 or 2) are used in a job, LoadApp could
Thanks for the log. It's really helpful. I created a JIRA to explain why it
happens: https://issues.apache.org/jira/browse/SPARK-6640
However, does this error always happen in your environment?
Best Regards,
Shixiong Zhu
2015-03-31 22:36 GMT+08:00 sparkdi shopaddr1...@dubna.us:
This is
Hi,
Recently I gave a talk on the RDD data structure which gives an in-depth
understanding of Spark internals. You can watch it on YouTube
https://www.youtube.com/watch?v=WVdyuVwWcBc. Also slides are on slideshare
http://www.slideshare.net/datamantra/anatomy-of-rdd and code is on github
In my experiment, if I do not call gc() explicitly, the shuffle files will not
be cleaned until the whole job finishes… I don't know why; maybe the RDDs
cannot be GCed implicitly.
In my situation, a full GC in the driver takes about 10 seconds, so I start a
thread in the driver to do GC like this:
I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-6637. Since we don't have
a clear answer about how the scaling should be handled, maybe the best
solution for now is to switch back to the 1.2 scaling. -Xiangrui
On Tue, Mar 31, 2015 at 2:50 PM, Sean Owen so...@cloudera.com
Hi Michael,
Thanks for your response. I am running 1.2.1.
Is there any workaround to achieve the same with 1.2.1?
Thanks,
Jitesh
On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust mich...@databricks.com
wrote:
In Spark 1.3 I would expect this to happen automatically when the parquet
table is
You can do something like:
df.collect().map {
  case Row(name: String, age1: Int, age2: Int) => ...
}
On Tue, Mar 31, 2015 at 4:05 PM, roni roni.epi...@gmail.com wrote:
I have 2 parquet files with format e.g. name, age, town
I read them and then join them to get all the names which are in
Hi All,
I am running Spark MR on Mesos. Is there a configuration setting for
Spark to define the minimum required slots (similar to MapReduce's
mapred.mesos.total.reduce.slots.minimum and mapred.mesos.total.map.slots.
minimum)? The most related property I see is this: spark.scheduler.
Can you show us the output of DStream#print() if you have it?
Thanks
On Tue, Mar 31, 2015 at 2:55 AM, Nicolas Phung nicolas.ph...@gmail.com
wrote:
Hello,
@Akhil Das I'm trying to use the experimental API
It works, thanks for your great help.
On Mon, Mar 30, 2015 at 10:07 PM, Denny Lee denny.g@gmail.com wrote:
Hi Vincent,
This may be a case of a missing semicolon after your CREATE
TEMPORARY TABLE statement. I ran your original statement (missing the
semicolon) and got the same
RDD union will result in:
1 2
3 4
5 6
7 8
9 10
11 12
What you are trying to do is a join.
There must be some logic/key to perform the join operation;
I think in your case you want the order (index) to be the joining key here.
An RDD is a distributed data structure and is not apt for your
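If the two RDDs really do line up element for element, a rough sketch of using the index as the key (untested):
val left  = sc.parallelize(Seq(1, 3, 5, 7, 9, 11))
val right = sc.parallelize(Seq(2, 4, 6, 8, 10, 12))
// key each element by its position, join on that index, then restore the order
val joined = left.zipWithIndex.map(_.swap)
  .join(right.zipWithIndex.map(_.swap))
  .sortByKey()
  .values                                    // (1,2), (3,4), (5,6), ...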
This is the whole output from the shell:
~/spark-1.3.0-bin-hadoop2.4$ sudo bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please
(resending...)
I was thinking the same setup… But the more I think about this problem,
the more interesting it becomes.
If we allocate 50% total memory to Tachyon statically, then the Mesos
benefits of dynamically scheduling resources go away altogether.
Can Tachyon be resource managed by
Hi all,
Running the following code snippet through spark-shell, I cannot see any
cached storage partitions in the web UI.
Does this mean that the cache is not working? Because if we issue person.count
again, we cannot see any reduction in the time it takes.
Hope anyone can explain this.
Hi All,
I am quite new to Spark. So please pardon me if it is a very basic question.
I have set up a Hadoop cluster using Hortonworks' Ambari. It has 1 Master and
3 Worker nodes. Currently, it has HDFS, Yarn, MapReduce2, HBase and
ZooKeeper services installed.
Now, I want to install Spark on
GuoQiang's method works very well.
It only takes 1TB of disk now.
Thank you very much!
On Mar 31, 2015, at 4:47, GuoQiang Li wi...@qq.com wrote:
You can try to enforce garbage collection:
/** Run GC and make sure it actually has run */
def runGC() {
val weakRef = new
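For reference, a GC-forcing helper of that shape typically looks something like this (a sketch, not necessarily the exact code being quoted):
import java.lang.ref.WeakReference

/** Run GC and make sure it actually has run */
def runGC(): Unit = {
  val weakRef = new WeakReference(new Object())
  val startTime = System.currentTimeMillis
  System.gc()                      // best effort; the JVM usually honours it
  System.runFinalization()
  while (weakRef.get != null) {    // loop until the weakly referenced object is collected
    System.gc()
    System.runFinalization()
    Thread.sleep(200)
    if (System.currentTimeMillis - startTime > 10000) {
      throw new Exception("GC did not run within 10 seconds")
    }
  }
}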
Hi Ted,
Thanks very much, yea, using broadcast is much faster.
Best,
Peng
On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu yuzhih...@gmail.com wrote:
You can use broadcast variable.
See also this thread:
Hi,
I save Parquet files in a partitioned table, so in a path looking like
/path/to/table/myfield=a/ .
But I also kept the field myfield in the Parquet data. Thus, when I query
the field, I get this error:
df.select(myfield).show(10)
Exception in thread main
I have noticed a similar issue when using Spark Streaming. The Spark
shuffle write size increases to a large size (in GBs) and then the app
crashes saying:
java.io.FileNotFoundException:
Yep, it's not serializable:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html
You can't return this from a distributed operation since that would
mean it has to travel over the network and you haven't supplied any
way to convert the thing into bytes.
On Tue, Mar 31,
In Spark 1.3 I would expect this to happen automatically when the parquet
table is small (< 10MB, configurable with
spark.sql.autoBroadcastJoinThreshold).
If you are running 1.3 and not seeing this, can you show the code you are
using to create the table?
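If the table is a bit over the default, the threshold can be raised, e.g. (a sketch; the value is in bytes):
// broadcast parquet tables up to ~50MB instead of the ~10MB default
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)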
On Tue, Mar 31, 2015 at 3:25 AM, jitesh129
I am accessing ElasticSearch via elasticsearch-hadoop and attempting to
expose it via SparkSQL. I am using spark 1.2.1, latest supported by
elasticsearch-hadoop, and org.elasticsearch % elasticsearch-hadoop %
2.1.0.BUILD-SNAPSHOT of elasticsearch-hadoop. I’m
encountering an issue when I
When I am trying to get the result from HBase and running the mapToPair
function on the RDD, it gives the error
java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result
Here is the code
// private static JavaPairRDD<Integer, Result>
getCompanyDataRDD(JavaSparkContext sc) throws IOException
Hi Akhil,
I tried editing the /etc/hosts on the master and on the workers, and it
seems it is not working for me.
I tried adding hostname internal-ip and it didn't work. I then tried
adding internal-ip hostname and it didn't work either. I guess I should
also edit the spark-env.sh file?
Thanks!
It's pretty simple: pick one machine as the master (say machine A), and let's
call the workers B, C, and D.
*Login to A:*
- Enable passwordless authentication (ssh-keygen)
- Add A's ~/.ssh/id_rsa.pub to B,C,D's ~/.ssh/authorized_keys file
- Download spark binary (that supports your hadoop
Did you try setting the SPARK_MASTER_IP parameter in spark-env.sh?
On 31.3.2015. 19:19, Anny Chen wrote:
Hi Akhil,
I tried editing the /etc/hosts on the master and on the workers, and
seems it is not working for me.
I tried adding hostname internal-ip and it didn't work. I then
tried
When you say you added internal-ip hostname, were you able to ping any
of these from the machine?
You could try setting SPARK_LOCAL_IP on all machines. But make sure you
will be able to bind to that host/ip specified there.
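That is, something like this in conf/spark-env.sh on each machine (the address below is a placeholder):
export SPARK_LOCAL_IP=10.0.0.12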
Thanks
Best Regards
On Tue, Mar 31, 2015 at 10:49 PM, Anny Chen
The example in
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
might help
Best,
--
Nan Zhu
http://codingcat.me
On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote:
Yep, it's not serializable:
Hmmm... could you try to set the log dir to
file:/home/hduser/spark/spark-events?
I checked the code and it might be the case that the behaviour changed
between 1.2 and 1.3...
On Mon, Mar 30, 2015 at 6:44 PM, Tom Hubregtsen thubregt...@gmail.com wrote:
The stack trace for the first scenario and
Thanks for the reply.
This will reduce the shuffle write to disk to an extent but for a long
running job(multiple days), the shuffle write would still occupy a lot of
space on disk. Why do we need to store the data from older map tasks to
memory?
On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak
The Spark sort-based shuffle (the default from 1.1) keeps the data from
each map task in memory until it can't fit, after which it is sorted and
spilled to disk. You can reduce the shuffle write to
disk by increasing spark.shuffle.memoryFraction (default 0.2).
By writing the shuffle output
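For example, a sketch of raising that fraction (tune it to your workload):
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.shuffle.memoryFraction", "0.4")  // default is 0.2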
Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the
first query is a very expensive query (ex: ‘select *’ on a really big data
set) than any subsequent query seem to get blocked. I would have expected
the second query to run in parallel since I am using the ‘fair’ scheduler
We verified it runs on x86, and are now trying to run it on PowerPC. We
currently run into dependency trouble with sbt. I tried installing sbt by
hand and resolving all dependencies by hand, but must have made an error, as
I still get errors.
Original error:
Getting org.scala-sbt sbt 0.13.6 ...
Hi,
If I recall correctly, I've seen people integrate REST calls into Spark
Streaming jobs on the user list. I can't think of any reason why it
shouldn't be possible.
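As a rough sketch of that kind of lookup (untested; Record, records and the endpoint URL are made up for illustration):
import scala.io.Source

case class Record(id: String, value: Option[String])

// call the REST service only for the records that are actually missing data
val augmented = records.map { rec =>
  rec.value match {
    case Some(_) => rec
    case None =>
      val body = Source.fromURL(s"http://rest.example.com/lookup?id=${rec.id}").mkString
      rec.copy(value = Some(body.trim))
  }
}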
Best,
Burak
On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir minnown...@gmail.com wrote:
We have have some data on Hadoop that
Hi,
I save Parquet files in a partitioned table, so in /path/to/table/myfield=a/ .
But I also kept the field myfield in the Parquet data. Thus, when I query the
field, I get this error:
df.select(myfield).show(10)
Exception in thread main org.apache.spark.sql.AnalysisException: Ambiguous
The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print x.zip(y).collect()
The error that is produced depends on whether multiple partitions have been
specified or not.
I understand that
the two RDDs [must]
Hey Guoqiang and Sendong,
Could you comment on the overhead of calling gc() explicitly? The shuffle
files should get cleaned in a few seconds after checkpointing, but it is
certainly possible to accumulate TBs of files in a few seconds. In this
case, calling gc() may work the same as waiting for
Hi ,
I have 4 parquet files and I want to find data which is common in all of
them
e.g
SELECT TableA.*, TableB.*, TableC.*, TableD.* FROM (TableB INNER JOIN TableA
ON TableB.aID = TableA.aID)
INNER JOIN TableC ON (TableB.cID = TableC.cID)
INNER JOIN TableA ta ON (ta.dID = TableD.dID)
I cannot reproduce this error on master, but I'm not aware of any
recent bug fixes that are related. Could you build and try the current
master? -Xiangrui
On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
DataFrame with an user defined type (here mllib.Vector)
use zip
Hi,
I am using Spark 1.2.1.
I am using the thrift server to query my data.
While executing the query CACHE TABLE tablename,
it fails with the exception:
Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in
stage
Hi,
I am fairly new to the Spark ecosystem and I have been trying to set up
a Spark on Mesos deployment. I can't seem to figure out the best
practices around HDFS and Tachyon. The documentation's data-locality
section seems to suggest that
Tachyon should be co-located with Spark in this case.
Best,
Haoyuan
On Tue, Mar 31, 2015 at 4:30 PM, Ankur Chauhan achau...@brightcove.com
wrote:
Hi,
I am fairly new to the spark ecosystem and I have been trying to setup
a spark on mesos
Hi Haoyuan,
So on each mesos slave node I should allocate/section off some amount
of memory for tachyon (let's say 50% of the total memory) and the rest
for regular mesos tasks?
This means, on each slave node I would have tachyon worker (+ hdfs
Here are a few ways to achieve what you're looking to do:
https://github.com/cjnolet/spark-jetty-server
Spark Job Server - https://github.com/spark-jobserver/spark-jobserver -
defines a REST API for Spark
Hue -
Hey Sean,
That is true for the explicit model, but not for the implicit one. The ALS-WR
paper doesn't cover the implicit model. In the implicit formulation, a
sub-problem (for v_j) is:
min_{v_j} \sum_i c_ij (p_ij - u_i^T v_j)^2 + lambda * X * \|v_j\|_2^2
This is a sum over all i, not just the users who rate
Hi Udit,
The persisted RDDs in memory are cleared by Spark using an LRU policy, and you
can also set the time to clear the persisted RDDs and metadata by setting
spark.cleaner.ttl (default: infinite). But I am not aware of any
properties to clean the older shuffle writes from disk.
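For example (a sketch; the value is in seconds):
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.cleaner.ttl", "3600")  // clear old metadata after one hour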
thanks,
So in looking at this a bit more, I gather the root cause is the fact that
the nested fields are represented as rows within rows, is that correct? If
I don't know the size of the json array (it varies), using
x.getAs[Row](0).getString(0) is not really a valid solution.
Is the solution to apply a
Hi All,
Below is my shell script:
/home/hadoop/spark/bin/spark-submit --driver-memory=5G --executor-memory=40G
--master yarn-client --class com.***.FinancialEngineExecutor
/home/hadoop/lib/my.jar s3://bucket/vriscBatchConf.properties
My driver will load some resources and then
Jeetendra:
Please extract the information you need from Result and return the
extracted portion - instead of returning Result itself.
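For example, something along these lines (shown in Scala for brevity; untested, and the column family/qualifier and the hbaseRdd name are placeholders):
import org.apache.hadoop.hbase.util.Bytes

val companyData = hbaseRdd.map { case (rowKey, result) =>
  val key  = Bytes.toString(rowKey.get())
  val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
  (key, name)    // plain Strings are serializable, unlike Result
}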
Cheers
On Tue, Mar 31, 2015 at 1:14 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
The example in
We have some data on Hadoop that needs to be augmented with data only
available to us via a REST service. We're using Spark to search for, and
correct, missing data. Even though there are a lot of records to scour for
missing data, the total number of calls to the service is expected to be
low, so
I have 2 parquet files with format e.g. name, age, town
I read them and then join them to get all the names which are in both
towns.
the resultant dataset is
res4: Array[org.apache.spark.sql.Row] = Array([name1, age1,
town1,name2,age2,town2])
Name1 and name2 are the same since I am joining
Thanks Petar and Akhil for the suggestion.
Actually I changed the SPARK_MASTER_IP to the internal-ip, deleted the
export SPARK_PUBLIC_DNS=xx line in the spark-env.sh and also edited
the /etc/hosts as Akhil suggested, and now it is working! However I don't
know which change actually makes it