HDFS rolling upgrade in Hadoop 2.6 (available since 2.4)
http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html
Parts of NM and RM work-preserving restart were released in Hadoop
2.6.0.
YARN-1367 After restart NM should resync with the RM without killing
A word-count job on a ~1 GB text file takes 1 hour when the input comes from an
NFS mount. The same job took 30 seconds with input from the local file system.
Is there any tuning required for NFS-mounted input?
Thanks
Larry
Hi All,
When will Spark 1.2.0 be released? And what are the features in Spark
1.2.0?
Regards,
Rajesh
On Wed, Dec 17, 2014 at 11:14 AM, Andrew Ash and...@andrewash.com wrote:
Releases are roughly every 3mo so you should expect around March if the
pace stays steady.
2014-12-16 22:56
Rui is correct.
Check how many partitions your RDD has after loading the gzipped files.
e.g. rdd.getNumPartitions().
If that number is way less than the number of cores in your cluster (in
your case I suspect the number is 4), then explicitly repartition the RDD
to match the number of cores in
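For reference, a minimal Scala sketch of that check and the explicit repartition, assuming a placeholder HDFS path and core count; in the Scala API the partition count is available as rdd.partitions.size:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionGzExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-gz"))

    // Gzip files are not splittable, so each .gz file loads as exactly one partition.
    val rdd = sc.textFile("hdfs:///data/logs/*.gz")
    println(s"partitions after load: ${rdd.partitions.size}")

    // Explicitly spread the data across the cluster, e.g. one partition per core.
    val totalCores = 16
    val spread = rdd.repartition(totalCores)
    println(s"partitions after repartition: ${spread.partitions.size}")

    sc.stop()
  }
}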
Spark 1.3 does not exist. Spark 1.2 hasn't been released just yet. Which
version of Spark did you mean?
Also, from what I can see in the docs
http://spark.apache.org/docs/1.1.1/building-with-maven.html#specifying-the-hadoop-version,
I believe the latest version of Hadoop that Spark supports is
Hi,
In the Spark SQL help document, it says: "Some of these (such as indexes) are
less important due to Spark SQL's in-memory computational model. Others are
slotted for future releases of Spark SQL."
- Block level bitmap indexes and virtual columns (used to build indexes)
For our
Hi,
I am using SparkSQL on the 1.2.1 branch. The problem comes from the following
4-line code:
val t1: SchemaRDD = hiveContext.hql("select * from product where is_new = 0")
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
(hiveContext.sql("select
Spark works fine with 2.4 *and later*. The docs don't mean to imply
2.4 is the last supported version.
On Wed, Dec 17, 2014 at 10:19 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Spark 1.3 does not exist. Spark 1.2 hasn't been released just yet. Which
version of Spark did you mean?
I found the solution.
I use HADOOP_MAPRED_HOME in my environment, which clashes with Spark.
After I set HADOOP_MAPRED_HOME to empty, Spark started working.
Using TestSQLContext from multiple tests leads to:
SparkException: : Task not serializable
ERROR ContextCleaner: Error cleaning broadcast 10
java.lang.NullPointerException
at
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:246)
at
Hi,
I'm trying to use Hadoop and Spark together, but they don't work.
If I set HADOOP_MAPRED_HOME to use MRv1 or MRv2, Spark stops working.
If I set HADOOP_MAPRED_HOME to empty to use Spark, Hadoop stops working.
I use Cloudera 5.1.3.
Best regards,
Morbious
Hi,
I encountered a weird bytecode incompatibility issue between the spark-core jar
from the mvn repo and the official Spark prebuilt binary.
Steps to reproduce:
1. Download the official pre-built Spark binary 1.1.1 at
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
2. Launch
You should use the same binaries everywhere. The problem here is that
anonymous functions (potentially) get compiled to different names when you
build differently, so you actually have one function being called
when another function is meant.
On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui
Same issue. Can anyone help, please?
Spark 1.2.0 is coming in the next 48 hours according to
http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-tc9815.html
On Wed, Dec 17, 2014 at 10:11 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi All,
When will the Spark 1.2.0 be
Hello,
Say, I have the following code:
let something = Something()
someRdd.foreachRdd(something.someMethod)
And in something, I have a lazy member variable that gets created in
something.someMethod.
Would that lazy be created once per node, or once per partition?
Thanks,
Ashic.
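For illustration only, a hedged sketch of the pattern being asked about, assuming a Spark Streaming DStream (the Scala method is spelled foreachRDD); Something, someMethod, and the lazy field are placeholder names taken from the question:

import org.apache.spark.rdd.RDD

// Placeholder class mirroring the question: a serializable helper with a
// lazily initialized member.
class Something extends Serializable {
  // Initialized on first access in whichever JVM ends up touching it
  // (the driver, or an executor once the closure is shipped there).
  lazy val lookupTable: Map[Int, String] =
    (1 to 1000).map(i => i -> s"value-$i").toMap

  def someMethod(rdd: RDD[String]): Unit = {
    rdd.foreach { record =>
      // Touching lookupTable inside the task triggers its initialization
      // on the executor side.
      if (lookupTable.nonEmpty) println(record)
    }
  }
}

// Usage, mirroring the original snippet:
//   val something = new Something()
//   someDStream.foreachRDD(something.someMethod _)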
Hi All,
I have downloaded the pre-built Spark 1.1.1 for Hadoop 2.3.0, then I did mvn
install for the jar spark-assembly-1.1.1-hadoop2.3.0.jar available in the lib
folder of the Spark download, and added its dependency as follows in my
Java program:
<dependency>
  <groupId>org.apache.spark</groupId>
Patrick,
I was wondering why one would choose rdd.map vs. rdd.foreach to execute
a side-effecting function on an RDD.
-kr, Gerard.
On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote:
The second choice is better. Once you call collect() you are pulling
all of the
Hi All:
I am new to Spark. I recently checked out and built spark 1.2 RC2 as an
assembly.
I then ran spark-ec2 according to:
http://spark.apache.org/docs/latest/ec2-scripts.html
I got master and slave instances in EC2 after running
./src/spark/ec2/spark-ec2 -k mykey -i mykey.pem -s 1 launch
I would think that it has to be per worker.
On Wed, Dec 17, 2014, 6:32 PM Ashic Mahtab as...@live.com wrote:
Hello,
Say, I have the following code:
let something = Something()
someRdd.foreachRdd(something.someMethod)
And in something, I have a lazy member variable that gets created in
Nice, that is the answer I want.
Thanks!
From: Sun, Rui [mailto:rui@intel.com]
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS
Hi, Shuai,
How did you turn off the file split in
Have you seen this thread ?
http://search-hadoop.com/m/JW1q5FS8Mr1
If the problem you encountered is different, please give full stack trace.
Cheers
On Wed, Dec 17, 2014 at 5:43 AM, Amit Singh Hora hora.a...@gmail.com
wrote:
Hi All,
I have downloaded pre built Spark 1.1.1 for Hadoop 2.3.0
Why isn't it a good option to create an RDD for each 200 MB file and then apply the
pre-calculations before merging them? I think the partitioning per RDD should be
transparent to the pre-calculations, rather than fixed to optimize the
Spark map/reduce stages.
From: Shuai Zheng
I'm a newbie with Spark... a simple question:
val errorLines = lines.filter(_.contains(h))
val mapErrorLines = errorLines.map(line => (key, line))
val grouping = errorLinesValue.groupByKeyAndWindow(Seconds(8), Seconds(4))
I get something like:
604: ---
605:
Hi spark users,
Do you know how to read json files using Spark SQL that are LZO compressed?
I'm looking into sqlContext.jsonFile but I don't know how to configure it
to read lzo files.
Best Regards,
Jerry
You can create a DStream that contains the count, transforming the grouped
windowed RDD, like this:
val errorCount = grouping.map { case (k, v) => v.size }
If you need to preserve the key:
val errorCount = grouping.map { case (k, v) => (k, v.size) }
or if you don't care about the content of the
Hi Ted,
Thanks for your help.
I'm able to read lzo files using sparkContext.newAPIHadoopFile but I
couldn't do the same for sqlContext because sqlContext.jsonFile does not
provide ways to configure the input file format. Do you know if there are
some APIs to do that?
Best Regards,
Jerry
On
See this thread: http://search-hadoop.com/m/JW1q5HAuFv
which references https://issues.apache.org/jira/browse/SPARK-2394
Cheers
On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam chiling...@gmail.com wrote:
Hi spark users,
Do you know how to read json files using Spark SQL that are LZO compressed?
I am aware of the ADAM project in Berkeley and I am working on Proteomic
searches -
is anyone else working in this space?
In SQLContext:
def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
  val json = sparkContext.textFile(path)
  jsonRDD(json, samplingRatio)
}
Looks like jsonFile() could be enhanced with a call to
sparkContext.newAPIHadoopFile()
with the proper input file format.
Cheers
On Wed, Dec
Hi all,
I would like to be able to split an RDD into two pieces according to a
predicate. That would be equivalent to applying filter twice, with the
predicate and its complement, which is also similar to Haskell's partition
list function (
yo,
First, here is the scala version:
http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=
Boolean):(Repr,Repr)
Second: an RDD is distributed, so what you'll have to do is either partition each
partition (:-D) or create two RDDs by filtering twice →
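A minimal sketch of the two-filter route, under the assumption that the input is cached so it is not recomputed for the second pass (names are placeholders):

import org.apache.spark.rdd.RDD

// Split an RDD into (matching, nonMatching) by filtering twice with the
// predicate and its complement. Both halves are lazy; caching the source
// avoids recomputing it when each half is evaluated.
def splitByPredicate[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
  val cached = rdd.cache()
  (cached.filter(p), cached.filter(x => !p(x)))
}

// Example: separate even and odd numbers.
//   val (evens, odds) = splitByPredicate(sc.parallelize(1 to 100))(_ % 2 == 0)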
Hi,
A word-count job on a ~1 GB text file takes 1 hour when the input comes from an
NFS mount. The same job took 30 seconds with input from the local file system.
Is there any tuning required for NFS-mounted input?
Thanks
Larry
What I would like to do is count the number of elements and, if the count
is greater than a number, iterate over all of them and store them
in MySQL or another system. So I need to count them and preserve the
values, because I'm saving them in another system.
I know about this map(line => (key, line)), it was
Thanks for the correction, Sean. Do the docs need to be updated on this
point, or is it safer for now just to note 2.4 specifically?
On Wed Dec 17 2014 at 5:54:53 AM Sean Owen so...@cloudera.com wrote:
Spark works fine with 2.4 *and later*. The docs don't mean to imply
2.4 is the last
Basically, what I want to do would be something like:
val errorLines = lines.filter(_.contains(h))
val mapErrorLines = errorLines.map(line => (key, line))
val grouping = errorLinesValue.groupByKeyAndWindow(Seconds(8), Seconds(4))
if (errorLinesValue.getValue().size() > X) {
//iterate values and
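A rough sketch of one way that check could look with foreachRDD, assuming the grouped stream has type DStream[(String, Iterable[String])] and a placeholder threshold; the external write is left as a stand-in:

import org.apache.spark.streaming.dstream.DStream

// Persist only those grouped windows whose value count exceeds `threshold`.
def storeLargeGroups(grouping: DStream[(String, Iterable[String])], threshold: Int): Unit = {
  grouping.foreachRDD { rdd =>
    rdd.foreachPartition { groups =>
      groups.foreach { case (key, values) =>
        if (values.size > threshold) {
          values.foreach { line =>
            // stand-in for an insert into MySQL or another external system
            println(s"$key -> $line")
          }
        }
      }
    }
  }
}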
Hi Andy, thanks for your response. I already thought about filtering
twice; that was what I meant by "that would be equivalent to applying
filter twice", but I was wondering if I could do it in a single pass, so
that it could later be generalized to an arbitrary number of classes. I would
also like
Foreach is slightly more efficient because Spark doesn't bother to try
and collect results from each task since it's understood there will be
no return type. I think the difference is very marginal though - it's
mostly stylistic... typically you use foreach for something that is
intended to
Just to clarify, are you running the application using spark-submit after
packaging with sbt package? One thing that might help is to mark the Spark
dependency as 'provided', as then you shouldn't have the Spark classes in
your jar.
Thanks
Shivaram
On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen
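For reference, a minimal build.sbt sketch of marking Spark as 'provided' (the app name, Scala version, and Spark version are placeholders matching the thread):

// build.sbt
// Marking Spark as "provided" keeps Spark's classes out of the jar that
// `sbt package` produces; the cluster supplies them at runtime.
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"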
Thanks for your replies.
I was building spark from trunk.
Daniel
On 17 Dec 2014, at 19:49, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Thanks for the correction, Sean. Do the docs need to be updated on this
point, or is it safer for now just to note 2.4 specifically?
On Wed
You can create an RDD[String] using whatever method and pass that to
jsonRDD.
On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Ted,
Thanks for your help.
I'm able to read lzo files using sparkContext.newAPIHadoopFile but I
couldn't do the same for sqlContext because
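A hedged sketch of that suggestion, assuming the hadoop-lzo input format (com.hadoop.mapreduce.LzoTextInputFormat) is on the classpath and that `sc` and `sqlContext` already exist; the path is a placeholder:

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read LZO-compressed JSON lines as an RDD[String]...
val jsonLines = sc.newAPIHadoopFile(
  "hdfs:///data/events/*.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, text) => text.toString }

// ...then hand the strings to jsonRDD, which does not care where they came from.
val events = sqlContext.jsonRDD(jsonLines)
events.registerTempTable("events")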
- Dev list
Have you looked at partitioned table support? That would only scan data
where the predicate matches the partition. Depending on the cardinality of
the customerId column that could be a good option for you.
On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao xuelin...@yahoo.com.invalid
Hi,
When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following
query does not work:
create table debug as
select v1.*
from t1 as v1 left join t2 as v2
on v1.sku = v2.sku
where v2.sku is null
Both t1 and t2 have 200 partitions.
t1 has 10k rows, and t2 has 4k rows.
this query
Hi Michael,
This is what I did. I was thinking if there is a more efficient way to
accomplish this.
I was doing a very simple benchmark: Convert lzo compressed json files to
parquet files using SparkSQL vs. Hadoop MR.
Spark SQL seems to require 2 stages to accomplish this task:
Stage 1: read
Hi Akhil,
Thanks for your time, I appreciate it. I tried this approach, but either I am
getting fewer files or more files, not the exact hourly files.
Is there any way I can tell the range (between this time and this time)?
Thanks,
D
On Tue, Dec 16, 2014 at 11:04 PM, Akhil Das ak...@sigmoidanalytics.com
The first pass is inferring the schema of the JSON data. If you already
know the schema you can skip this pass by specifying the schema as the
second parameter to jsonRDD.
On Wed, Dec 17, 2014 at 10:59 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Michael,
This is what I did. I was thinking
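A sketch of passing the schema explicitly, assuming an existing SQLContext `sqlContext` and an RDD[String] of JSON lines called `jsonLines`; the field names are made up:

import org.apache.spark.sql._

// Declaring the schema up front lets jsonRDD skip the inference pass
// over the data (field names here are hypothetical).
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true)
))

val records = sqlContext.jsonRDD(jsonLines, schema)
records.registerTempTable("records")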
To be a little more clear, jsonRDD and jsonFile use the same implementation
underneath. jsonFile is just a convenience method that does
jsonRDD(sc.textFile(...))
On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust mich...@databricks.com
wrote:
The first pass is inferring the schema of the JSON
The problem is very likely NFS, not Spark. What kind of network is it mounted
over? You can also test the performance of your NFS by copying a file from it
to a local disk or to /dev/null and seeing how many bytes per second it can
copy.
Matei
On Dec 17, 2014, at 9:38 AM, Larryliu
Hi, Matei
Thanks for your response.
I tried to copy the file (1G) from NFS and it took 10 seconds. The NFS mount
is in a LAN environment and the NFS server is running on the same server that
Spark is running on. So basically I mount the NFS on the same bare-metal
machine.
Larry
On Wed, Dec 17, 2014
I see, you may have something else configured weirdly then. You should look at
CPU and disk utilization while your Spark job is reading from NFS and, if you
see high CPU use, run jstack to see where the process is spending time. Also
make sure Spark's local work directories (spark.local.dir)
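For reference, a small sketch of pointing spark.local.dir at a fast local disk so shuffle and spill scratch space stays off the NFS mount (the path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("wordcount-nfs")
  // Keep Spark's scratch/work directories on local disk, not on NFS.
  .set("spark.local.dir", "/mnt/local-scratch/spark")

val sc = new SparkContext(conf)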
Yeah, it looks like messages that are successfully posted via Nabble end up
on the Apache mailing list, but messages posted directly to Apache aren't
mirrored to Nabble anymore because it's based off the incubator mailing
list. We should fix this so that Nabble posts to / archives the
Thanks, Matei.
I will give it a try.
Larry
On Wed, Dec 17, 2014 at 1:01 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I see, you may have something else configured weirdly then. You should
look at CPU and disk utilization while your Spark job is reading from NFS
and, if you see high CPU
I am trying to run stateful Spark Streaming computations over (fake)
apache web server logs read from Kafka. The goal is to sessionize
the web traffic similar to this blog post:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
Guys,
I'm trying to join 2-3 SchemaRDDs of approx 30,000 rows and it is terribly
slow. No doubt I get the results, but it takes 8s to do the join and get the
results.
I'm running on a standalone Spark on my machine, which has 8 cores and 12 GB RAM,
with 4 workers.
Not sure why it is consuming so much time, any
I tried sending this message to the users list several hours ago, but it did
not get distributed.
I was just trying to build Spark, v1.1.1 with defaults. It sets hadoop.version
to 1.0.4, and yarn.version to hadoop.version, the dependency entry for
org.apache.hadoop:hadoop-yarn-common sets
There's no such version of YARN. But you only build the YARN support
when you set -Pyarn. Then, yes you have to set yarn.version separately
for earlier versions that didn't match up with Hadoop versions.
http://spark.apache.org/docs/latest/building-with-maven.html
On Thu, Dec 18, 2014 at 12:35
Yeah, I have the same problem with 1.1.0, but not 1.0.0.
Greetings,
The first comment on the issue says that the reason for not supporting multiple
contexts is:
"There are numerous assumptions in the code base that use a shared cache or
thread-local variables or some global identifiers
which prevent us from using multiple SparkContexts."
May it be worked
Jim Carroll wrote
Okay,
I have an rdd that I want to run an aggregate over but it insists on
spilling to disk even though I structured the processing to only require a
single pass.
In other words, I can do all of my processing one entry in the rdd at a
time without persisting anything.
Hi Anton,
That could solve some of the issues (I've played with that a little
bit). But there are still some areas where this would be sub-optimal,
because Spark still uses system properties in some places and those
are global, not per-class loader.
(SparkSubmit is the biggest offender here, but
I am not sure this can help you. I have 57 million ratings, about 4 million users,
and 4k items. I used 7-14 total-executor-cores and executor-memory 13g; the cluster
has 4 nodes, each with 4 cores and 16g max memory.
I found that the following settings may help avoid this problem:
Sean,
Yes, the problem is exactly the anonymous-function mismatch you described.
So if a Spark app (driver) depends on a Spark module jar (for example,
spark-core) to programmatically communicate with a Spark cluster, the user should
not use the pre-built Spark binary but build Spark from the source
Not using spark-submit. The App directly communicates with the Spark cluster in
standalone mode.
If I mark the Spark dependency as 'provided', then the spark-core jar must be
pointed to elsewhere in the CLASSPATH. However, the pre-built Spark binary only has an
assembly jar, not having individual
This might be because Spark SQL first does a shuffle on both of the tables
involved in the join, with the join condition as the key.
I had a specific use case of join where I always join on a specific column,
id, and have an optimisation lined up for that in which I can cache the
data partitioned on the join key id
Thanks, I didn't try the partitioned table support (it sounds like a Hive feature).
Is there any guideline? Should I use hiveContext to create the table with
partitions first?
On Thursday, December 18, 2014 2:28 AM, Michael Armbrust
mich...@databricks.com wrote:
- Dev list
Have you
Hi there,
The following are my steps, and I got the same exception as Daniel's.
Another question: how can I build a tgz file like the pre-built file I
downloaded from the official website?
1. download trunk from git.
2. add the following lines in pom.xml:
+ <profile>
+   <id>hadoop-2.6</id>
+
Hi,
On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell pwend...@gmail.com wrote:
On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com
wrote:
I was wondering why one would choose for rdd.map vs rdd.foreach to
execute a
side-effecting function on an RDD.
Personally, I like to
All,
I'm using the Spark shell to interact with a small test deployment of
Spark, built from the current master branch. I'm processing a dataset
comprising a few thousand objects on Google Cloud Storage, split into a
half dozen directories. My code constructs an object--let me call it the
Dataset
I'm curious if you're seeing the same thing when using bdutil against GCS?
I'm wondering if this may be an issue concerning the transfer rate of Spark
-> Hadoop -> GCS Connector -> GCS.
On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com
wrote:
All,
I'm using the Spark
Denny,
No, gsutil scans through the listing of the bucket quickly. See the
following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls
gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860
real    0m6.971s
user    0m1.052s
sys     0m0.096s
Alex
On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would result in just as fast
scans?
On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta alexbare...@gmail.com
wrote:
Denny,
No, gsutil scans through the listing of the bucket
Hi guys,
Getting the following errors,
2014-12-17 09:05:02,391 [SocialInteractionDAL.scala:Executor task launch
worker-110:20] - --- Inserting into mongo -
2014-12-17 09:05:06,768 [ Logging.scala:Executor task launch
worker-110:96] - Exception in task 1.0 in stage
Well, what do you suggest I run to test this? But more importantly, what
information would this give me?
On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would
You can go through this doc for tuning
http://spark.apache.org/docs/latest/tuning.html
Looks like you are creating a lot of objects and the JVM is spending more
time clearing these. If you can paste the code snippet, then it will be
easy to understand what's happening.
Thanks
Best Regards
On
For Spark to connect to GCS, it utilizes the Hadoop and GCS connector jars
for connectivity. I'm wondering if it's those connection points that are
ultimately slowing down the connection between Spark and GCS.
The reason I was asking if you could run bdutil is because it would be
basically Hadoop
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct google storage directories. So the question becomes the following:
what computation happens when calling the union method on two RDDs?
On
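For what it's worth, a small sketch showing that union by itself is a lazy transformation; it only records lineage, and the directory listing and reading happen when an action runs (the paths and an existing `sc` are assumptions):

// Assumes an existing SparkContext `sc`.
val dirs = Seq(
  "gs://my-bucket/20141205/csv/a/",
  "gs://my-bucket/20141205/csv/b/"
)

// textFile and union are transformations: nothing is read yet.
val combined = dirs.map(dir => sc.textFile(dir)).reduce((a, b) => a.union(b))

// Listing the directories and reading the files happens here, at the action.
println(combined.count())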