At Cloudera we recommend bundling your application separately from the
Spark libraries. The two biggest reasons are:
* No need to modify your application jar when upgrading or applying a patch.
* When running on YARN, the Spark jar can be cached as a YARN local resource, meaning it doesn't need to be re-uploaded for every application.
Here is a tutorial on how to customize your own file format in Hadoop:
https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Once you have your own file format, you can use it the same way as
TextInputFormat in Spark, as you have done in this post.
There is a VertexPartition in the EdgePartition, which is created by
EdgePartitionBuilder.toEdgePartition, and there is also a
ShippableVertexPartition in the VertexRDD.
These two partitions have a lot in common, such as the index, data and
BitSet. Why is this necessary?
I'm using SparkSQL with Hive 0.13, here is the SQL for inserting a partition
with 2048 buckets.
sqlsc.set("spark.sql.shuffle.partitions", "2048")
hql("""insert %s table mz_log
  |PARTITION (date='%s')
  |select * from tmp_mzlog
the value in (key, value) returned by textFile is exactly one line of the
input.
But what I want is the fields between the two “!!” markers; hope this makes sense.
-
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc
I am trying to serialize objects contained in RDDs using runtime reflection
via TypeTag. However, the Spark job keeps failing with a
java.io.NotSerializableException on an instance of TypeCreator
(auto-generated by the compiler to enable TypeTags). Is there any workaround
for this without switching to scala
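One common workaround, sketched under the assumption that the tasks only need the erased class (java.lang.Class is serializable, a TypeCreator is not):

  import scala.reflect.runtime.universe._

  // compute what you need from the TypeTag on the driver, so the task
  // closure captures only serializable values
  def tagFreeProcess[T: TypeTag](rdd: org.apache.spark.rdd.RDD[T]): Unit = {
    val runtimeClass = typeTag[T].mirror.runtimeClass(typeOf[T]) // evaluated on the driver
    rdd.foreach(x => require(runtimeClass.isInstance(x)))        // captures only a Class
  }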
Nope.
My input file's format is:
!!
string1
string2
!!
string3
string4
sc.textFile(path) will return RDD("!!", "string1", "string2", "!!",
"string3", "string4").
What we need now is to transform this RDD into RDD("string1", "string2",
"string3", "string4").
Your solution may not handle this.
Oh, you literally mean these are different lines, not the structure of a line.
You can't solve this in general by reading the entire file into one
string. If the input is tens of gigabytes, you will probably exhaust
memory on any of your machines. (Or you might as well not bother with
Spark.)
Exactly, the fields between the !! markers form a customized (key, value) data structure.
So, newAPIHadoopFile may be the best practice for now. For this specific format,
changing the delimiter from the default \n to !!\n is the cheapest fix, but
that is only possible in Hadoop 2.x; in Hadoop 1.x, this can be done by
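For the Hadoop 2.x route, here is a minimal sketch (assuming the input sits at a hypothetical `path`, and that records are terminated by "!!\n"):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val conf = new Configuration(sc.hadoopConfiguration)
  // treat "!!\n" instead of "\n" as the record separator (honored by Hadoop 2.x)
  conf.set("textinputformat.record.delimiter", "!!\n")
  val records = sc
    .newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map { case (_, text) => text.toString.trim }
    .filter(_.nonEmpty) // drops the empty record produced before the leading "!!"

For the sample input above, records would then hold "string1\nstring2" and "string3\nstring4".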
Hi
I was using saveAsTextFile earlier. It was working fine. When we migrated to
spark-1.0, I started getting the following error:
java.lang.ClassNotFoundException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
Hi,
I have started an EC2 cluster using Spark by running the spark-ec2 script.
Just a little confused: I cannot find the sbt/ directory under /spark.
I have checked the Spark version; it's 1.0.0 (the default). When I was working
with 0.9.x, sbt/ was there.
Was the script changed in 1.0.x? I cannot find any
update:
Just checked the Python launch script; when retrieving Spark, it refers
to this script:
https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh
where each version number is mapped to a tar file, e.g.:
0.9.2)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget
+user list
bcc: dev list
It's definitely possible to implement credit fraud management using Spark.
A good start would be using some of the supervised learning algorithms
that Spark provides in MLlib (logistic regression or linear SVMs).
Spark doesn't have any HMM implementation right now.
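As a minimal sketch of that starting point (the feature layout and numbers here are invented; only the MLlib calls are real):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // hypothetical features: amount, is-domestic flag, night-time flag; label 1.0 = fraud
  val training = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(120.0, 1.0, 0.0)),
    LabeledPoint(1.0, Vectors.dense(9800.0, 0.0, 1.0))
  ))
  val model = LogisticRegressionWithSGD.train(training, 100) // 100 iterations
  val score = model.predict(Vectors.dense(450.0, 1.0, 0.0))  // 0.0 or 1.0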
A quick fix would be to implement java.io.Serializable in those classes
which are causing this exception.
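For instance, a minimal sketch (the class name is made up):

  // marking the captured class Serializable lets Spark ship it in task closures
  class FeatureExtractor extends java.io.Serializable {
    def extract(line: String): Array[String] = line.split(",")
  }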
Thanks
Best Regards
On Mon, Jul 28, 2014 at 9:21 PM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi all,
I was wondering if someone has conceived a method for
Also check the guides for the JVM option that prints messages for such
problems.
Sorry, sent from my phone and I don't know it by heart :/
On 28 Jul 2014 18:44, Akhil Das ak...@sigmoidanalytics.com wrote:
A quick fix would be to implement java.io.Serializable in those classes
which are causing
Hi Aureliano,
Will it be possible for you to give the test case? You can add it to the JIRA
as an attachment as well, I guess...
I am preparing the PR for the ADMM-based QuadraticMinimizer... In my MATLAB
experiments with scaling the rank to 1000 and beyond (which is too high for
ALS but gives a good
It is possible that the answers (the final solution vector x) given by two
different algorithms (such as the one in MLlib and the one in R) are different, as
the problem may not be strictly convex and multiple global optima may
exist. However, these answers should admit the same objective value. Can
you
19678 fetcher.cpp:102] Downloading resource from
'hdfs://localhost:8020/spark/spark-1.0.0-cdh5.1.0-bin-2.3.0-cdh5.0.3.tgz'
to
'/tmp/mesos/slaves/20140724-134606-16777343-5050-25095-0/frameworks/20140728-143300-24815808-5050-19059-/executors/20140724-134606-16777343-5050-25095-0/runs/c9c9eaa2-b722
On Mon, Jul 28, 2014 at 4:29 AM, Larry Xiao xia...@sjtu.edu.cn wrote:
On 7/28/14, 3:41 PM, shijiaxin wrote:
There is a VertexPartition in the EdgePartition, which is created by
EdgePartitionBuilder.toEdgePartition, and there is also a
ShippableVertexPartition in the VertexRDD.
These two
hey, we used to publish spark in-house by simply overriding the publishTo
setting. but now that the sbt build is integrated with maven i cannot find it
anymore.
i tried looking into the pom file, but after reading 1144 lines of xml
1) i haven't found anything that looks like publishing
2) i feel
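for reference, the old sbt-only approach looked roughly like this (repository URL and credentials path are invented):

  // build.sbt — a minimal sketch of overriding publishTo for an in-house repository
  publishTo := Some("Internal Releases" at "https://repo.example.com/releases")
  credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")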
Hi Team,
Could you please let me know an example program/link for using Java Spark SQL
to join 2 HBase tables.
Regards,
Rajesh
and if i want to change the version, it seems i have to change it in all 23
pom files? mhhh. is it mandatory for these sub-project pom files to repeat
that version info? useful?
spark$ grep 1.1.0-SNAPSHOT * -r | wc -l
23
On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote:
This is not something you edit yourself. The Maven release plugin
manages setting all this. I think virtually everything you're worried
about is done for you by this plugin.
Maven requires artifacts to set a version and it can't inherit one. I
feel like I understood the reason this is necessary
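As a concrete illustration (a sketch, not from this thread): the versions plugin can rewrite the parent and all module POMs in one command, which is the kind of bookkeeping the release plugin drives for you.

  mvn versions:set -DnewVersion=1.1.0-SNAPSHOT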
I am trying to run an example Spark standalone app with the following code
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
object SparkGensimLDA extends App {
  val ssc = new StreamingContext("local", "testApp", Seconds(5))
  val
ah ok thanks. guess i am gonna read up about maven-release-plugin then!
On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:
This is not something you edit yourself. The Maven release plugin
manages setting all this. I think virtually everything you're worried
about is done
Hi All,
Not sure if anyone has run into this problem, but it exists in Spark 1.0.0
when you specify the location in conf/spark-defaults.conf for
spark.eventLog.dir hdfs:///user/$USER/spark/logs
in order to use the $USER env variable.
For example, I'm running the command with user 'test'.
In
I think the 1.0 AMI only contains the prebuilt packages (i.e. just the
binaries) of Spark and not the source code. If you want to build Spark on
EC2, you can clone the github repo and then use sbt.
Thanks
Shivaram
On Mon, Jul 28, 2014 at 8:49 AM, redocpot julien19890...@gmail.com wrote:
So after months and months, I finally started to try and tackle this, but
my Scala ability isn't up to it.
The problem is that, of course, even with the common interface, we don't
want interoperability between RDDs and DStreams.
I looked into monads, as per Ashish's suggestion, and I think I
Hi Xiangrui,
thanks for the explanation.
1. You said we have to broadcast m * k centers (with m = number of rows). I
thought there were only k centers at each time, which would then have size
n * k and would need to be broadcast. Is that a typo or did I understand
something wrong?
And the
I have a file in s3 that I want to map each line with an index. Here is my
code:
input_data = sc.textFile('s3n://myinput', minPartitions=6).cache()
N = input_data.count()
index = sc.parallelize(range(N), 6)
index.zip(input_data).collect()
...
14/07/28 19:49:31 INFO DAGScheduler: Completed
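An aside, sketched rather than taken from this thread: if your Spark version has RDD.zipWithIndex, it avoids hand-building an index RDD, whose partitioning must otherwise line up exactly with the data for zip() to work.

  # index each line without a separate parallelized range; yields (index, line) pairs
  input_data = sc.textFile('s3n://myinput', minPartitions=6).cache()
  indexed = input_data.zipWithIndex().map(lambda kv: (kv[1], kv[0]))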
Hi Andrew,
It's definitely not bad practice to use spark-shell with HistoryServer. The
issue here is not with spark-shell, but the way we pass Spark configs to
the application. spark-defaults.conf does not currently support embedding
environment variables, but instead interprets everything as a
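A possible workaround, sketched as an assumption rather than something confirmed in this thread: let the shell expand $USER before Spark sees it, e.g. by passing the config on the command line:

  spark-submit --conf spark.eventLog.dir=hdfs:///user/$USER/spark/logs ...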
1. I meant in the n (1k) by m (10k) case, we need to broadcast k
centers and hence the total size is m * k. In 1.0, the driver needs to
send the current centers to each partition one by one. In the current
master, we use torrent to broadcast the centers to workers, which
should be much faster.
2.
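To illustrate the broadcast pattern described in point 1 (all names here are hypothetical):

  // ship the current centers to the workers once, instead of once per task
  val bcCenters = sc.broadcast(centers) // centers: Array[Vector]
  val assignments = points.map { p =>   // points: RDD[Vector]
    bcCenters.value.indices.minBy(i => distance(bcCenters.value(i), p)) // distance: hypothetical
  }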
Hi Andrew,
Thanks for re-confirming the problem. I thought it only happened with my own build. :)
By the way, we have multiple users using the spark-shell to explore their
datasets, and we are continuously looking into ways to isolate their job
histories. In the current situation, we can't really ask
Hi Jianshi,
My understanding is 'No', based on how Spark is designed, even with your own
log4j.properties in Spark's conf folder.
In YARN mode, the ApplicationMaster is running inside the cluster, and all logs
are part of the container logs, which are governed by another log4j.properties file
from
Thank you for your reply.
I need sbt for packaging my project and then submitting it.
Could you tell me how to run a Spark project on the 1.0 AMI without sbt?
I don't understand why 1.0 only contains the prebuilt packages. I don't think
it makes sense, since sbt is essential.
The user has to download sbt
Do getExecutorStorageStatus and getExecutorMemoryStatus both return the
number of executors + the driver?
E.g., if I submit a job with 10 executors, I get 11 for
getExecutorStorageStatus.length and getExecutorMemoryStatus.size
On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai nicolas@gmail.com
Yes, both of these are derived from the same source, and this source
includes the driver. In other words, if you submit a job with 10 executors
you will get back 11 for both statuses.
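So if you only want the executor count, a sketch:

  // subtract 1 to exclude the driver's own entry
  val numExecutors = sc.getExecutorStorageStatus.length - 1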
2014-07-28 15:40 GMT-07:00 Sung Hwan Chung coded...@cs.stanford.edu:
Do getExecutorStorageStatus and
On Mon, Jul 28, 2014 at 12:53 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
But when done processing, one would still have to pull out the wrapped
object, knowing what it was, and I don't see how to do that.
It's pretty tricky to get the level of type safety you're looking for. I
I'm trying to launch Spark with this command on AWS:
./spark-ec2 -k keypair_name -i keypair.pem -s 5 -t c1.xlarge -r us-west-2 --hadoop-major-version=2.4.0 launch spark_cluster
This script is erroring out with this message:
ssh: connect to host hostname port 22: Connection refused
Error
All of the scripts we use to publish Spark releases are in the Spark
repo itself, so you could follow these as a guideline. The publishing
process in Maven is similar to in SBT:
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65
On Mon, Jul 28, 2014 at 12:39 PM,
Hi,
In order to evaluate the ML classification accuracy, I am zipping up the
prediction and test labels as follows and then comparing the pairs in
predictionAndLabel:
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
However, I am
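For completeness, a common way to turn those pairs into an accuracy number (a sketch, not taken from the message):

  val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count.toDouble / test.count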
Hi,
I'm trying to split one large multi-field text file into many single-field
text files.
My code is like this: (somewhat simplified)
final Broadcast<ColSchema> bcSchema = sc.broadcast(schema);
final String outputPathName = env.outputPathName;
Running as a Standalone Cluster. From my monitoring console:
Spark Master at spark://101.73.54.149:7077
* URL: spark://101.73.54.149:7077
* Workers: 1
* Cores: 2 Total, 0 Used
* Memory: 2.4 GB Total, 0.0 B Used
* Applications: 0 Running, 24
spark.MapOutputTrackerMasterActor: 'Asked to send map output locations for
shuffle 0 to ...' takes too much time. What should I do? What is the correct
configuration?
The BlockManager times out if I use a small number of reduce partitions.
This may occur when the EC2 instances are not ready and the SSH port is not open
yet.
Please allow more time by specifying -w 300. The default should be 120.
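For example (the remaining flags echo the original command):

  ./spark-ec2 -k keypair_name -i keypair.pem -w 300 -s 5 -t c1.xlarge -r us-west-2 --hadoop-major-version=2.4.0 launch spark_cluster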
Thanks,
Tracy
Sent from my iPhone
On 29 Jul 2014, at 8:17 AM, sparking research...@gmail.com wrote:
I'm trying to launch Spark with this command
Hi All,
Before sc.runJob invokes dagScheduler.runJob, the func performed
on the RDD is cleaned by ClosureCleaner.clean.
Why does Spark have to do this? What's the purpose?
I am not sure about the specific purpose of this function, but Spark needs
to remove elements from the closure that may be included by default but are
not really needed, so that it can serialize the closure and send it to
executors to operate on the RDD. For example, a function passed to map on
an RDD may reference
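A sketch of the kind of accidental capture being described (class and field names are invented):

  class Pipeline(sc: org.apache.spark.SparkContext) {
    val factor = 2 // referencing this field from a closure drags the whole Pipeline in

    def scale(rdd: org.apache.spark.rdd.RDD[Int]) = {
      val localFactor = factor // copy to a local val so only an Int is captured
      rdd.map(_ * localFactor) // this closure is now trivially serializable
    }
  }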
I see Andrew, thanks for the explanation.
On Tue, Jul 29, 2014 at 5:29 AM, Andrew Lee alee...@hotmail.com wrote:
I was thinking maybe we can suggest that the community enhance the Spark
HistoryServer to capture the last failure exception from the container logs
of the last failed stage?
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes in a reasonable time. I guess using torrent helps a lot in this
case.
Best regards,
Simon
Hi,
We have set up Spark on an HPC system and are trying to implement a
data pipeline and some algorithms on it.
The input data is in HDF5 (these are very high resolution brain images) and
it can be read via the h5py library in Python. So, my current approach (which
seems to be working) is writing
Great! Thanks for testing the new features! -Xiangrui
On Mon, Jul 28, 2014 at 8:58 PM, durin m...@simon-schaefer.net wrote:
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
master. If you don't want to switch to 1.0.1 or master, try to cache
and count test first. -Xiangrui
On Mon, Jul 28, 2014 at 6:07 PM, SK skrishna...@gmail.com wrote:
Hi,
In order to evaluate the ML classification accuracy, I am
That looks good to me since there is no Hadoop InputFormat for HDF5.
But remember to specify the number of partitions in sc.parallelize to
use all the nodes. You can change `process` to `read` which yields
records one-by-one. Then sc.parallelize(files,
numPartitions).flatMap(read) returns an RDD
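A sketch of that read-per-file pattern (the dataset name and file list are hypothetical, and it assumes the paths are readable from every worker, e.g. on a shared HPC filesystem):

  import h5py

  def read(path):
      # yield records one-by-one from a single HDF5 file
      with h5py.File(path, "r") as f:
          for row in f["images"]:  # "images" is a hypothetical dataset name
              yield row.tolist()

  files = ["/data/img_001.h5", "/data/img_002.h5"]  # hypothetical
  rdd = sc.parallelize(files, len(files)).flatMap(read)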
Hi,
Even though hive.metastore.warehouse.dir in hive-site.xml is set to the
default /user/hive/warehouse and the permissions are correct in HDFS,
HiveContext seems to be creating metastore locally instead of hdfs. After
looking into the spark code, I found the following in HiveContext.scala:
Hi
I got an error message while using Hive and SparkSQL.
This is the code snippet I used.
(in spark-shell , 1.0.0)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
var sample = hive.hql("select * from sample10") // This