input data is evenly distributed to the executors.
The input data is on HDFS, not on the Spark cluster. How can I make
the data distributed to the executors?
On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng men...@gmail.com wrote:
The weight vector is usually dense and if you have many
When you do groupBy(), it tries to load all of the data into memory for best
performance, so you should specify the number of partitions carefully.
In Spark master or the upcoming 1.1 release, PySpark can do an external groupBy(),
which means it will dump the data to disk if there is not enough memory.
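For example, a minimal PySpark sketch of passing an explicit partition count to groupByKey() (the data and the partition count are illustrative only):

from pyspark import SparkContext

sc = SparkContext(appName="groupby-partitions-sketch")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# passing numPartitions explicitly keeps each resulting partition small enough
# to be processed in memory
grouped = pairs.groupByKey(numPartitions=8)
print(grouped.mapValues(list).collect())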
parallelize uses the default Serializer (PickleSerializer) while textFile
uses UTF8Serializer.
You can get around this with index.zip(input_data._reserialize()) (or
index.zip(input_data.map(lambda x: x)))
(But if you try to just do this, you run into the issue with different
number of
Hi,
I want to manage logging of containers when I run Spark through YARN. I
checked that there is an environment variable exposed for a custom log4j.properties.
Setting SPARK_LOG4J_CONF to /dir/log4j.properties should ideally make
containers use the
/dir/log4j.properties file for logging. This doesn't seem
Hi,
I have an RDD with n rows and m columns, but most of the entries are 0;
it is a sparse matrix.
I would like to get only the non-zero entries along with their indices.
The equivalent Python code would be:
for i, x in enumerate(matrix):
    for j, y in enumerate(x):
        if y:
            print i, j, y
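In PySpark, a minimal sketch of one way to do this (assuming the RDD holds one row per element, as a list of numbers; the data and names are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="sparse-entries-sketch")
matrix = sc.parallelize([[0, 0, 3], [0, 5, 0], [0, 0, 0]])

# zipWithIndex() attaches the row index i; flatMap then emits (i, j, value)
# for every non-zero cell in that row
entries = matrix.zipWithIndex().flatMap(
    lambda row_i: [(row_i[1], j, y) for j, y in enumerate(row_i[0]) if y])
print(entries.collect())   # [(0, 2, 3), (1, 1, 5)]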
What's the format of the file header? Is it possible to filter them out by
prefix string matching or regex?
On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO raofeng...@gmail.com wrote:
It will certainly cause bad performance, since it reads the whole content
of a large file into one value,
It's really a good question! I'm also working on it.
On Wed, Jul 30, 2014 at 11:45 AM, adu dujinh...@hzduozhun.com wrote:
Hi all,
As the title says: I want to run a job on two specific nodes in the cluster. How do I
configure YARN for this? Does a YARN queue help?
Thanks
Hi all,
I'm trying to get the jobserver working with Spark 1.0.1. I've got it
building, tests passing and it connects to my Spark master (e.g.
spark://hadoop-001:7077).
I can also pre-create contexts. These show up in the Spark master console
i.e. on hadoop-001:8080
The problem is that after I
Of course we can filter them out. A typical file header is as below:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port
cs-username c-ip cs(User-Agent) sc-status sc-substatus
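For reference, a minimal PySpark sketch of filtering those header lines out by prefix (the file path is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="iis-header-filter-sketch")
lines = sc.textFile("hdfs:///logs/iis/u_ex130704.log")   # hypothetical path
# IIS/W3C header lines all start with '#', so a plain prefix test drops them
records = lines.filter(lambda line: not line.startswith("#"))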
Hi,
I took Avro 1.7.7 and recompiled my distribution to fix the issue I was hitting
when dealing with Avro GenericRecord; I'm referring to AVRO-1476, and that issue
is now resolved.
I also enabled kryo registration in SparkConf.
That said, I am still seeing a NotSerializableException for
Hi Jianshi,
I've met a similar situation before.
My solution was 'ulimit'; you can use
-a to see your current settings
-n to set the open-files limit
(and other limits as well).
I set -n to 10240.
I see spark.shuffle.consolidateFiles helps by reusing open files
(so I don't know to what extent
Akhil, Andry, thanks a lot for your suggestions. I will take a look at
those JVM options.
Greetings,
Juan
2014-07-28 18:56 GMT+02:00 andy petrella andy.petre...@gmail.com:
Also check the guides for the JVM option that prints messages for such
problems.
Sorry, sent from phone and don't know
:240] Master ID:
20140730-165621-1526966464-5050-23977 Hostname: CentOS-19
I0730 16:56:21.127964 23977 master.cpp:322] Master started on
192.168.3.91:5050
I0730 16:56:21.127989 23977 master.cpp:332] Master allowing unauthenticated
frameworks to register!!
I0730 16:56:21.130705 23979 master.cpp:757
I am creating an API that can access data stored using an Avro schema. The
API can only know the Avro schema at runtime, when it is passed as a parameter by
a user of the API. I need to initialize a custom serializer with the Avro
schema on remote worker and driver processes. I've tried to set the
Hi Andrew,
I'm not sure why an assembly would help (I don't have the Spark source code, I
have just included Spark core and Spark streaming in my dependencies in my
build file). I did try it though and the error is still occurring. I have
tried cleaning and refreshing SBT many times as
That makes sense, thanks Chris.
I'm currently reworking my code to use the newAPIHadoopRDD with an
AvroSequenceFileInputFormat (see below), but I think I'll run into the
same issue. I'll give your suggestion a try.
val avroRdd = sc.newAPIHadoopFile(fp,
Hi everybody,
I have a scenario where I would like to stream data to different
persistence types (i.e. SQL DB, graph DB, HDFS, etc.) and perform some
filtering and transformation as the data comes in.
The problem is maintaining consistency between all the datastores (maybe some
operation could fail).
Hi Oleg,
Did you ever figure this out? I'm observing the same exception also in
0.9.1 and think it might be related to setting spark.speculation=true. My
theory is that multiple attempts at the same task start, the first finishes
and cleans up the _temporary directory, and then the second fails
*TL;DR 50% of the time I can't SSH into either my master or slave nodes
and have to terminate all the machines and restart the EC2 cluster setup
process.*
Hello,
I'm trying to setup a Spark cluster on Amazon EC2. I am finding the setup
script to be delicate and unpredictable in terms of reliably
Hello,
I fear you will have to write your own transaction logic for it (coordination,
e.g. via ZooKeeper, a transaction log, or, depending on your requirements, Raft
/Paxos etc.).
However, before you embark on this journey, ask yourself whether your
application really needs it and what data load you expect.
Hi,
Have you checked out SchemaRDD?
There should be an example of writing to Parquet files there.
BTW, FYI, I was discussing this with the Spark SQL developers last week, and we
discussed possibly using Apache Gora [0] for achieving this.
HTH
Lewis
[0] http://gora.apache.org
On Wed, Jul 30, 2014 at 5:14 AM,
It looks reasonable. You can also try the treeAggregate (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89)
instead of normal aggregate if the driver needs to collect a large weight
vector from each partition. -Xiangrui
On Wed,
You need to increase the wait time (-w); the default is 120 seconds, and you
may set it to a higher number like 300-400. The problem is that EC2 takes
some time to initiate the machines (sometimes more than 120 seconds).
Thanks
Best Regards
On Wed, Jul 30, 2014 at 8:52 PM, William Cox
William,
The error you are seeing is misleading. There is no need to terminate the
cluster and start over.
Just re-run your launch command, but with the additional --resume option
tacked on the end.
As Akhil explained, this happens because AWS is not starting up the
instances as quickly as the
To add to this: for this many (= 20) machines I usually use at least
--wait 600.
On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
William,
The error you are seeing is misleading. There is no need to terminate the
cluster and start over.
Just re-run your
They are found in the executors' logs (not the worker's). In general, all
code inside foreach or map etc. is executed on the executors. You can find
these either through the Master UI (under Running Applications) or manually
on the worker machines (under $SPARK_HOME/work).
-Andrew
2014-07-30
I'm looking to select the top n records (by rank) from a data set of a few
hundred GBs. My understanding is that JavaRDD.top(n, comparator) is
entirely a driver-side operation in that all records are sorted in the
driver's memory. I prefer an approach where the records are sorted on the
cluster
Hi guys,
I'm very new to Spark land, coming from the old-school MapReduce world. I
have no idea about Scala. Can the Java/Python APIs compete with the native
Scala API? Is Spark heavily Scala-centric, with bindings for other
languages only for starting up and testing, while serious work in Spark
will
Thanks to the both of you for your inputs. Looks like I'll play with the
mapPartitions function to start porting MapReduce algorithms to Spark.
On Wed, Jul 30, 2014 at 1:23 PM, Sean Owen so...@cloudera.com wrote:
Really, the analog of a Mapper is not map(), but mapPartitions(). Instead
of:
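To make the idea concrete, a minimal PySpark sketch of treating a partition like a Mapper's input split (the records, parsing, and setup steps are hypothetical placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="mapper-analog-sketch")
lines = sc.parallelize(["1,foo", "2,bar", "3,baz"])

def map_partition(iterator):
    # like Mapper.setup(): do expensive per-partition initialization once here
    for line in iterator:
        key, value = line.split(",")
        yield (key, value)
    # like Mapper.cleanup(): release per-partition resources here

pairs = lines.mapPartitions(map_partition)
print(pairs.collect())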
Java is very close to Scala across the board; the only thing missing in it
right now is GraphX (which is still alpha). Python is missing GraphX, streaming,
and a few of the ML algorithms, though most of them are there. So it should be
fine to start with any of them. See
I have a data file that I need to process using Spark . The file has multiple
events for different users and I need to process the events for each user in
the order it is in the file.
User 1 : Event 1
User 2: Event 1
User 1 : Event 2
User 3: Event 1
User 2: Event 2
User 3: Event 2
etc..
I want
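A minimal PySpark sketch of one way to group events per user while keeping the order they appear in the file (the path and the 'User X : Event Y' line format follow the example above; the parsing is illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="per-user-events-sketch")
lines = sc.textFile("hdfs:///tmp/events.txt")   # hypothetical path

def parse(pair):
    line, pos = pair
    user, event = [s.strip() for s in line.split(":", 1)]
    return (user, (pos, event))

per_user = (lines.zipWithIndex()   # remember each event's position in the file
                 .map(parse)
                 .groupByKey()     # collect all events for one user together
                 .mapValues(lambda evs: [e for _, e in sorted(evs)]))  # restore file order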
No, it's definitely not done on the driver. It works as you say. Look
at the source code for RDD.takeOrdered, which is what top calls.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
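For illustration, a minimal PySpark sketch of the same pattern (the record structure and rank field are made up):

from pyspark import SparkContext

sc = SparkContext(appName="top-n-sketch")
records = sc.parallelize([("a", 3.2), ("b", 9.1), ("c", 5.5), ("d", 0.7)])

# top()/takeOrdered() compute per-partition candidates on the executors and only
# merge the small candidate lists on the driver, so only n records per partition
# ever reach the driver
top2 = records.top(2, key=lambda r: r[1])              # highest rank first
bottom2 = records.takeOrdered(2, key=lambda r: r[1])   # lowest rank first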
On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar
I am using 1.0.1 and I am running locally (I am not providing any master
URL). But the zip() does not produce the correct count as I mentioned above.
So not sure if the issue has been fixed in 1.0.1. However, instead of using
zip, I am now using the code that Sean has mentioned and am getting the
I have also used LabeledPoint or the libSVM format (for sparse data) with
DecisionTree. When I had categorical labels (not features), I mapped the
categories to numerical data as part of the data transformation step (i.e.
before creating the LabeledPoint).
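A minimal PySpark sketch of that transformation step (the labels and features are made up):

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="label-mapping-sketch")
raw = sc.parallelize([("spam", [1.0, 0.0]), ("ham", [0.0, 1.0])])

# map the categorical labels to numeric indices before building LabeledPoints
label_to_index = {"spam": 0.0, "ham": 1.0}
points = raw.map(lambda r: LabeledPoint(label_to_index[r[0]], r[1]))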
I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
I have an RDD[String] which I've repartitioned so it has 100 partitions (hoping
to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get
more than 24 tasks (my total number of
For the time being, we decided to take a different route. We created a REST
API layer in our app and allowed SQL queries to be passed via REST. Internally
we pass that query to the Spark SQL layer on the RDD and return the
results. With this, Spark SQL is supported for our RDDs via the REST API.
Very cool. Glad you found a solution that works.
On Wed, Jul 30, 2014 at 1:04 PM, Venkat Subramanian vsubr...@gmail.com
wrote:
For the time being, we decided to take a different route. We created a REST
API layer in our app and allowed SQL queries to be passed via REST. Internally
we pass that
Hello,
I am attempting to install Spark 1.0.1 on a windows machine but I've been
running into some difficulties. When I attempt to run some examples I am
always met with the same response: Failed to find Spark assembly JAR. You
need to build Spark with sbt\sbt assembly before running this
This is correct behavior. Each core can execute exactly one task at a
time, with each task corresponding to a partition. If your cluster only has
24 cores, you can only run at most 24 tasks at once.
You could run multiple workers per node to get more executors. That would
give you more cores in
Kc
On Jul 30, 2014 3:55 PM, srinivas kusamsrini...@gmail.com wrote:
Hi,
I am trying to get data from MySQL using JdbcRDD, with the code below. The table has
three columns.
val url = "jdbc:mysql://localhost:3306/studentdata"
val username = "root"
val password = "root"
val mysqlrdd = new
I sometimes see that after fully caching the data, if one of the executors
fails for some reason, that portion of cache gets lost and does not get
re-cached, even though there are plenty of resources. Is this a bug or a
normal behavior (V1.0.1)?
Hi Srini,
I believe the JdbcRDD requires input splits based on ranges within the
query itself. As an example, you could adjust your query to something like:
SELECT * FROM student_info WHERE id >= ? AND id <= ?
Note that the values you've passed in '1, 20, 2' correspond to the lower
bound index,
Hi all,
I'm a new Spark user and I'm interested in deploying java-based Spark
applications directly from Eclipse. Specifically, I'd like to write a
Spark/java application, have it directly submitted to a spark cluster, and
then see the output of the spark execution directly in the eclipse
Thanks.
So to make sure I understand. Since I'm using a 'stand-alone' cluster, I would
set SPARK_WORKER_INSTANCES to something like 2 (instead of the default value of
1). Is that correct? But it also sounds like I need to explicitly set a
value for SPARK_WORKER_CORES (based on what the
ShreyanshB shreyanshpbh...@gmail.com writes:
The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb.
It'd be great if you can tell me how to configure and invoke this spark
version.
Sorry for the delay on this. Assuming you're planning to launch an EC2
The exception in Python means that the worker tried to read a command from the
JVM but reached
the end of the socket (the socket had been closed). So it's possible that
another exception
happened in the JVM.
Could you change the log level of log4j, then check whether there is any
problem inside the JVM?
Davies
On Wed,
Ok... but my question is why spark.shuffle.consolidateFiles is working (or
is it)? Is this a bug?
On Wed, Jul 30, 2014 at 4:29 PM, Larry Xiao xia...@sjtu.edu.cn wrote:
Hi Jianshi,
I've met a similar situation before.
My solution was 'ulimit'; you can use
-a to see your current settings