Thank you, TD. This is important information for us. Will keep an eye on
that.
Cheers,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Yes, this is the limitation of the current implementation. But this will
Hi All,
I got an error while using DecisionTreeModel (my program is written in Java,
Spark 1.0.1, Scala 2.10.1).
I read a local file, loaded it as an RDD, and then sent it to DecisionTree for
training. See below for details:
JavaRDD<LabeledPoint> Points = lines.map(new ParsePoint()).cache();
Thanks Tathagata! I tried it, and it worked out perfectly.
On Thu, Jul 17, 2014 at 6:34 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Step 1: use rdd.mapPartitions(). That's equivalent to the mapper of
MapReduce. You can open a connection, get all the data and buffer it, close
the connection,
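Something like this minimal sketch of the buffering pattern (ExternalClient
and its lookup/close methods are hypothetical stand-ins for whatever service
you connect to):

val results = rdd.mapPartitions { iter =>
  val conn = new ExternalClient()              // one connection per partition
  val buffered = iter.map(conn.lookup).toList  // pull everything while the connection is open
  conn.close()                                 // close before handing back the iterator
  buffered.iterator
}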
It is very true that making predictions in batch for all 1 million users
against the 10k items will be quite onerous in terms of computation. I have
run into this issue too in making batch predictions.
Some ideas:
1. Do you really need to generate recommendations for each user in batch?
How are
I have a standalone Spark cluster and an HDFS cluster which share some nodes.
When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
HDFS for the location of each file block in order to pick the right worker node?
How about a Spark cluster on YARN?
Thank you very much!
Good to know! I am bumping the priority of this issue in my head. Thanks
for the feedback. Others seeing this thread, please comment if you think
that this is an important issue for you as well.
Not at my computer right now, but I will make a JIRA for this.
TD
On Jul 17, 2014 11:22 PM, Yan Fang
Hi Haopu,
Spark will ask HDFS for file block locations and try to assign tasks based
on these.
There is a snag. Spark schedules its tasks inside executor processes that
stick around for the lifetime of a Spark application. Spark requests
executors before it runs any jobs, i.e. before it has
And you might want to apply clustering beforehand. It is likely that not
every user and every item is truly unique.
Bertrand Dechoux
On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:
It is very true that making predictions in batch for all 1 million users
against the 10k
Well, for what it's worth, I found the issue after spending the whole night
running experiments ;).
Basically, I needed to give a higher number of partitions to the groupByKey.
I was simply using the default, which generated only 4 partitions, and so the
whole thing blew up.
Sandy,
Do you mean the “preferred location” also works for a standalone cluster?
Because I checked the code of SparkContext and saw the comments below:
// This is used only by YARN for now, but should be relevant to other cluster
types (Mesos,
// etc) too. This is typically
Hello,
I want to run Shark on yarn.
My environment
Shark-0.9.1.
Spark-1.0.0
hadoop-2.3.0
My first question is: is it possible to run Shark 0.9.1 with
Spark 1.0.0 on YARN, or do Shark and Spark necessarily have to be the same
version?
For the moment, when I make a request like show
By looking at the code of JobScheduler, I found the parameters below:
private val numConcurrentJobs =
ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)
Does that mean each App can have only one active stage?
Thanks
On Fri, Jul 18, 2014 at 12:22 AM, Ankur Dave ankurd...@gmail.com wrote:
If your sendMsg function needs to know the incoming messages as well as
the vertex value, you could define VD to be a tuple of the vertex value and
the last received message. The vprog function would then store
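A hedged sketch of that shape (the Double/Int attribute and message types,
and the update logic, are made up for illustration; only the tuple-VD trick
is the point):

import org.apache.spark.graphx._

// VD = (real vertex value, last received message)
def vprog(id: VertexId, attr: (Double, Int), msg: Int): (Double, Int) =
  (attr._1 + msg, msg)                      // update the value and remember the message
def sendMsg(t: EdgeTriplet[(Double, Int), Int]): Iterator[(VertexId, Int)] = {
  val (value, lastMsg) = t.srcAttr          // sendMsg can now see the last message
  if (lastMsg > 0) Iterator((t.dstId, lastMsg - 1)) else Iterator.empty
}
def mergeMsg(a: Int, b: Int): Int = math.max(a, b)
// val result = graph.pregel(0)(vprog, sendMsg, mergeMsg)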
Hi,
Yes, the error still occurs when we replace the lambdas with named
functions:
(same error traces as in previous posts)
Hi again!
I am having problems when using GROUP BY on both SQLContext and
HiveContext (same problem).
My code (simplified as much as possible) can be seen here:
http://pastebin.com/33rjW67H
In short, I'm getting data from a Cassandra store with Datastax' new
driver (which works great by the
Hi,
We have a query with left joining and got this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 1.0:0 failed 4 times, most recent failure: Exception failure
in TID 5 on host ip-10-33-132-101.us-west-2.compute.internal:
Hi,
in the Spark UI, one of the metrics is shuffle spill (memory). What is it
exactly? I get spilling to disk when the shuffle data doesn't fit in memory,
but what does it mean to spill to memory?
Thanks,
- Sebastien
Hi,
Instead of spark://10.1.3.7:7077, use spark://vmsparkwin1:7077. Try this:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
spark://vmsparkwin1:7077 --executor-memory 1G --total-executor-cores 2
./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10
Thanks Regards,
Meethu M
I am running my program on a Spark cluster, but when I look at the UI while
the job is running, I see that only one worker does most of the tasks. My
cluster has one master and 4 workers, where the master is also a worker.
I want my task to complete as quickly as possible and I believe that if the
Hi,Svend
Your reply is very helpful to me. I'll keep an eye on that ticket.
And also... Cheers :)
Best Regards,
Victor
Thanks Andrew, I tried and it works.
On Fri, Jul 18, 2014 at 12:53 AM, Andrew Or and...@databricks.com wrote:
You will need to include that in the SPARK_JAVA_OPTS environment variable,
so add the following line to spark-env.sh:
export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC"
This should
Thanks Tathagata,
It would be awesome if Spark Streaming could support controlling the
receiving rate in general. I tried to explore the link you provided but could
not find any specific JIRA related to this. Do you have the JIRA number for this?
On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das
Speaking of this, I have another related question.
In my Spark Streaming job, I set up multiple consumers to receive data from
Kafka, with each worker consuming from one partition.
Initially, Spark is intelligent enough to associate each worker with one
partition, making data consumption distributed.
The default # of partitions is the # of cores, correct?
On 7/18/14, 10:53 AM, Yanbo Liang wrote:
Check how many partitions are in your program.
If there is only one, changing it to more partitions will make the execution
parallel.
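For example (the RDD name and partition count here are made up):

println(myRdd.partitions.length)   // check the current number of partitions
val wider = myRdd.repartition(16)  // reshuffle into more partitions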
2014-07-18 20:57 GMT+08:00 Madhura das.madhur...@gmail.com
Hi,
as far as I know, rebalance is triggered from Kafka in order to distribute
partitions evenly. That is, to achieve the opposite of what you are seeing.
I think it would be interesting to check the Kafka logs for the result of
the rebalance operation and why you see what you are seeing. I know
Clusters will not be fully utilized unless you set the level of parallelism
for each operation high enough. Spark automatically sets the number of
“map” tasks to run on each file according to its size. You can pass the
level of parallelism as a second argument or set the config property
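For illustration, a small sketch of both options (the paths and numbers are
made up; spark.default.parallelism is the relevant config property in the
Spark docs):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "32")  // cluster-wide default for shuffles
val sc = new SparkContext(conf)
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 32)                  // or override per operation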
Hi
I am able to save to a local file the RDDs generated by Spark SQL from data
coming in via Spark Streaming. If I set the streaming context to 10 sec, only
the data coming in that 10 sec window is processed by my SQL, and the data is
stored in the location I specified, and for the next
Hello,
Just to make sure I correctly read the docs and the forums: it's my
understanding that currently in Python with Spark 1.0.1 there is no way to
save my RDD to disk such that I can just reload it. The Hadoop RDD methods
are not yet present in Python.
Is that correct? I just want to make sure that's the case
Nick's suggestion is a good approach for your data. The item factors to
broadcast should be a few MBs. -Xiangrui
On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux decho...@gmail.com wrote:
And you might want to apply clustering before. It is likely that every user
and every item are not
You can save RDDs to text files using RDD.saveAsTextFile and load them back
using sc.textFile. But make sure the record-to-string conversion is correctly
implemented if the type is not primitive, and that you have a parser to load
them back. -Xiangrui
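For example, a sketch of the round trip, assuming (String, Int) records; the
string format and the parser must agree:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.map { case (k, v) => k + "," + v }.saveAsTextFile("/tmp/pairs")
val reloaded = sc.textFile("/tmp/pairs").map { line =>
  val Array(k, v) = line.split(",")
  (k, v.toInt)
}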
On Jul 18, 2014, at 8:39 AM, Roch Denis
+1, had to learn this the hard way when some of my objects were written
as pointers, rather than translated correctly to strings :)
On 7/18/14, 11:52 AM, Xiangrui Meng wrote:
You can save RDDs to text files using RDD.saveAsTextFile and load it back using
sc.textFile. But make sure the record
Hi Ankur,
Thanks so much! :))
Yes. Is it possible to define a custom partition strategy?
And, some other questions:
(2*4 cores machine, 24GB memory)
- if I load one edges file (5 GB) without any cores/partitions setting, what is
the default partitioning in graph construction? And how many cores
Yeah, but I would still have to do a map pass with ast.literal_eval() for
each line, correct?
Shuffle spill (memory) is the size of the deserialized form of the data in
memory at the time when we spill it, whereas shuffle spill (disk) is the
size of the serialized form of the data on disk after we spill it. This is
why the latter tends to be much smaller than the former. Note that both
Hello Experts,
I'd appreciate your input highly. Please suggest or give me a hint: what
would be the issue here?
Thanks and Regards,
Malligarjunan S.
On Thursday, 17 July 2014, 22:47, S Malligarjunan smalligarju...@yahoo.com
wrote:
Hello Experts,
I am facing performance problem when I use
Hi Tathagata,
On Thu, Jul 17, 2014 at 6:12 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
The RDD parameter in foreachRDD contains raw/transformed data from the
last batch. So when foreachRDD is called with the time parameter as 5:02:01
and the batch size is 1 minute, then the rdd will
Thanks Nick. The real-time suggestion is good; we will see if we can add it
to our deployment strategy, and you are correct that we may not need
recommendations for each user.
We will try adding more resources and the suggestion of broadcasting item
features, as currently they don't seem to be huge.
As users and
Thanks for all your helpful replies.
Best,
Francisco
Agreed, GPUs may be interesting for this kind of massively parallel linear
algebra on reasonably sized vectors.
These projects might be of interest in this regard:
https://github.com/BIDData/BIDMach
https://github.com/BIDData/BIDMat
https://github.com/dlwh/gust
Nick
On Fri, Jul 18, 2014 at 7:40
Hi Sudha,
Have you checked if the labels are being loaded correctly? It sounds like
the DT algorithm can't find any useful splits to make, so maybe it thinks
they are all the same? Some data loading functions threshold labels to
make them binary.
Hope it helps,
Joseph
On Fri, Jul 11, 2014 at
Hi all,
I'm dealing with some strange error messages that I *think* come down
to a memory issue, but I'm having a hard time pinning it down and could
use some guidance from the experts.
I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores;
one has 16GB memory, the other
Oops, wrong link!
JIRA: https://github.com/apache/spark/pull/945/files
Github PR: https://github.com/apache/spark/pull/945/files
On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote:
Thanks Tathagata,
That would be awesome if Spark streaming can support receiving rate in
Hi Matei-
Changing to coalesce(numNodes, true) still runs all partitions on a single
node, which I verified by printing the hostname before I exec the external
process.
Dang! Messed it up again!
JIRA: https://issues.apache.org/jira/browse/SPARK-1341
Github PR: https://github.com/apache/spark/pull/945/files
On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Oops, wrong link!
JIRA:
Would it be a reasonable use case of Spark Streaming to have a very large
window size (let's say on the scale of weeks)? In this particular case the
reduce function would be invertible, so that would aid in efficiency. I
assume that having a larger batch size, since the window is so large, would
also
On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI iamyifa...@gmail.com wrote:
Yes. Is it possible to define a custom partition strategy?
Yes, you just need to create a subclass of PartitionStrategy as follows:
import org.apache.spark.graphx._
object MyPartitionStrategy extends PartitionStrategy {
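  // (The digest cuts the example off here; a hedged completion, assuming a
  // simple hash-the-source-vertex scheme, might look like this.)
  override def getPartition(src: VertexId, dst: VertexId,
                            numParts: PartitionID): PartitionID =
    (math.abs(src) % numParts).toInt
}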
Sorry, I didn't read your vertex replication example carefully, so my
previous answer is wrong. Here's the correct one:
On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI iamyifa...@gmail.com wrote:
I don't understand, for instance, we have 3 edge partition tables (EA: a ->
b, a -> c; EB: a -> d, a -> e;
There is no version of Shark that works with Spark 1.0.
More details about the path forward here:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
On Jul 18, 2014 4:53 AM, Megane1994 leumenilari...@yahoo.fr wrote:
Hello,
I want to run
Sorry for the non-obvious error message. It is not valid SQL to include
attributes in the select clause unless they are also in the group by clause
or are inside of an aggregate function.
On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com
wrote:
Hi again!
I am having
Since your UDF is a black box to Hive's query optimizer, it's likely that
the optimizer must choose a less efficient join algorithm that passes all
possible matches to your function for comparison. This will happen any
time your UDF touches attributes from both sides of the join.
In general you can
Can you tell us more about your environment? Specifically, are you also
running on Mesos?
On Jul 18, 2014 12:39 AM, Victor Sheng victorsheng...@gmail.com wrote:
When I run a query on a hadoop file:
mobile.registerAsTable("mobile")
val count = sqlContext.sql("select count(1) from mobile")
res5:
See the section on advanced dependency management:
http://spark.apache.org/docs/latest/submitting-applications.html
On Jul 17, 2014 10:53 PM, linkpatrickliu linkpatrick...@live.com wrote:
Seems like the mysql connector jar is not included in the classpath.
Where can I set the jar to the
You can do insert into. As with other SQL on HDFS systems there is no
updating of data.
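For example, a hedged sketch against the Spark 1.0-era API (the table and
path names are made up):

val events = sqlContext.parquetFile("hdfs:///warehouse/events.parquet")
events.registerAsTable("events")
sqlContext.sql("INSERT INTO events SELECT * FROM staging_events")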
On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Is this what you are looking for?
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html
Hi,
This is my second week of working with Spark, pardon if this is elementary
question in spark domain.
I am looking for ways to render output of Spark Streaming.
First let me describe problem set. I am monitoring (push from devices every
minute) temperature/humidity and other environmental
I'm trying to read an Avro sequence file using the sequenceFile method on
the Spark context object, and I get a NullPointerException. If I read the
file outside of Spark using AvroSequenceFile.Reader I don't have any
problems.
Has anyone had success in doing this?
Below is what I typed and saw
Correction: I get a null pointer exception when I attempt to perform an
action like 'first'.
I think you probably want to use `AvroSequenceFileOutputFormat` with
`newAPIHadoopFile`. I'm not even sure that in Hadoop you would use
SequenceFileInputFormat to read an Avro sequence file
Thanks for responding. I tried using the newAPIHadoopFile method and got an
IOException with the message "Not a data file."
If anyone has an example of this working I'd appreciate your input or
examples.
What I entered at the repl and what I got back are below:
val myAvroSequenceFile =
If you are performing recommendations via a latent factor model then I
highly recommend you look into methods of approximate nearest neighbors.
At Spotify we batch process top N recommendations for 40M users with a
catalog of 40M items, but we avoid the naive O(n*m) process you are
describing by
Unfortunately, this is a query for which we just don't have an efficient
implementation yet. You might try switching the table order.
Here is the JIRA for doing something more efficient:
https://issues.apache.org/jira/browse/SPARK-2212
On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee
You might check out Bokeh (http://bokeh.pydata.org), which is a Python (and
other languages) system for streaming and big-data visualization targeting
the browser. I just gave a talk at SciPy 2014 where you can hear more and see
examples: https://www.youtube.com/watch?v=B9NpLOyp-dI
I also tried increasing --num-executors to numNodes * coresPerNode and using
coalesce(numNodes*10,true), and it still ran all the tasks on one node. It
seems like it is placing all the executors on one node (though not always
the same node, which indicates it is aware of more than one!). I'm using
You have to use `myBroadcastVariable.value` to access the broadcasted
value; see
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
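A minimal example (the set contents are made up):

val ids = sc.broadcast(Set(1, 2, 3))
val hits = sc.parallelize(Seq(1, 5, 2, 9))
  .filter(x => ids.value.contains(x))  // use .value, not the Broadcast object itself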
On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania
ved...@retentionscience.com wrote:
Hi All,
I am trying to broadcast a set in a
Christopher, that's really a great idea: searching in latent factor space
rather than computing each entry of the matrix. Now the complexity of the
problem has been reduced drastically from the naive O(n*m). Since our data is
not that huge, I will try exact neighborhood search, then fall back to
approximate if that
Hi Josh,
I did make that change; however, I get this error now:
568.492: [GC [PSYoungGen: 1412948K-207017K(1465088K)]
4494287K-3471149K(4960384K), 0.1280200 secs] [Times: user=0.23 sys=0.63,
real=0.13 secs]
568.642: [Full GCTraceback (most recent call last):
File "<stdin>", line 1, in <module>
I'm stumped with this one. I'm using YARN on EMR to distribute my Spark job.
While initially the job seems to start up fine, the Spark executor nodes are
having trouble pulling the jars from the location on HDFS where the master
just put them.
[hadoop@ip-172-16-2-167 ~]$
Hi Cheng Hao,
Thank you very much for your reply.
Basically, the program runs on Spark 1.0.0 and Hive 0.12.0.
Some setup of the environment was done by running SPARK_HIVE=true sbt/sbt
assembly/assembly, including the jar on all the workers, and copying
hive-site.xml to Spark's conf dir.
Andrew,
Yes, this works after cleaning up the .staging as you suggested. Thanks a
lot!
Jian
Hello all,
There is a bug in the spark-ec2 script (perhaps due to a change in the
Amazon AMI).
The --ebs-vol-size option directs the spark-ec2 script to add an EBS volume
of the specified size, and mount it at /vol for a persistent HDFS. To do
this, it uses mkfs.xfs which is not available
If you want to process data that spans across weeks, then it best to use a
dedicated data store (file system, sql / nosql database, etc.) that is
designed for long term data storage and retrieval. Spark Streaming is not
designed as a long term data store. Also it does not seem like you need low
Unfortunately, for reasons I won't go into, my options for what I can use
are limited; it was more of a curiosity to see if Spark could handle a use
case like this, since the functionality I wanted fit perfectly into the
reduceByKeyAndWindow frame of thinking. Anyway, thanks for answering.
Hi, I am getting a null pointer exception while saving the data into Hadoop.
Code as follows.
If I change the last line to
sorted_tup.take(count.toInt).foreach { case ((a, b, c), l) =>
sc.parallelize(l.toSeq).coalesce(1).saveAsTextFile(hdfsDir + a + "/" + b +
"/" + c) }
I am able to save it, but for
Thanks a lot for reporting this. I think we just missed installing xfsprogs
on the AMI. I have a fix for this at
https://github.com/mesos/spark-ec2/pull/59.
After the pull request is merged, any new clusters launched should have
mkfs.xfs
Thanks
Shivaram
On Fri, Jul 18, 2014 at 4:56 PM, Ben
I get TD's recommendation of sharing a connection among tasks. Now, is there a
good way to determine when to close connections?
Gino B.
On Jul 17, 2014, at 7:05 PM, Yan Fang yanfang...@gmail.com wrote:
Hi Sean,
Thank you. I see your point. What I was thinking is that, do computation in
Hello
What is the order in which SparkSQL deserializes parquet fields? Is it
possible to modify it?
I am using SparkSQL to query a parquet file that consists of a lot of
fields (around 30 or so). Let me call an example table MyTable and let's
suppose the name of one of its fields is position.
That's a good question. My first thought is a timeout. Timing out after tens
of seconds should be sufficient. So there could be a timer in the singleton
that runs a check every second on when the singleton was last used, and
closes the connections after a timeout. Any attempts to use the connection
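A hedged sketch of that idea (ExternalConnection is a hypothetical stand-in
for the real connection type):

object ConnectionHolder {
  private var conn: ExternalConnection = null
  private var lastUsed = 0L

  def get(): ExternalConnection = synchronized {
    if (conn == null) conn = new ExternalConnection()
    lastUsed = System.currentTimeMillis()
    conn
  }

  // Watchdog: checks every second, closes the connection after 10s idle.
  private val watchdog = new Thread(new Runnable {
    def run() {
      while (true) {
        Thread.sleep(1000)
        ConnectionHolder.synchronized {
          if (conn != null && System.currentTimeMillis() - lastUsed > 10000) {
            conn.close()
            conn = null
          }
        }
      }
    }
  })
  watchdog.setDaemon(true)
  watchdog.start()
}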
Thanks for your interest. I should point out that the numbers in the arXiv
paper are from GraphX running on top of a custom version of Spark with an
experimental in-memory shuffle prototype. As a result, if you benchmark
GraphX at the current master, it's expected that it will be 2-3x slower
than
Actually, let me clarify further. There are number of possibilities.
1. The easier, less efficient way is to create a connection object every
time you do foreachPartition (as shown in the pseudocode earlier in the
thread). For each partition, you create a connection, use it to push all
the
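In code, that first option might look like this sketch (ExternalConnection
and its send method are hypothetical):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = new ExternalConnection()  // a new connection per partition, per batch
    records.foreach(conn.send)           // push all the records
    conn.close()
  }
}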
Thanks a lot Ankur.
The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
changed a lot since then, and the way to configure and invoke Spark is
different. I can send you the correct configuration/invocation for this if
you're
We're coming off a great Seattle Spark Meetup session with Evan Chan
(@evanfchan) Interactive OLAP Queries with @ApacheSpark and #Cassandra
(http://www.slideshare.net/EvanChan2/2014-07olapcassspark) at Whitepages.
Now, we're proud to announce that our next session is Spark at eBay -