1. No.
2. The seed per partition is fixed, so it should generate
non-overlapping subsets (see the sketch below).
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
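A minimal sketch of the kFold call being discussed (Scala; the input RDD, fold
count, and seed are placeholders, not values from this thread):
import org.apache.spark.mllib.util.MLUtils
val data = sc.parallelize(1 to 100)   // any RDD works here; sc is the usual SparkContext
// kFold returns an array of (training, validation) RDD pairs; with a fixed
// seed the per-partition sampling is deterministic, so the validation folds
// should not overlap.
val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)
folds.foreach { case (training, validation) =>
  // fit on training, evaluate on validation
}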
Best,
Xiangrui
On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, all
When we use MLUtils.kfold to generate training
Maybe this version is easier to use:
plist.mapValues((v) => (if (v > 0) 1 else 0, 1)).reduceByKey((x, y) =>
(x._1 + y._1, x._2 + y._2))
It has similar behavior to combineByKey(), and will be faster than the
groupByKey() version.
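A self-contained sketch of that pattern (the sample data is made up just to
show the shapes involved; sc is the usual SparkContext):
val plist = sc.parallelize(Seq(("a", 3), ("a", -1), ("b", 5)))
// per key: (number of positive values, total number of values)
val counts = plist
  .mapValues(v => (if (v > 0) 1 else 0, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
counts.collect()   // Array((a,(1,2)), (b,(1,1)))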
On Thu, Oct 9, 2014 at 9:28 PM, HARIPRIYA AYYALASOMAYAJULA
You can convert this ReceiverInputDStream
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.ReceiverInputDStream
into PairRDDFunctions
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
and call the
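As a rough illustration of that conversion (a sketch with placeholder names,
not code from this thread), mapping a DStream to (key, value) tuples makes the
pair operations available, much like PairRDDFunctions does for RDDs:
val lines = ssc.socketTextStream("localhost", 9999)   // ssc: an existing StreamingContext
val pairs = lines.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)                 // pair operation on the DStream
counts.print()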
Hi
I have written a few extensions for Spark SQL (for version 1.1.0) and I am
trying to deploy my new jar files (one for catalyst and one for sql/core) on
EC2.
My approach was to create a new spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar
that merged the contents of the old one with the
You could be hitting this issue
https://issues.apache.org/jira/browse/SPARK-3633 (or similar). You can
try the following workarounds:
sc.set("spark.core.connection.ack.wait.timeout", "600")
sc.set("spark.akka.frameSize", "50")
Also, reduce the number of partitions; you could be hitting the kernel's
ulimit.
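For reference, the same workaround expressed through SparkConf, which is where
these properties are normally set before the context is created (a sketch, not
the poster's actual setup):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("ack-timeout-workaround")
  .set("spark.core.connection.ack.wait.timeout", "600")
  .set("spark.akka.frameSize", "50")
val sc = new SparkContext(conf)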
Can you paste your spark-env.sh file? Looks like you have it misconfigured.
Thanks
Best Regards
On Fri, Oct 10, 2014 at 1:43 AM, Morbious knowledgefromgro...@gmail.com
wrote:
Hi,
Recently I've configured Spark in a cluster with ZooKeeper.
I have 2 masters (active/standby) and 6 workers.
I've
Hello spark users and developers!
I am using HDFS + Spark SQL + Hive schema + Parquet as the storage format. I
have a lot of Parquet files - one file fits one HDFS block for one day. The
strange thing is that the first Spark SQL query is very slow.
To reproduce the situation I use only one core and I have 97 sec
You could try setting -Xcomp for executors to force JIT compilation
upfront. I don't know if it's a good idea overall but might show
whether the upfront compilation really helps. I doubt it.
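If you want to try it, a hedged sketch of one way to pass the flag to executor
JVMs via spark.executor.extraJavaOptions (whether it is worthwhile is exactly
what is doubted above):
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xcomp")   // force upfront JIT compilation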
However, isn't this almost surely due to caching somewhere, in Spark SQL
or HDFS? I really doubt HotSpot makes
Hey Sean and Spark users!
Thanks for the reply. I tried -Xcomp just now and the start time was about a
few minutes (as expected), but the first query was as slow as before:
Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 30 columns in 12897
Can you try checking whether the table is being cached? You can use the
isCached method. More details are here -
http://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/sql/SQLContext.html
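A minimal sketch of that check in the Scala API (the table name is a placeholder):
sqlContext.cacheTable("mytable")
sqlContext.isCached("mytable")   // true once the table is marked as cached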
Hi all,
I want to use two nodes for a test, one as master, the other as worker.
Can I submit the example application included in the Spark source code
tarball on the master to let it run on the worker?
What should I do?
BR,
Theo
Hi
Could it be due to GC? I read it may happen if your program starts with
a small heap. What are your -Xms and -Xmx values?
Print GC stats with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
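A hedged sketch of passing those GC-logging flags to executors through
configuration (values are illustrative only):
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")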
Guillaume
Hello spark users and developers!
I am using hdfs + spark sql + hive schema +
hi Diego,
I have the same problem.
// reduce by key in the first window
val w1 = one.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
w1.count().print()
// reduce by key in the second window based on the results of the first window
val w2 = w1.reduceByKeyAndWindow(_ + _,
Hi,
I am trying to submit a Spark job on Mesos using spark-submit from my
Mesos-Master machine.
My SPARK_HOME = /vol1/spark/spark-1.0.2-bin-hadoop2
I have uploaded the spark-1.0.2-bin-hadoop2.tgz to hdfs so that the mesos
slaves can download it to invoke the Mesos Spark backend executor.
But
This is how the spark-cluster looks like
So your driver program (example application) can be run on the master (or
anywhere which has access to the master - the cluster manager) and the workers
will execute it.
Thanks
Best Regards
On Fri, Oct 10, 2014 at 2:47 PM, Theodore Si sjyz...@gmail.com
Hi Mohammed,
Would you mind sharing the DDL of the table |x| and the complete
stack trace of the exception you got? A full Spark shell session history
would be more than helpful. PR #2084 was merged into master in August,
and the timestamp type is supported in 1.1.
I tried the following
Hi Poiuytrez, what version of Spark are you using? Exception details
like stacktrace are really needed to investigate this issue. You can
find them in the executor logs, or just browse the application
stderr/stdout link from Spark Web UI.
On 10/9/14 9:37 PM, poiuytrez wrote:
Hello,
I have a
Which version are you using? Also, |.saveAsTable()| saves the table to the
Hive metastore, so you need to make sure your Spark application points
to the same Hive metastore instance as the JDBC Thrift server. For
example, put |hive-site.xml| under |$SPARK_HOME/conf|, and run
|spark-shell| and
Sorry, but your solution doesn't work.
I can see port 7077 open and listening on my master, and connected workers,
but I don't understand why it's trying to connect to
itself ...
=> Master is running on the specific host
netstat -at | grep 7077
You will get something similar to:
tcp0 0
Thank you - I will try this. If I drop the partition count am I not more
likely to hit memory issues? Especially if the dataset is rather large?
On Oct 10, 2014 3:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You could be hitting this issue
https://issues.apache.org/jira/browse/SPARK-3633
I am using the Python API. Unfortunately, I cannot find the isCached method
equivalent in the documentation:
https://spark.apache.org/docs/1.1.0/api/python/index.html in the SQLContext
section.
Hi Cheng,
I am using Spark 1.1.0.
This is the stack trace:
14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID
2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long
cannot be cast to java.lang.Integer
For weeks, I've been using the following trick to successfully disable log4j in
the spark-shell when running a cluster on ec2 that was started by the Spark
provided ec2 scripts.
cp ./conf/log4j.properties.template ./conf/log4j.properties
I then change log4j.rootCategory=INFO to
But I cannot do this via using
./bin/run-example SparkPi 10
right?
On Fri, Oct 10, 2014 at 6:04 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
This is how the spark-cluster looks like
So your driver program (example application) can be run on the master (or
anywhere which has access to
Should I pack the example into a jar file and submit it on master?
On Fri, Oct 10, 2014 at 9:32 PM, Theodore Si sjyz...@gmail.com wrote:
But I cannot do this via using
./bin/run-example SparkPi 10
right?
On Fri, Oct 10, 2014 at 6:04 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
This
Yes, you can run it with --master=spark://your-spark-uri:7077, I believe.
Thanks
Best Regards
On Fri, Oct 10, 2014 at 7:03 PM, Theodore Si sjyz...@gmail.com wrote:
Should I pack the example into a jar file and submit it on master?
On Fri, Oct 10, 2014 at 9:32 PM, Theodore Si sjyz...@gmail.com
spark-submit --class "Classname" --master yarn-cluster
jarfile (with complete path)
This should work.
On Fri, Oct 10, 2014 at 8:36 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yes, you can run it with --master=spark://your-spark-uri:7077, I believe.
Thanks
Best Regards
On Fri, Oct 10,
Hello,
I was wondering what the Spark accumulator does under the covers. I’ve
implemented my own associative addInPlace function for the accumulator; where
is this function being run? Let’s say you call something like myRdd.map(x =>
sum += x); is “sum” being accumulated locally in any way,
Thanks, Xiangrui,
I found the reason for the overlapping training set and test set
….
Another counter-intuitive issue related to
https://github.com/apache/spark/pull/2508
Best,
--
Nan Zhu
On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:
1. No.
2. The seed per partition
If you use parallelize, the data is distributed across the available nodes,
and the sum is computed individually within each partition and later
merged. The driver manages the entire process. Is my understanding correct?
Can someone please correct me if I am wrong?
On Fri, Oct 10, 2014 at 9:37
spark-defaults.conf
spark.executor.uri
hdfs://:9000/user//spark-1.1.0-bin-hadoop2.4.tgz
From: Bijoy Deb [mailto:bijoy.comput...@gmail.com]
Sent: 10 October 2014 11:59
To: user@spark.apache.org
Subject: Spark on Mesos Issue - Do I need to install Spark on Mesos slaves
Hi,
I
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some pretty
cool news for the project, which is that we've been able to use Spark to break
MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer
nodes. There's a detailed writeup at
Awesome news Matei !
Congratulations to the databricks team and all the community members...
On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some
pretty cool news for the project, which
Brilliant stuff ! Congrats all :-)
This is indeed really heartening news !
Regards,
Mridul
On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some pretty
cool news for the project, which
Hi,
I am exploring SparkSQL 1.1.0, I have a problem on LEFT JOIN.
Here is the request:
select * from customer left join profile on customer.account_id =
profile.account_id
The two tables' schema are shown as following:
// Table: customer
root
|-- account_id: string (nullable = false)
|--
PySpark definitely works for me in the IPython notebook. A good way to debug is
to do setMaster("local") in your Python sc object and see if that works. Then,
from there, modify it to point to the real Spark server.
Also, I added a hack where I did a sys.path.insert of the path to pyspark in my
Python notebook
Hey Larry,
I have been trying to figure this out for standalone clusters as well.
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-a-Block-Manager-td12833.html
has an answer as to what the block manager is for.
From the documentation, what I understood was that if you assign X GB to each
Great! Congratulations!
--
Nan Zhu
On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
Brilliant stuff ! Congrats all :-)
This is indeed really heartening news !
Regards,
Mridul
On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com
I haven't seen this at all since switching to HttpBroadcast. It seems
TorrentBroadcast might have some issues?
On Thu, Oct 9, 2014 at 4:28 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
I don't think that I saw any other error message. This is all I saw.
I'm currently experimenting to
Wonderful !!
On 11 Oct, 2014, at 12:00 am, Nan Zhu zhunanmcg...@gmail.com wrote:
Great! Congratulations!
--
Nan Zhu
On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
Brilliant stuff ! Congrats all :-)
This is indeed really heartening news !
Regards,
Mridul
Hi,
I was facing GC overhead errors while executing an application with 570 MB of
data (with RDD replication).
In order to fix the heap errors, I repartitioned the RDD into 10 partitions:
val logData = sc.textFile("hdfs:/text_data/text data.txt").persist(StorageLevel.MEMORY_ONLY_2)
val
Un-needed checkpoints are not getting automatically deleted in my
application.
I.e. the lineage looks something like this and checkpoints simply
accumulate in a temporary directory (every lineage point, however, does zip
with a globally permanent):
PermanentRDD:Global zips with all the
Great stuff. Wonderful to see such progress in so short a time.
How about some links to code and instructions so that these benchmarks can
be reproduced?
Regards,
- Steve
From: Debasish Das debasish.da...@gmail.com
Date: Friday, October 10, 2014 at 8:17
To: Matei Zaharia
Hi Areg,
Check out
http://spark.apache.org/docs/latest/programming-guide.html#accumulators
val sum = sc.accumulator(0) // accumulator created from an initial value
in the driver
The accumulator variable is created in the driver. Tasks running on the
cluster can then add to it. However, they
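A minimal sketch of that driver/task split (the numbers are made up):
val sum = sc.accumulator(0)                       // created on the driver
sc.parallelize(1 to 100).foreach(x => sum += x)   // tasks only add to it
println(sum.value)                                // 5050, readable only on the driver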
Hi,
After updating from spark-1.0.0 to spark-1.1.0, my spark applications failed
most of the time (but not always) in yarn-cluster mode (but not in yarn-client
mode).
Here is my configuration:
* spark-1.1.0
* hadoop-2.2.0
And the hadoop.tmp.dir definition in the hadoop core-site.xml
Hi
Can you try
select birthday from customer left join profile on customer.account_id =
profile.account_id
to see if the problem remains on your entire data?
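For reference, a hedged sketch of running that probe through SQLContext (the
table and column names are the ones from this thread; everything else is assumed):
val probe = sqlContext.sql(
  "SELECT birthday FROM customer " +
  "LEFT JOIN profile ON customer.account_id = profile.account_id")
probe.take(10).foreach(println)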
Thanks,
Liquan
On Fri, Oct 10, 2014 at 8:20 AM, invkrh inv...@gmail.com wrote:
Hi,
I am exploring SparkSQL 1.1.0, I have a problem
I have actually had the same problem. spark.executor.uri on HDFS did not work
so I had to put it in a local folder
Thank you guys!
It was very helpful and now I understand it better.
On Fri, Oct 10, 2014 at 1:38 AM, Davies Liu dav...@databricks.com wrote:
Maybe this version is easier to use:
plist.mapValues((v) => (if (v > 0) 1 else 0, 1)).reduceByKey((x, y) =>
(x._1 + y._1, x._2 + y._2))
It has
Maybe; TorrentBroadcast is more complicated than HttpBroadcast. Could
you tell us
how to reproduce this issue? That would help us a lot in improving
TorrentBroadcast.
Thanks!
On Fri, Oct 10, 2014 at 8:46 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
I haven't seen this at all since switching
Hi Cheng,
Thanks for looking into this.
It looks like the bug may be in the Spark Cassandra connector code. Table x is
a table in Cassandra.
However, while trying to troubleshoot this issue, I noticed another issue. This
time I did not use Cassandra; instead created a table on the fly. I am not
How do you increase the spark block manager timeout?
https://github.com/CodingCat/spark/commit/c5cee24689ac4ad1187244e6a16537452e99e771
--
Nan Zhu
On Friday, October 10, 2014 at 4:31 PM, bhusted wrote:
How do you increase the spark block manager timeout?
I'm trying to broadcast an accumulator I generated earlier in my app. However, I
get a NullPointerException whenever I reference the value.
// The start of my accumulator generation
LookupKeyToIntMap keyToIntMapper = new LookupKeyToIntMap();
Thanks. I made the change and ran the code, but I don't get any tweets for my
handle, although I do see the tweets when I search for it on Twitter. Does
Spark allow us to get tweets from the past (say, the last 100 tweets, or
tweets that appeared in the last 10 minutes)?
thanks
Hi Folks,
I am seeing some strange behavior when using the Spark Kafka connector in
Spark streaming.
I have a Kafka topic which has 8 partitions. I have a Kafka producer that
pumps some messages into this topic.
On the consumer side I have a Spark Streaming application that has 8
executors
Would you mind sharing the code leading to your createStream? Are you also
setting group.id?
Thanks,
Sean
On Oct 10, 2014, at 4:31 PM, Abraham Jacob abe.jac...@gmail.com wrote:
Hi Folks,
I am seeing some strange behavior when using the Spark Kafka connector in
Spark streaming.
I
Sure... I do set the group.id for all the consumers to be the same. Here is
the code ---
SparkConf sparkConf = new
SparkConf().setMaster("yarn-cluster").setAppName("Streaming WordCount");
sparkConf.set("spark.shuffle.manager", "SORT");
sparkConf.set("spark.streaming.unpersist", "true");
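For comparison, a hedged Scala sketch of a createStream call with an explicit
group.id (the ZooKeeper quorum, topic, parallelism, and storage level are
placeholders, not the poster's values; ssc is an existing StreamingContext):
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaParams = Map(
  "zookeeper.connect" -> "zkhost:2181",
  "group.id" -> "wordcount-group")
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)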
How long do you let the consumers run for? Is it less than 60 seconds by
chance? auto.commit.interval.ms defaults to 60000 ms (60 seconds). If so, that
may explain why you are seeing that behavior.
Cheers,
Sean
On Oct 10, 2014, at 4:47 PM, Abraham Jacob
Hi folks,
I have just upgraded to Spark 1.1.0 and tried some examples like:
./run-example SparkPageRank pagerank_data.txt 5
It turns out that Spark keeps trying to connect to my name node and read the
file from HDFS rather than the local FS:
Client: Retrying connect to server:
We have our own customization on top of the Parquet SerDe that we've been using
for Hive. In order to make it work with Spark SQL, we need to be able to
rebuild Spark with this. It'll be much easier to rebuild Spark with this
patch once I can find the sources for org.spark-project.hive. Not sure
Thank you! I was looking for a config variable to that end, but I was
looking in Spark 1.0.2 documentation, since that was the version I had
the problem with. Is this behavior documented in 1.0.2's documentation?
Evan
On 10/09/2014 04:12 PM, Davies Liu wrote:
When you call rdd.take() or
Hi
I am running Spark on an EC2 cluster. I need to update Python to 2.7. I have
been following the directions at
http://nbviewer.ipython.org/gist/JoshRosen/6856670
https://issues.apache.org/jira/browse/SPARK-922
I noticed that when I start a shell using pyspark, I correctly got
python2.7, how
Hi Abraham,
You are correct: the “auto.offset.reset“ behavior in KafkaReceiver is different
from the original Kafka semantics. If you set this configuration, KafkaReceiver
will clean the related ZK metadata immediately, but for Kafka this configuration
is just a hint which will be effective only when the offset is
Thanks Jerry, So, from what I can understand from the code, if I leave out
auto.offset.reset, it should theoretically read from the last commit
point... Correct?
-abe
On Fri, Oct 10, 2014 at 5:40 PM, Shao, Saisai saisai.s...@intel.com wrote:
Hi Abraham,
You are correct, the
Please add the Maryland Spark meetup to the Spark website
http://www.meetup.com/Apache-Spark-Maryland/
Thanks
Brian Husted
This JIRA and comment sum up the issue:
https://issues.apache.org/jira/browse/SPARK-2492?focusedCommentId=14069708page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14069708
Basically the offset param was renamed and had slightly different semantics
between kafka 0.7
Hi Abe,
You can see the details in KafkaInputDStream.scala; here is the snippet:
// When auto.offset.reset is defined, it is our responsibility to try and
whack the
// consumer group zk node.
if (kafkaParams.contains("auto.offset.reset")) {
Ah I see... much clearer now...
Because auto.offset.reset will trigger KafkaReceiver to delete the ZK
metadata; when control passes over to the Kafka consumer API, it will see
that there is no offset available for the partition. This then will trigger
the smallest or largest logic to execute in
Hi,
Let's say that I managed to port Spark from TCP/IP to RDMA.
What tool or benchmark can I use to test the performance improvement?
BR,
Theo
Thanks @category_theory, the post was of great help!!
I had to learn a few things before I could understand it completely.
However, I am facing the issue of partitioning the data (using partitionBy)
without providing a hardcoded value for the number of partitions. The
partitions need to be driven by
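One hedged possibility is deriving the count from the context instead of
hardcoding it (a sketch; pairRdd and the use of sc.defaultParallelism are
assumptions, not the poster's code):
import org.apache.spark.HashPartitioner
val numPartitions = sc.defaultParallelism        // derived, not hardcoded
val partitioned = pairRdd.partitionBy(new HashPartitioner(numPartitions))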
Added you, thanks! (You may have to shift-refresh the page to see it updated).
Matei
On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com
wrote:
Please add the Boulder-Denver Spark meetup group to the list on the website.
Hi Akhil - I tried your suggestions and tried varying my partition sizes.
Reducing the number of partitions led to memory errors (presumably - I saw
IOExceptions much sooner).
With the settings you provided, the program ran for longer but ultimately
crashed in the same way. I would like to
Hi Matei - I read your post with great interest. Could you possibly comment
in more depth on some of the issues you guys saw when scaling up Spark and
how you resolved them? I am interested specifically in Spark-related
problems. I'm working on scaling up Spark to very large datasets and have
been
I would also be interested in knowing more about this. I have used the
Cloudera Manager and the Spark resource interface (clientnode:4040) but
would love to know if there are other tools out there - either for post
processing or better observation during execution.
On Oct 9, 2014 4:50 PM, Rohit
Hmm, there is a “T” in the timestamp string, which makes it an invalid
timestamp string representation. Internally Spark SQL uses
|java.sql.Timestamp.valueOf| to cast a string to a timestamp.
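A quick illustration of that point (the values are made up):
java.sql.Timestamp.valueOf("2014-10-10 12:00:00")    // parses fine
java.sql.Timestamp.valueOf("2014-10-10T12:00:00")    // throws IllegalArgumentException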
On 10/11/14 2:08 AM, Mohammed Guller wrote:
scala> rdd.registerTempTable("x")
scala> val sRdd
Hi all,
I'm playing with Spark currently as a possible solution at work, and I've
been recently working out a rough correlation between our input data size
and RAM needed to cache an RDD that will be used multiple times in a job.
As part of this I've been trialling different methods of
This is a kind of implementation detail, so it's not documented :-(
If you think this is a blocker for you, you could create a JIRA; maybe
it could be fixed in 1.0.3+.
Davies
On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
Thank you! I was looking for a config variable to