Yeah it's not there. I imagine it was simply never added, and that
there's not a good reason it couldn't be.
On Thu, Oct 9, 2014 at 4:53 AM, SA sadhu.a...@gmail.com wrote:
Hi,
I am looking at the documentation for the Java API for Streams. The Scala
library has an option to save the file locally, but
That's a cryptic way to say there should be a JIRA for it :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Oct 9, 2014 at 11:46 AM, Sean Owen so...@cloudera.com wrote:
Yeah it's not there. I imagine it was simply
If you are looking to eliminate duplicate rows (or something similar), you can
define a key from the data and do a reduceByKey on that key.
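For example, a minimal sketch of that approach (assuming, hypothetically, that
the data is CSV lines keyed by their first field, and that sc is a SparkContext
as in spark-shell):
val lines = sc.textFile("hdfs:///data/input.csv") // placeholder path
val deduped = lines
  .map(line => (line.split(",")(0), line)) // key derived from the data
  .reduceByKey((a, b) => a)                // keep one row per key
  .values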
Thanks
Best Regards
On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
What is your data like? Are you looking at exact matching or
Yes, I think this is another operation that is not deterministic even for
the same RDD. If a partition is lost and recalculated the ordering can
be different in the partition. Sorting the RDD makes the ordering
deterministic.
On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
coded...@cs.stanford.edu
I think the question is about copying the argument. If it's an
immutable value like String, yes just return the first argument and
ignore the second. If you're dealing with a notoriously mutable value
like a Hadoop Writable, you need to copy the value you return.
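A small sketch of the distinction (the pair RDDs here are hypothetical;
new Text(t) is Hadoop's copy constructor):
import org.apache.hadoop.io.Text

// Immutable values (e.g. String): returning the first argument is safe.
val firstPerKey = stringPairs.reduceByKey((a, b) => a)

// Mutable Writables are reused by the input format, so copy what you keep.
val firstTextPerKey = textPairs.reduceByKey((a, b) => new Text(a))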
This works fine although you will
Are there a large number of non-deterministic lineage operators?
This seems like a pretty big caveat, particularly for casual programmers
who expect consistent semantics between Spark and Scala.
E.g., making sure that there's no randomness whatsoever in RDD
transformations seems critical.
There is no --hadoop-minor-version, but you can try
--hadoop-major-version=2.0.2
Does it break anything if you use the 2.0.0 version of Hadoop?
Thanks
Best Regards
On Wed, Oct 8, 2014 at 8:44 PM, st553 sthompson...@gmail.com wrote:
Hi,
Were you able to figure out how to choose a specific
After a bit of research, I figured out that one of the workers was hung
on GC cleanup and the connection usually times out since the default is
60 seconds, so I set it to a higher number and it eliminated this issue. You
may want to try this:
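The post doesn't name the setting that was raised; one plausible candidate
with a 60-second default in Spark 1.1 is spark.core.connection.ack.wait.timeout,
e.g.:
import org.apache.spark.SparkConf

// Hedged sketch: the exact property the poster changed is not in the thread.
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")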
You cannot give a file to spark-shell. You can open the spark-shell
(./spark-shell --master=local[2]) and paste the code there and it will run,
or else you have to create a jar and submit it through spark-submit or run
it independently.
Thanks
Best Regards
On Wed, Oct 8, 2014 at 11:07 AM, Sean
Can you try decreasing the level of parallelism that you are giving for
those functions? I had this issue when I gave a value of 500, and it was gone
when I dropped it to 200.
Thanks
Best Regards
On Wed, Oct 8, 2014 at 9:28 AM, Andrew Ash and...@andrewash.com wrote:
Hi Meethu,
I believe you may
You must have those hostnames in your /etc/hosts file; if you are not
able to access it using the hostnames, then you won't be able to access it with
the IP address either, I believe.
What are you trying to do here? Running your Eclipse locally and connecting
to your EC2 cluster?
Thanks
Best
I've tried to add / at the end of the path, but the result was exactly the
same. I also guess that there will be some problem at the level of Hadoop-S3
communication. Do you know if there is some possibility of how to run scripts
from Spark on, for example, a different Hadoop version from the
This issue is related to your cluster. Can you paste your spark-env.sh?
Also you can try starting a spark-shell like $SPARK_HOME/bin/spark-shell
--master=spark://ip-172-31-24-183.ec2.internal:7077 and try the same code
in it.
Is this the same URI spark://ip-172-31-24-183.ec2.internal:7077
Nice to hear that your experiment is consistent with my assumption. The
current L1/L2 will penalize the intercept as well, which is not ideal.
I'm working on GLMNET in Spark using OWLQN, and I can get exactly the
same solution as R, but with scalability in # of rows and columns. Stay
tuned!
Sincerely,
Hi,
I'm trying out the DIMSUM item similarity from github master commit
69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is:
Num items : 8860
Number of users : 5138702
Implicit 1.0 values
Running item similarity with threshold: 0.5
I have a 2 slave spark cluster on EC2 with m3.xlarge (13G
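For reference, a minimal sketch of invoking DIMSUM at that commit (rows is a
hypothetical RDD[Vector], one row per user over the 8860 items):
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val similarities = mat.columnSimilarities(0.5) // DIMSUM with threshold 0.5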
Hi Saisai, thank you for your help, all is working ok now.
Cheers!
2014-10-09 2:49 GMT+02:00 Shao, Saisai saisai.s...@intel.com:
Hi, I think you have to change the code to specify the type
info, like this:
val kafkaStream: ReceiverInputDStream[(String, String)] =
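The quoted snippet is cut off; a sketch of the fully typed call (ssc,
kafkaParams and topics are placeholders):
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)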
I faced a similar issue with the wholeTextFiles function due to version
compatibility. Spark 1.0 with Hadoop 2.4.1 worked. Did you try other
function such as textFile to check if the issue is specific to
wholeTextFiles?
Spark needs to be re-compiled for different hadoop versions. However, you
can keep
Yes. I was using a String array as the arguments in the reduceByKey. I think a String
array is actually immutable, and simply returning the first argument without
cloning should work. I will look into mapPartitions as we can have up to
40% duplicates. Will follow up on this if necessary. Thanks very
Hi,
I am parsing a CSV file in Spark using the map function. One of the lines
of the CSV file makes a task fail (and then the whole job fails). Is there a way
to do some debugging to find the line that fails?
Best regards,
poiuytrez
+1
Eric Friedman
On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
Are there a large number of non-deterministic lineage operators?
This seems like a pretty big caveat, particularly for casual programmers who
expect consistent semantics between Spark and
Hello,
I have a weird issue, this request works fine:
sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount =
200").count()
However, when I cache the table before making the request:
sqlContext.cacheTable("transactions")
sqlContext.sql("SELECT customer_id FROM transactions WHERE
Have a try/catch inside the map. See the following example:
val csvRDD = myRDD.map(x => {
  var index: String = null
  try {
    index = x.toString.split(",")(0)
  } catch {
    case e: Exception => println("Exception!! = " + e)
  }
  (index, x)
})
Thanks
Best Regards
On Thu, Oct 9, 2014
$MASTER is 'yarn-cluster' in spark-env.sh
spark-submit --driver-memory 12424m --class org.apache.spark.examples.SparkPi
/usr/lib/spark-yarn/lib/spark-examples*.jar 1000
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0006fd28,
4342677504, 0) failed; error='Cannot allocate
Arrays are not immutable and do not have the equals semantics you want when
using them as a key. Use a Scala immutable List.
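A quick illustration of the difference:
Array(1, 2) == Array(1, 2) // false: arrays compare by reference
List(1, 2) == List(1, 2)   // true: immutable lists compare by value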
On Oct 9, 2014 12:32 PM, Ge, Yao (Y.) y...@ford.com wrote:
Yes. I was using a String array as the arguments in the reduceByKey. I think
a String array is actually immutable and
On Oct 9, 2014 10:18 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have any issues.
My sequence of operations is
Thanks for the tip!
Thank you Akhil, will try this out.
We are able to access the machines using the public IP and even the private
IP, as they are on our subnet.
Thanks
Ankur
On Oct 9, 2014 12:41 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You must have those hostnames in your /etc/hosts file; if you are
Hey,
Regarding python libs, I'd say it's not supported out of the box; however,
it must be quite easy to generate plots using JFreeChart and automatically
add 'em to the DOM.
Nevertheless, I added extensible support for javascript manipulation of
results; using that one it's rather easy to plot
To add a bit on this one, if you look at RDD.scala in Spark code, you'll see
that both parent and firstParent methods are protected[spark].
I guess, for good reasons that I must admit I don't completely understand,
you are not supposed to explore an RDD's lineage programmatically...
I had a
Dear all,
I have a spark job with the following configuration
val conf = new SparkConf()
  .setAppName("My Job")
  .set("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "value.serializer.Registrator")
  .setMaster("local[4]")
Please re-try with --driver-memory 10g. The default is 256m. -Xiangrui
On Thu, Oct 9, 2014 at 2:33 AM, Clive Cox clive@rummble.com wrote:
Hi,
I'm trying out the DIMSUM item similarity from github master commit
69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is:
Num items : 8860
We have a Hive deployment on which we tried running spark-sql. When we try
to do describe table_name for some of the tables, spark-sql fails with
this:
while it works for some of the other tables. Confused and not sure what's
happening here. The same describe command works in Hive. What's
You have a typo in your code at var acc:, and the map from opPart1
to opPart2 looks like a no-op, but those aren't the problem, I think.
It sounds like you intend the first element of each pair to be a count
of nonzero values, but you initialize the first element of the pair to
v, not 1, in v =
If you just want the ratio of positive to all values per key (if I'm
reading right) this works:
val reduced = input.groupByKey().map(grp =>
  grp._2.filter(v => v > 0).size.toFloat / grp._2.size)
reduced.foreach(println)
I don't think you need reduceByKey or combineByKey, as you're not doing
anything where the
When using the activeSetOpt in GraphImpl.mapReduceTriplets(), can we expect
performance that is only proportional to the size of the active set and
independent of the size of the original data set? Or is there still a fixed
overhead that depends on the size of the original data set?
Thank you!
Another workaround would be to add the hostnames with IP addresses to every
machine's /etc/hosts file.
Thanks
Best Regards
On Thu, Oct 9, 2014 at 8:49 PM, Ankur Srivastava ankur.srivast...@gmail.com
wrote:
Thank you Akhil will try this out.
We are able to access the machines using the public
Hello Sean,
Thank you, but changing from v to 1 doesn't help me either.
I am trying to count the number of non-zero values using the first
accumulator.
val newlist = List(("LAX",6), ("LAX",0), ("LAX",7), ("SFO",0), ("SFO",0),
("SFO",9))
val plist = sc.parallelize(newlist)
val part1 = plist.combineByKey(
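(The snippet is cut off here; a hedged completion, folding in Sean's
correction from later in the thread, where the combiner starts at (1,1) or
(0,1) depending on whether v > 0:)
val part1 = plist.combineByKey(
  (v: Int) => (if (v > 0) 1 else 0, 1),
  (acc: (Int, Int), v: Int) => (if (v > 0) acc._1 + 1 else acc._1, acc._2 + 1),
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))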
Resolution:
After realizing that the SerDe (OpenCSV) was causing all the fields to be
defined as String type, I modified the Hive load statement to use the
default serializer. I was able to modify the CSV input file to use a
different delimiter. Although this is a workaround, I am able to proceed
Hi, all
When we use MLUtils.kFold to generate training and validation sets for cross
validation,
we found that there is an overlapping part in the two sets….
From the code, it samples twice from the same dataset
@Experimental
def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed:
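For context, a minimal usage sketch (data is a hypothetical RDD) that also
checks the overlap described above:
import org.apache.spark.mllib.util.MLUtils

// Each element is a (training, validation) pair.
val folds = MLUtils.kFold(data, 10, 42)
val overlaps = folds.map { case (training, validation) =>
  training.intersection(validation).count() // non-zero counts confirm the overlap
}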
Hi Xiangrui,
I am trying to implement the tfidf as per the instruction you sent in
your response to Jatin.
I am getting an error in the idf step. Here are my steps that run until the last
line, where the compile
fails.
val labeledDocs = sc.textFile("title_subcategory")
val stopwords =
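The snippet above is cut off; for reference, a minimal TF-IDF sketch against
the MLlib 1.1 API (the whitespace tokenization is a placeholder):
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val docs: RDD[Seq[String]] = labeledDocs.map(_.split(" ").toSeq)
val tf: RDD[Vector] = new HashingTF().transform(docs)
tf.cache() // IDF makes two passes: fit, then transform
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)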
In fact, with --driver-memory 2G I can get it working.
On Thu, Oct 9, 2014 at 6:20 PM, Xiangrui Meng men...@gmail.com wrote:
Please use --driver-memory 2g instead of --conf
spark.driver.memory=2g. I'm not sure whether this is a bug. -Xiangrui
On Thu, Oct 9, 2014 at 9:00 AM, Jaonary Rabarisoa
I'm using KafkaUtils.createStream for the input stream to pull messages from
Kafka, which seems to return a ReceiverInputDStream. I do not see
saveAsNewAPIHadoopFile available on ReceiverInputDStream and obviously run
into this error:
saveAsNewAPIHadoopFile is not a member of
Hi,
Recently I've configured Spark in a cluster with ZooKeeper.
I have 2 masters (active/standby) and 6 workers.
I've begun my installation with samples from the examples directory.
Everything worked fine when I only used memory.
When I used word count example I got messages like the ones below:
I think you have not imported
org.apache.spark.streaming.StreamingContext._ ? This gets you the
implicits that provide these methods.
On Thu, Oct 9, 2014 at 8:40 PM, bdev buntu...@gmail.com wrote:
I'm using KafkaUtils.createStream for the input stream to pull messages from
kafka which seems to
I have a SchemaRDD that I want to convert to an RDD that contains Strings. How
do I convert the Row inside the SchemaRDD to a String?
Hello all,
I wrote a blog post about the issue I reported before:
http://metabroadcast.com/blog/design-your-spark-streaming-cluster-carefully
Can I ask for some feedback from those who are already using Spark Streaming in
production? How do you deal with fault tolerance and scalability?
Thanks a lot for
Hello Folks:
I'm running a Spark job on YARN. After the execution, I would expect the
Spark job to clean up the staging area, but it seems every run creates a new
staging directory. Is there a way to force the Spark job to clean up after itself?
Thanks,
Rohit
Hello Folks:
What're some best practices to debug Spark in cluster mode?
Thanks,
Rohit
Hi,
I have set up Spark 1.0.2 on the cluster using standalone mode, and the input is
managed by HDFS. One node of the cluster has an Intel Xeon Phi 5110P coprocessor.
Is there any possibility that Spark could be aware of the existence of the Phi and
run jobs on the Xeon Phi, or recognize the Phi as an
Hi Lang,
What special features of the Xeon Phi do you want Spark to take advantage
of?
On Thu, Oct 9, 2014 at 4:50 PM, Lang Yu lysubscr...@gmail.com wrote:
Hi,
I have set up Spark 1.0.2 on the cluster using standalone mode and the
input is managed by HDFS. One node of the cluster has Intel
I filed https://issues.apache.org/jira/browse/SPARK-3884 to address this.
-Sandy
On Thu, Oct 9, 2014 at 7:05 AM, Greg Hill greg.h...@rackspace.com wrote:
$MASTER is 'yarn-cluster' in spark-env.sh
spark-submit --driver-memory 12424m --class
org.apache.spark.examples.SparkPi
Hello Spark folks,
I am doing some simple parsing of a CSV input file, and the input file is very
large (~1GB). It seems I have a memory leak here and I am destroying my
server. After using jmap to generate a Java heap dump and using the Eclipse
Memory Analyzer, I basically learned that when I read
There is a new API called repartitionAndSortWithinPartitions() in
master; it may help in this case (see the sketch below).
You would then do the `groupBy()` yourself.
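A sketch of the call (pairs is a hypothetical RDD[(K, V)] with an ordered
key type):
import org.apache.spark.HashPartitioner

// One shuffle that also sorts by key within each partition, so each
// partition can then be walked group by group without collecting a
// whole group in memory.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(100))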
On Wed, Oct 8, 2014 at 4:03 PM, chinchu chinchu@gmail.com wrote:
Sean,
I am having a similar issue, but I have a lot of data for a group I
Hi,
Is there a good way to materialize derivative RDDs from, say, a HadoopRDD
while reading in the data only once? One way to do so would be to cache
the HadoopRDD and then create derivative RDDs, but that would require
enough RAM to cache the HadoopRDD, which is not an option in my case.
Thanks,
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString).
Or if you want to get a particular field of the row, you can do rdd.map(row =>
row(3).toString).
Matei
On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
I have a SchemaRDD that I want to
When you call rdd.take() or rdd.first(), it may[1] execute the job
locally (in the driver);
otherwise, all the jobs are executed in the cluster.
There is a config called `spark.localExecution.enabled` (since 1.1+) to
change this;
it's not enabled by default, so all the functions will be executed in
Did some digging in the documentation. Looks like IDFModel.transform only
accepts an RDD as input,
and not individual elements. Is this a bug? I am saying this because
HashingTF.transform accepts both an RDD as well as individual vector elements
as its input.
From your post replying to Jatin, looks like
This exception should be caused by another one; could you paste all of
them here?
Also, it would be great if you could provide a script to reproduce this problem.
Thanks!
On Fri, Sep 26, 2014 at 6:11 AM, Arun Ahuja aahuj...@gmail.com wrote:
Has anyone else seen this error in task
Could you provide a script to reproduce this problem?
Thanks!
On Wed, Oct 8, 2014 at 9:13 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
This is also happening to me on a regular basis, when the job is large with
relatively large serialized objects used in each RDD lineage. A bad thing
Thanks Davies. I'll try it when it gets released (I am on 1.1.0
currently). For now I am using a custom partitioner with the ShuffleRDD()
to keep the same groups together, so I don't have to shuffle all data to a
single partition.
On Thu, Oct 9, 2014 at 2:34 PM, Davies Liu dav...@databricks.com
I don't think that I saw any other error message. This is all I saw.
I'm currently experimenting to see if this can be alleviated by using
HttpBroadcastFactory instead of TorrentBroadcast. So far, with
HttpBroadcast, I haven't seen this recurring as of yet. I'll keep you
posted.
On Thu, Oct 9,
Hi Folks,
We have a Spark job that is occasionally running out of memory and hanging
(I believe in GC). This is its own issue we're debugging, but in the
meantime, there's another unfortunate side effect. When the job is killed
(most often because of GC errors), each worker attempts to kill
I have figured out why I am getting this error:
We have a lot of data in Kafka, and the DStream from Kafka used
MEMORY_ONLY_SER,
so once memory was low, Spark started to discard data that was needed
later ...
So once I changed to MEMORY_AND_DISK_SER, the error was gone.
Tian
Hi,
I am using Spark 1.1.0. Is there a way to get the complete tweets
corresponding to a handle (e.g. @Delta)? I tried using the following
example that extracts just the hashtags, and replaced the # with @ as
follows. I need the complete tweet and not just the tags.
// val hashTags =
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have any issues.
My sequence of operations is as follows – unless otherwise specified, I am
not caching:
Map a
Hi,
I have a setting where data arrives in Kafka and is stored to HDFS from
there (maybe using Camus or Flume). I want to write a Spark Streaming app
where
- first all files in that HDFS directory are processed,
- and then the stream from Kafka is processed, starting
with the first item
I am writing a Spark job to persist data using HiveContext so that it can
be accessed via the JDBC Thrift server. Although my code doesn't throw an
error, I am unable to see my persisted data when I query from the Thrift
server.
I tried three different ways to get this to work:
1)
val
Hi all,
I'm confused about Executor and BlockManager; why do they have different
memory sizes?
14/10/10 08:50:02 INFO AppClient$ClientActor: Executor added:
app-20141010085001-/2 on worker-20141010004933-brick6-35657
(brick6:35657) with 6 cores
14/10/10 08:50:02 INFO SparkDeploySchedulerBackend:
Actually, it looks like even when the job shuts down cleanly, there can be
a few minutes of overlap between the time the next job launches and the
time the first job actually terminates its process. Here are some relevant lines
from my log:
14/10/09 20:49:20 INFO Worker: Asked to kill executor
Please file a JIRA: https://issues.apache.org/jira/browse/SPARK/
On Thu, Oct 9, 2014 at 6:48 PM, Anand Mohan chinn...@gmail.com wrote:
Hi,
I just noticed the
Hi Pierre,
I'm setting the parquet (and hdfs) block size as follows:
val ONE_GB = 1024 * 1024 * 1024
sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)
Here, sc is a reference to the spark context. I've tested this and it
Filed https://issues.apache.org/jira/browse/SPARK-3891
Thanks,
Anand Mohan
On Thu, Oct 9, 2014 at 7:13 PM, Michael Armbrust mich...@databricks.com
wrote:
Please file a JIRA: https://issues.apache.org/jira/browse/SPARK/
Hi all,
What tools should I use to benchmark Spark applications?
BR,
Theo
How can I get figures like those in the Evaluation part of the following
paper?
http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
On 10/10/2014 10:35 AM, Theodore Si wrote:
Hi all,
What tools should I use to benchmark Spark applications?
BR,
Theo
You can try https://github.com/databricks/spark-perf
Thanks Sean, but I'm importing
org.apache.spark.streaming.StreamingContext._
Here are the spark imports:
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
val stream
What can I get from it?
Can you show me some results please?
On 10/10/2014 10:46 AM, 牛兆捷 wrote:
You can try https://github.com/databricks/spark-perf
Sean,
Thank you. It works. But I am still confused about the function. Can you
kindly throw some light on it?
I was going through the example mentioned in
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
Is there any better source through which I can learn
Although caching is synonymous with persisting in memory, you can also
just persist the result (partially) on disk. At least then you would use as
much RAM as you can.
Obviously that requires re-reading the RDD (partially) from HDFS, and
the point is avoiding reading data from HDFS several times. But
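A sketch of that approach (the path is a placeholder):
import org.apache.spark.storage.StorageLevel

// Keep what fits in RAM and spill the rest to local disk; the
// derivative RDDs then avoid re-reading the source from HDFS.
val source = sc.textFile("hdfs:///data/input")
source.persist(StorageLevel.MEMORY_AND_DISK)
val derived1 = source.filter(_.nonEmpty)
val derived2 = source.map(_.length)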
Your RDD does not contain pairs, since you .map(_._2) (BTW that can
just be .values). "Hadoop files" means SequenceFiles, and those
store key-value pairs. That's why the method only appears for
RDD[(K,V)].
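A hedged sketch of keeping the pairs so the pair-DStream save methods apply
(the output path and the Text conversion are assumptions, not from the thread):
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits

kafkaStream.map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsNewAPIHadoopFiles("hdfs:///out/events", "txt",
    classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])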
On Fri, Oct 10, 2014 at 3:50 AM, Buntu Dev buntu...@gmail.com wrote:
Thanks Sean, but I'm
I'm getting a DFS closed channel exception every now and then when I run
a checkpoint. I do checkpointing every 15 minutes or so. This usually happens
after running the job for 1~2 hours. Anyone seen this before?
Job aborted due to stage failure: Task 6 in stage 70.0 failed 4 times,
most recent
It's the exact same reason you wrote:
(acc: (Int, Int), v) => (if (v > 0) acc._1 + 1 else acc._1, acc._2 + 1),
right? The first function establishes an initial value for a count.
The value is either (0,1) or (1,1) depending on whether the value is 0
or not.
You're otherwise using the method just
I need to implement the following logic in a Spark Streaming app:
for the incoming DStream, do some transformation, and invoke
updateStateByKey to
update the state object for each key (marking data entries that are updated as
dirty for the next
step), then let the state objects produce event(s) (based on
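A minimal sketch of that pattern, with a hypothetical State type carrying the
dirty flag (keyed is a placeholder DStream[(String, Int)]; updateStateByKey
also requires a checkpoint directory to be set):
case class State(value: Int, dirty: Boolean) // hypothetical state object

val updated = keyed.updateStateByKey[State] { (values: Seq[Int], prev: Option[State]) =>
  if (values.isEmpty) prev.map(_.copy(dirty = false)) // untouched this batch
  else Some(State(values.sum + prev.map(_.value).getOrElse(0), dirty = true))
}
val events = updated.filter { case (_, s) => s.dirty } // emit events for dirty keys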
I doubt Spark has the ability to arrange the order of task
execution.
You could try these approaches:
1. Write your own partitioner to group your data.
2. Sort elements within partitions, e.g. with a row index.
3. Control the order of the outcome obtained from Spark in your
application.
xj
Since an RDD doesn't have any ordering guarantee to begin with, I
don't think there is any guarantee about the order in which data is
encountered. It can even change when the same RDD is reevaluated.
As you say, your scenario 1 is about the best you can do. You can
achieve this if you can define