Sorry for the delayed response. Please find my application attached.
On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
What is the application that you are submitting? Looks like you might have
invoked fs inside the app and then closed it within it.
Thanks
Best
Your code seems to have a lot of threads, and I think you might be invoking
sc.stop() before those threads finish.
Thanks
Best Regards
On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
What is the application that you are submitting? Looks like you might have
But somehow, if I run this application a second time, I find that
the application gets executed and the results come out, despite the
same errors in the logs.
On Tue, Dec 2, 2014 at 2:08 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Your code seems to have a lot of threads and i think
It could be because those threads are finishing quickly.
Thanks
Best Regards
On Tue, Dec 2, 2014 at 2:19 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
But, somehow, if I run this application for the second time, I find that
the application gets executed and the results are out
Does the SparkContext shut itself down by default even if I don't mention it
explicitly in my code? Because I ran the application without
sc.context(), and I still get the file system closed error along with the correct
output.
On Tue, Dec 2, 2014 at 2:20 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It
Point 4 looks weird to me; I mean, if you intend to have such a workflow
run in a single session (maybe consider a sessionless architecture).
Is such a process run for each user? If that's the case, maybe finding a way to do
it for all users at once would be better (more data but less scheduling).
For the micro
Hi,
I'd like to perform an operation on an RDD that ONLY changes the value of
some items, without making a full copy or doing a full scan of the data.
This is useful when I need to handle a large RDD and each time only need
to change a small fraction of the data, keeping the other data
To be precise, I want the workflow to be associated with a user, but it doesn’t
need to be run as part of, or depend on, a session. I can’t run scheduled jobs,
because a user can potentially upload hundreds of files which trigger a long-running
batch import / update process, but he could also make a
You might also want to check out the Spark JobServer; it could help you at some
point.
On Tue Dec 02 2014 at 12:29:01 PM Stadin, Benjamin
benjamin.sta...@heidelberg-mobil.com wrote:
To be precise I want the workflow to be associated to a user, but it
doesn’t need to be run as part of or depend on
RDDs are immutable, so if you want to change the value of an RDD then you
have to create another RDD from it by applying some transformation.
Not sure if this is what you are looking for:
val rdd = sc.parallelize(Range(0, 100))
val rdd2 = rdd.map(x => {
  println("Value : " + x)
  x
})
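For the original question of touching only a few items, here is a minimal sketch of one common pattern: broadcast a small map of updated values and apply it in a map, so unchanged items are passed through as-is. The updates map and its contents are hypothetical:

// Hypothetical: only keys present in `updates` get new values; others pass through.
val updates = sc.broadcast(Map(3 -> 300, 7 -> 700))
val patched = rdd.map(x => updates.value.getOrElse(x, x))

This still builds a new RDD and scans every element once (as noted later in the thread, that is hard to avoid), but most elements are simply passed through unchanged.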
You cannot modify an RDD in mapPartitions because RDDs are immutable.
Once you apply a transformation to an RDD, it produces a new RDD.
If you just want to modify a fraction of the total RDD, try collecting
the new value list to the driver or using a broadcast variable after each
iteration, not
Although it feels like you are copying an RDD when you map it, it is not
necessarily literally being copied. Your map function may pass most
objects through unchanged, so there may not be as much overhead as you think.
I don't think you can avoid a scan of the data unless you can somehow know
Any ideas? Anyone got the same error?
On Mon, Dec 1, 2014 at 2:37 PM, Alexey Romanchuk alexey.romanc...@gmail.com
wrote:
Hello spark users!
I found lots of strange messages in the driver log. Here they are:
2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
ERROR
Hi folks,
We have written a Spark job that scans multiple HDFS directories and
performs transformations on them.
For now, this is done with a simple for loop that starts one task at
each iteration. It looks like:
dirs.foreach { case (src, dest) => sc.textFile(src).process.saveAsFile(dest) }
Hi,
I wanted some clarity on the functioning of the filter function of RDD.
1) Does the filter function scan every element saved in the RDD? If my RDD
represents 10 million rows, and I want to work on only 1000 of them, is
there an efficient way of filtering the subset without having to scan every
Dear Spark community,
Does the PySpark saveAsSequenceFile(folder) method have the ability to append
the new sequence file to another one, or to overwrite an existing
sequence file? If the folder already exists, I get an error message...
Thank You!
Csaba
Hi
I'm new to Spark Streaming.
I'm currently writing a Spark Streaming application to standardize events coming
from Kinesis.
As part of the logic, I want to use an IP-to-geo information library or service.
My questions:
1) If I were to use some REST service for this task, do you think it would
Hi,
For your first question, I think that we can use
sc.parallelize(rdd.take(1000))
For your second question, I am not sure, but I don't think that we can
restrict a filter to certain partitions without scanning every element.
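For illustration, here is a minimal sketch of that suggestion; the RDD name and the threshold of 1000 are just placeholders:

// Pull the first 1000 elements to the driver, then redistribute them as a small RDD.
val first1000 = bigRdd.take(1000)          // Array of the first 1000 elements
val smallRdd  = sc.parallelize(first1000)  // work on only this subset from here on

Note that take(1000) only reads as many partitions as it needs, so it avoids touching all 10 million rows; a filter, by contrast, has to evaluate its predicate on every element.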
Cheers
Gen
nsareen wrote
Hi ,
I wanted some clarity into the
Are we sure that exponentiating will give us the probabilities? I did some
tests by cloning the MLlib class and adding the required code, but the
calculated probabilities do not add up to 1.
I tried something like:
def predictProbs(testData: Vector): (BDV[Double], BDV[Double]) = {
val
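Exponentiating the margins alone will not sum to 1; they need to be normalized. Below is a minimal sketch of the standard normalization for a multinomial logistic model with class 0 as the pivot; this is the general formula, not the exact MLlib internals, and the function name is made up:

import breeze.linalg.{DenseVector => BDV, sum}

def marginsToProbs(margins: BDV[Double]): BDV[Double] = {
  val expMargins = margins.map(math.exp)     // one margin per non-pivot class
  val denom = 1.0 + sum(expMargins)          // pivot class contributes exp(0) = 1
  val probs = BDV.zeros[Double](margins.length + 1)
  probs(0) = 1.0 / denom                     // probability of the pivot class
  for (k <- 0 until margins.length) probs(k + 1) = expMargins(k) / denom
  probs                                      // entries now add up to 1
}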
I’ve also considered using Kafka to message between the Web UI and the pipes;
I think it will fit. Chaining the pipes together as a workflow and
implementing, managing and monitoring these long-running user tasks with the
locality I need is still causing me headaches.
You can look at Apache
1. If you use some custom API library, there's a chance you end up with
serialization errors and all, but a normal HTTP REST API would work fine,
except there could be a bit of a performance lag, and those APIs might limit
the number of requests.
2. I would go for this approach: either I will broadcast
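For illustration, here is a minimal sketch of the broadcast approach; the lookup table contents and the shape of the event stream are hypothetical:

// Hypothetical small IP-to-country table; in practice this would be loaded from a file.
val geoTable: Map[String, String] = Map("1.2.3.0" -> "US", "5.6.7.0" -> "DE")
val geoBc = sc.broadcast(geoTable)

// events is assumed to be a DStream[(String, String)] of (ip, payload) pairs.
val enriched = events.map { case (ip, payload) =>
  (ip, payload, geoBc.value.getOrElse(ip, "unknown"))  // local lookup, no REST call
}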
dirs.par.foreach { case (src, dest) =>
sc.textFile(src).process.saveAsFile(dest) }
Is that sufficient for you?
On Tuesday, December 2, 2014, Anselme Vignon anselme.vig...@flaminem.com
wrote:
Hi folks,
We have written a spark job that scans multiple hdfs directories and
perform
I am trying to understand the sort algorithm that is used in RDD#sortBy. I
have read that post from Matei
http://apache-spark-user-list.1001560.n3.nabble.com/Complexity-Efficiency-of-SortByKey-tp14328p14332.html
and that helps a little bit already.
I'd like to further understand the
This is a common use case, and this is how the Hadoop APIs for reading data
work: they return an Iterator[YourRecord] instead of reading every record
in at once.
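For illustration, here is a minimal sketch of working with the partition iterator directly in Spark, so records stream through one at a time; the path and the per-record parsing are hypothetical:

// mapPartitions hands you an Iterator over the partition's lines; mapping it
// lazily means the whole partition is never materialized in memory at once.
val parsed = sc.textFile("hdfs:///some/path").mapPartitions { lines =>
  lines.map(_.split('\t'))        // hypothetical per-record parsing
}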
On Dec 1, 2014 9:43 PM, Andy Twigg andy.tw...@gmail.com wrote:
You may be able to construct RDDs directly from an iterator - not sure
-
Hi Dibyendu,
What are your thoughts on keeping this solution (or not), considering that
Spark Streaming v1.2 will have built-in recoverability of the received data?
https://issues.apache.org/jira/browse/SPARK-1647
I'm concerned about the complexity of this solution with regard to the added
complexity
Thanks Sameer and Akhil for your help.
I tried both your suggestions; however, I still face the same issue. There
was indeed a space in the installation path for Scala and sbt, since I had kept
the defaults and hence the path was C:\Program Files. I
reinstalled Scala and sbt in C:\ as well
Similarly, I'm having an issue with the above solution when I use the
math.min() function to add to an accumulator. I'm seeing negative overflow
numbers again.
This code works fine without the math.min(), and even if I add an arbitrarily
large number like 100
// doesn't work
someRDD.foreach(x => {
This is more of a Scala question than a Spark question. Which dependency
injection framework do you guys use for Scala when using Spark? Is
http://scaldi.org/ recommended?
Regards
Venkat
I am running in local mode. I am using a Google n1-highmem-16 (16 vCPU, 104 GB
memory) machine.
I have allocated SPARK_DRIVER_MEMORY=95g.
I see Memory: 33.6 GB Used (73.7 GB Total) for the executor.
In the log output below, I see 33.6 GB of blocks used by 2 RDDs that I
have cached.
Any suggestion on how a user with a custom Hadoop jar can solve this issue?
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Sunday, November 30, 2014 11:06 PM
To: Judy Nash
Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
Subject: Re: latest Spark 1.2
On Tue, Dec 2, 2014 at 11:22 AM, Judy Nash
judyn...@exchange.microsoft.com wrote:
Any suggestion on how a user with a custom Hadoop jar can solve this issue?
You'll need to include all the dependencies for that custom Hadoop jar
on the classpath. Those will include Guava (which is not included in
Hi,
I have 16 GB of Parquet files in the /tmp/logs/ folder with the following schema:
request_id (String), module (String), payload (Array[Byte])
Most of my 16 GB of data is the payload field; the request_id and module
fields take less than 200 MB.
I want to load the payload only when my filter
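For reference, here is a minimal sketch of relying on Parquet column pruning by selecting only the small columns, so the payload bytes are not read at all. The table and column names come from the message above; the Spark SQL API as of 1.1 is assumed, and the WHERE clause is just illustrative:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val logs = sqlContext.parquetFile("/tmp/logs/")
logs.registerTempTable("logs")

// Only request_id and module are read from disk; the payload column is skipped.
val small = sqlContext.sql("SELECT request_id, module FROM logs WHERE module = 'foo'")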
I am running
Spark 1.1.0
DSE Cassandra 4.6
When I try to run the following SQL statement:
val sstring = "Select * from seasonality where customer_id = " +
  customer_id + " and cat_id = " + seg + " and period_desc = " + cDate
println("sstring = " + sstring)
val rrCheckRdd =
Hi,
I am trying to run JavaWordCount without using the spark-submit script. I have
copied the source code for JavaWordCount and am using a JavaSparkContext with
the following:
SparkConf conf = new SparkConf().setAppName("JavaWordCount");
Hi all, as the last stage of execution, I am writing out a dataset to disk.
Before I do this, I force the DAG to resolve so this is the only job left in
the pipeline. The dataset in question is not especially large (a few
gigabytes). During this step, however, HDFS will inevitably crash. I will
Hi all,
I am new to Spark and currently I am trying to run a Spark SQL query on an HBase
entity. For an entity with about 4000 rows, it takes about 12 seconds.
Is it expected? Is there any way to shorten the query process?
Here is the code snippet:
SparkConf sparkConf = new
I am happy to announce the availability of Spark 1.1.1! This is a
maintenance release with many bug fixes, most of which are concentrated in
the core. These include various fixes to sort-based shuffle, memory
leaks, and spilling issues. Contributions to this release came from 55
developers.
Hello all,
Is there a way to load an RDD in a small driver app and connect with a JDBC
client and issue SQL queries against it? It seems the thrift server only
works with pre-existing Hive tables.
Thanks
Jim
Hi, wondering if anyone could help with this. We use an EC2 cluster to run Spark
apps in standalone mode. The default log info goes to /$spark_folder/work/.
This folder is on the 10 GB root fs, so it doesn't take long to fill up the
whole fs.
My goal is
1. move the logging location to /mnt, where we
To answer my own question:
I was declaring the accumulator incorrectly. The code should look like this:
scala> import org.apache.spark.AccumulatorParam
import org.apache.spark.AccumulatorParam
scala> :paste
// Entering paste mode (ctrl-D to finish)
implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
  // the two methods AccumulatorParam requires: how to combine values, and the zero value
  def addInPlace(t1: BigInt, t2: BigInt): BigInt = t1 + t2
  def zero(initialValue: BigInt): BigInt = BigInt(0)
}
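With that implicit in scope, here is a hypothetical usage sketch; the RDD and the min() cap are placeholders mirroring the earlier snippet:

val acc = sc.accumulator(BigInt(0))   // resolves BigIntAccumulatorParam implicitly
someRDD.foreach(x => acc += BigInt(math.min(x, 100)))   // someRDD: a hypothetical RDD[Int]
println(acc.value)                    // read the accumulated result back on the driver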
Jamal,
I have not tried this, but can you not integrate Spark SQL with your Spring
Java web app just like a standalone app? I have integrated a Scala web app
(using Play) with Spark SQL and it works.
Mohammed
From: adrian [mailto:adria...@gmail.com]
Sent: Friday, November 28, 2014 11:03 AM
To:
There is an experimental method that allows you to start the JDBC server
with an existing HiveContext (which might have registered temporary tables).
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
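For illustration, here is a minimal sketch of using that entry point; the method name is taken from the linked file, and the RDD-to-table part is a hypothetical example assuming a case class schema:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
import hiveContext.createSchemaRDD          // implicit RDD -> SchemaRDD conversion

case class Record(id: Int, value: String)   // hypothetical schema
sc.parallelize(Seq(Record(1, "a"))).registerTempTable("records")

// Temp tables registered on this context become visible to JDBC clients.
HiveThriftServer2.startWithContext(hiveContext)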
Thanks! I'll give it a try.
A little bit about how to read this output. Resolution occurs from the
bottom up and when you see a tick (') it means that a field is unresolved.
So here it looks like Year_2011_Month_0_Week_0_Site is missing from
your RDD. (We are working on more obvious error messages.)
Michael
On Tue,
Hi Bharath,
You can try setting a small number of item blocks in this case. 1200 is
definitely too large for ALS; please try 30 or even smaller. I'm not
sure whether this will solve the problem, because you have 100 items
connected with 10^8 users. There is a JIRA for this issue:
Hello everyone,
Could anybody tell me how to import and call 3rd-party Java classes from
inside Spark?
Here's my case:
I have a jar file (the directory layout is com.xxx.yyy.zzz) which contains
some java classes, and I need to call some of them in spark code.
I used the statement import
Have you tried taking thread dumps via the UI? There is a link to do so on
the Executors page (typically under http://<driver IP>:4040/executors).
By visualizing the thread call stack of the executors with slow running
tasks, you can see exactly what code is executing at an instant in time. If
you
I finally solved this issue. The problem was that:
1. I defined a case class with a Buffer[MyType] field.
2. I instantiated the class with the field set to the value given by an
implicit conversion from a Java list, which is supposedly a Buffer.
3. However, the underlying type of that field was
Hi everyone,
I've been trying to set up Spark so that it can read data from HDFS, when
the HDFS cluster is integrated with Kerberos authentication.
I've been using the Spark shell to attempt to read from HDFS, in local mode.
I've set all of the appropriate properties in core-site.xml and
cat spark-env.sh
--
#!/usr/bin/env bash
export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily -Dspark.executor.logs.rolling.maxRetainedFiles=3"
export SPARK_LOCAL_DIRS=/mnt/spark
export SPARK_WORKER_DIR=/mnt/spark
--
But
1) I can go there, but none of the links are clickable.
2) When I see something like 116/120 partitions succeeded in the Stages UI,
in the Storage UI I see:
NOTE: RDD 27 has 116 partitions cached - 4 not - and those are exactly the
machines which will not complete.
Also, RDD 27 does not show up
We have a web application which talks to spark server.
This is how we have done the integration.
1) In Tomcat's classpath, add the Spark distribution jar so Spark code
is available at runtime (for you it would be Jetty).
2) In the Web application project, add the spark distribution jar
Interesting. Do you have any problems when launching in us-east-1? What is
the full output of spark-ec2 when launching a cluster? (Post it to a gist
if it’s too big for email.)
On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis dave.chal...@aistemos.com
wrote:
I've been trying to create a Spark
+Andrew
Actually I think this is because we haven't uploaded the Spark binaries to
cloudfront / pushed the change to mesos/spark-ec2.
Andrew, can you take care of this?
On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Interesting. Do you have any problems
Hi Rod,
The purpose of introducing the WAL mechanism in Spark Streaming as a general
solution is to let all the receivers benefit from this mechanism.
Though, as you said, external sources like Kafka have their own checkpoint
mechanism; instead of storing data in the WAL, we can only store
1) I think using a library-based solution is a better idea; we used that, and it
works. 2) We used a broadcast variable, and it works.
qinwei
From: Noam Kfir
Date: 2014-12-02 23:14
To: user@spark.apache.org
Subject: IP to geo information in spark streaming application
Hi
I'm new to
Bump up.
Michael Armbrust, anybody from Spark SQL team?
How did you solve this problem? I run the standalone application, but there
is no APPLICATION_COMPLETE file either.
On Sat, Nov 8, 2014 at 2:11 PM, Arun Ahuja aahuj...@gmail.com wrote:
We are running our applications through YARN and are only sometimes seeing
them in the History Server. Most do
hello,
I'm running Spark on a cluster and I want to monitor how many nodes/cores
are active at different (specific) points of the program.
Is there any way to do this?
thanks,
Isca
Hi Isca,
I think SPM can do that for you:
http://blog.sematext.com/2014/10/07/apache-spark-monitoring/
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/
On Tue, Dec 2, 2014 at 11:57 PM, Isca Harmatz