Hi Davies.
Thank you for the quick answer.
I have a code like this:
sc = SparkContext(appName=TAD)
lines = sc.textFile(sys.argv[1], 1)
result = lines.map(doSplit).groupByKey().map(
    lambda (k, vc): traffic_process_model(k, vc))
result.saveAsTextFile(sys.argv[2])
Can you please give short
val iter = toLocalIterator (rdd)
This is what I am doing and it says error: not found
On Fri, Nov 14, 2014 at 12:34 PM, Patrick Wendell pwend...@gmail.com
wrote:
It looks like you are trying to directly import the toLocalIterator
function. You can't import functions; it should just appear as
Deep,
toLocalIterator is a method on the RDD class. So try this instead:
rdd.toLocalIterator
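For reference, a minimal self-contained sketch of that pattern (the data and partition count are just examples):

import org.apache.spark.{SparkConf, SparkContext}

object LocalIteratorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("local-iterator-example"))
    val rdd = sc.parallelize(1 to 10, 2)
    // toLocalIterator is a method on the RDD instance; it returns an Iterator
    // that brings the data to the driver one partition at a time.
    val iter: Iterator[Int] = rdd.toLocalIterator
    iter.foreach(println)
    sc.stop()
  }
}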
On Fri, Nov 14, 2014 at 12:21 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
val iter = toLocalIterator (rdd)
This is what I am doing and it says error: not found
On Fri, Nov 14, 2014 at
Hi folks,
Although SPARK-1977 says this problem is resolved in 1.0.2, I still hit it
while running the script on AWS EC2 via spark-ec2.py.
I checked SPARK-1977 and found that twitter.chill resolved the problem in
v0.4.0, not v0.3.6, but Spark depends on twitter.chill v0.3.6.
Hi,
I am using PySpark (1.1) for some image processing tasks.
The images (as an RDD) are on the order of several MB to low/mid double-digit MB.
However, when running operations on the data with Spark, I see memory usage
blow up. Is there anything I can do about it? I
TJ, what was your expansion factor between image size on disk and in memory
in PySpark? I'd expect the in-memory size to be larger due to Java object
overhead, but I don't know the exact amounts you should expect.
On Fri, Nov 14, 2014 at 12:50 AM, TJ Klein tjkl...@gmail.com wrote:
Hi,
I am using
See
http://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/EmptyRDD.html
On Nov 14, 2014, at 2:09 AM, Deep Pradhan pradhandeep1...@gmail.com wrote:
How to create an empty RDD in Spark?
Thank You
If I remember correctly, EmptyRDD is private[spark].
You can create an empty RDD using the SparkContext:
val emptyRdd = sc.emptyRDD
-kr, Gerard.
On Fri, Nov 14, 2014 at 11:22 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
To get an empty RDD, I did this:
I have an rdd with one
Hi guys,
I have a question about how the basics of D-Streams, accumulators, failure
and speculative execution interact.
Let's say I have a streaming app that takes a stream of strings, formats
them (let's say it converts each to Unicode), and prints them (e.g. on a
news ticker). I know print()
It looks like a Scala issue. It seems the implicit conversion to
ArrayOps does not apply if the type is Array[Nothing].
Try giving a type to the empty RDD:
val emptyRdd: RDD[Any] = sc.emptyRDD
emptyRdd.collect.foreach(println) // prints a line return
-kr, Gerard.
On Fri, Nov 14, 2014 at
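For reference, a minimal sketch combining the two suggestions above (the element type Int is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object EmptyRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("empty-rdd-example"))
    // Without a type annotation, sc.emptyRDD infers RDD[Nothing], which is what
    // triggers the Array[Nothing] issue above; give it an explicit element type.
    val emptyInts: RDD[Int] = sc.emptyRDD[Int]
    println(emptyInts.count()) // 0
    // An alternative that also yields an empty RDD with a concrete element type:
    val emptyStrings = sc.parallelize(Seq.empty[String])
    println(emptyStrings.count()) // 0
    sc.stop()
  }
}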
Which version are you using? You probably hit this bug
https://issues.apache.org/jira/browse/SPARK-3421 if some field name in
the JSON contains characters other than [a-zA-Z0-9_].
This has been fixed in https://github.com/apache/spark/pull/2563
On 11/14/14 6:35 PM, vdiwakar.malladi wrote:
Thanks for your response. I'm using Spark 1.1.0
Currently I have the spark setup which comes with Hadoop CDH (using cloudera
manager). Could you please suggest me, how can I make use of the patch?
Thanks in advance.
It shows a NullPointerException; your data could be corrupted. Try putting a
try/catch inside the operation that you are doing. Are you running the
worker process on the master node also? If not, then only one node will be
doing the processing. If yes, then try setting the level of parallelism and
Hm, I'm not sure whether this is the official way to upgrade CDH Spark;
maybe you can check out https://github.com/cloudera/spark, apply the required
patches, and then compile your own version.
On 11/14/14 8:46 PM, vdiwakar.malladi wrote:
Thanks for your response. I'm using Spark 1.1.0
Currently
Hi,
I am trying to read an HDFS file from Spark scheduler code. I could find
how to do HDFS reads/writes in Java,
but I need to access HDFS from Spark using Scala. Can someone please help
me in this regard.
Like this?
val file = sc.textFile("hdfs://localhost:9000/sigmoid/input.txt")
Thanks
Best Regards
On Fri, Nov 14, 2014 at 9:02 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
I am trying to read a HDFS file from Spark scheduler code. I could find
how to write hdfs read/writes in java.
Can you not create a SparkContext inside the scheduler code? If you are
looking just to access HDFS, then you can use the following object; with it
you can create/read/write files.
val hdfs = org.apache.hadoop.fs.FileSystem.get(new
URI("hdfs://localhost:9000"), hadoopConf)
Thanks
Best Regards
On
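For completeness, a minimal sketch of that approach with the imports filled in (the namenode URI and paths are just examples):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsAccessExample {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()
    val hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConf)
    // Read a file line by line.
    val in = hdfs.open(new Path("/sigmoid/input.txt"))
    Source.fromInputStream(in).getLines().foreach(println)
    in.close()
    // Write a file.
    val out = hdfs.create(new Path("/sigmoid/output.txt"))
    out.writeBytes("hello hdfs\n")
    out.close()
    hdfs.close()
  }
}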
Hi, I am facing an issue as a cloud sysadmin: when the Spark master is launched
on a public IP, anyone who knows the Spark URL can submit jobs to it.
Is there any way/hack to have authn and authz in Spark? I tried to look into it
but could not find anything.
Any hint?
--
Regards
Zeeshan Ali Shah
[image: Inline image 1]
Thanks
Best Regards
On Fri, Nov 14, 2014 at 9:18 PM, Bui, Tri
tri@verizonwireless.com.invalid wrote:
It should be
val file = sc.textFile("hdfs:///localhost:9000/sigmoid/input.txt")
3 slashes: "///"
Thanks
Tri
*From:* rapelly kartheek
I'll just try it out with the object Akhil provided.
There was no problem working in shell with sc.textFile.
Thank you Akhil and Tri.
On Fri, Nov 14, 2014 at 9:21 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
[image: Inline image 1]
Thanks
Best Regards
On Fri, Nov 14, 2014 at 9:18 PM, Bui,
How do I set the log level when running local[n]? It ignores the
log4j.properties file on my classpath.
I also tried to set the Spark home dir on the SparkConf using setSparkHome,
and made sure an appropriate log4j.properties file was in a conf
subdirectory, and that didn't work either.
I'm
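One workaround (a sketch, not necessarily the official route): set the level programmatically at the top of the driver, before creating the SparkContext, which sidesteps the log4j.properties lookup entirely:

import org.apache.log4j.{Level, Logger}

// Quiet everything, then selectively raise what you care about.
Logger.getRootLogger.setLevel(Level.WARN)
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)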
Hi Akhil,
I get the error: not found: value URI
On Fri, Nov 14, 2014 at 9:29 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
I'll just try out with object Akhil provided.
There was no problem working in shell with sc.textFile.
Thank you Akhil and Tri.
On Fri, Nov 14, 2014 at 9:21 PM,
Let's say I have to apply a complex sequence of operations to a certain RDD.
In order to make code more modular/readable, I would typically have
something like this:
object myObject {
  def main(args: Array[String]) {
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    val rdd3 =
Actually, it looks like it's Parquet logging that I don't have control over.
For some reason the Parquet project decided to use java.util.logging with
its own logging configuration.
If you use the Kryo serializer, you need to register mutable.BitSet and Rating:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102
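For reference, a sketch of what that registration can look like with a custom KryoRegistrator, along the lines of the linked MovieLensALS example (class and app names here are placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator
import scala.collection.mutable

class ALSRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.BitSet])
  }
}

// In the driver configuration:
// val conf = new SparkConf()
//   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)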
The JIRA was marked resolved because chill resolved the problem in
v0.4.0 and we have this
mapPartitions tried to hold data in memory, which did not work for me...
I am doing flatMap followed by groupByKey now, with a HashPartitioner and the
number of blocks set to 60 (based on the 120 cores I am running the job on)...
Now when the shuffle size is 100 GB it works fine... as the flatMap shuffle goes
to 200
I'm running a local spark master (local[n]).
I cannot seem to turn off the parquet logging. I tried:
1) Setting a log4j.properties on the classpath.
2) Setting a log4j.properties file in a spark install conf directory and
pointing to the install using setSparkHome
3) Editing the
How about using a fluent style of Scala programming?
On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com
wrote:
Let's say I have to apply a complex sequence of operations to a certain
RDD.
In order to make code more modular/readable, I would typically have
something like
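A small sketch of what that fluent/chained style could look like; function1 and function2 here are hypothetical stand-ins for the functions in the original snippet:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FluentStyle {
  // Hypothetical stand-ins for the real transformations.
  def function1(rdd: RDD[String]): RDD[Int] = rdd.map(_.length)
  def function2(rdd: RDD[Int]): RDD[Int] = rdd.filter(_ > 3)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fluent-style"))
    val myRdd = sc.parallelize(Seq("spark", "rdd", "fluent"))
    // Intermediate vals, as in the original snippet...
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    // ...or the same pipeline written as one chained expression; both build
    // the same lineage, so there is no performance difference.
    val result = function2(function1(myRdd))
    result.collect().foreach(println)
    sc.stop()
  }
}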
This code executes on the driver, and an RDD here is really just a
handle on all the distributed data out there. It's a local bookkeeping
object. So, manipulation of these objects themselves in the local
driver code has virtually no performance impact. These two versions
would be about identical*.
I have an RDD x of millions of STRINGs, each of which I want to pass
through a set of filters. My filtering code looks like this:
x.filter(filter#1, which will filter out 40% of data)
 .filter(filter#2, which will filter out 20% of data)
 .filter(filter#3, which will filter out 2% of data)
This is a problem because of a bug in Spark in the current master (beyond
the fact that Parquet uses java.util.logging).
ParquetRelation.scala attempts to override the Parquet logger but, at least
currently (and if your application simply reads a Parquet file before it
does anything else
Just to be complete, this is a problem in Spark that I worked around and
detailed here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-td18955.html
Hi,
I tried to modify the NetworkWordCount example in order to save the output to a
file.
In NetworkWordCount.scala I replaced the line
wordCounts.print()
with
wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt")
When I ran sbt/sbt package it returned the following error:
[error]
Hi,
I was wondering if the adaptive stream processing and dynamic batch
processing was available to use in spark streaming? If someone could help
point me in the right direction?
Thanks,
Josh
I see. The general known constraints on building your assembly jar for
pyspark on Yarn are:
Java 6
NOT RedHat
Maven
Some of these are documented here
http://spark.apache.org/docs/latest/building-with-maven.html (bottom).
Maybe we should make it more explicit.
2014-11-13 2:31 GMT-08:00 jamborta
Hi Niko,
It looks like you are calling a method that does not exist on DStream.
Check out:
https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams
for the method saveAsTextFiles
Harold
On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin
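For reference, a sketch of NetworkWordCount writing its output with the DStream method saveAsTextFiles instead of print() (host, port, and output prefix are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object NetworkWordCountToFiles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCountToFiles")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    // Writes one output directory per batch, named <prefix>-<batch time>.txt
    wordCounts.saveAsTextFiles("/home/bart/rest_services/output", "txt")
    ssc.start()
    ssc.awaitTermination()
  }
}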
In the situation you show, Spark will pipeline the filters together, and
will apply each filter, one at a time, to each row, effectively combining
them into a single condition. You would only see a performance difference if
the filter code itself is somewhat expensive; then you would want to only
execute it on
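To illustrate the point with a self-contained toy (the predicates are made up), the chained form below executes as one pipelined stage, with each record tested against the predicates in turn:

import org.apache.spark.{SparkConf, SparkContext}

object ChainedFilters {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-filters"))
    val x = sc.parallelize(Seq("apple", "banana", "cherry", "date"))
    val filtered = x
      .filter(_.nonEmpty)          // stand-in for filter #1
      .filter(_.length > 4)        // stand-in for filter #2
      .filter(!_.startsWith("z"))  // stand-in for filter #3
    filtered.collect().foreach(println)
    sc.stop()
  }
}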
Anyone want a PR?
Yes please.
The mvn build command is
mvn clean install -Pyarn -Phive -Phive-0.13.1 -Phadoop-2.4
-Djava.version=1.7 -DskipTests
I'm getting this error message:
[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first)
on project spark-hive_2.10: wrap:
Hello all. I have been running a Spark job that eventually needs to do a large
join:
24 million x 150 million
A broadcast join is clearly infeasible in this instance, so I am instead
attempting to do it with hash partitioning, by defining a custom partitioner as:
class
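As a rough sketch of that idea (tiny toy data, and 200 partitions chosen arbitrarily), partitioning both sides with the same partitioner before the join looks like this:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object PartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioned-join"))
    val partitioner = new HashPartitioner(200)
    val left = sc.parallelize(Seq((1L, "a"), (2L, "b"))).partitionBy(partitioner)
    val right = sc.parallelize(Seq((1L, "x"), (3L, "y"))).partitionBy(partitioner)
    // Both sides share the partitioner, so the join itself avoids a full reshuffle.
    left.join(right).collect().foreach(println)
    sc.stop()
  }
}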
I have instrumented word count to track how many machines the code runs
on. I use an accumulator to maintain a Set of MAC addresses. I find that
everything is done on a single machine. This is probably optimal for word
count, but not for the larger problems I am working on.
How do I force processing to
Hi.
I execute IPython Notebook + PySpark with spark.dynamicAllocation.enabled =
true. The task never ends.
Code:
import sys
from random import random
from operator import add
partitions = 10
n = 10 * partitions
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 +
Most of the information you're asking for can be found on the Spark web UI (see
here http://spark.apache.org/docs/1.1.0/monitoring.html). You can see
which tasks are being processed by which nodes.
If you're using HDFS and your file size is smaller than the HDFS block size
you will only have one
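A hedged sketch of the usual levers for that (paths and partition counts are examples): ask for more partitions when reading, or repartition before the expensive stage, so tasks can be scheduled on more than one machine:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SpreadWork {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spread-work"))
    // Hint a minimum number of partitions when reading a small file...
    val lines = sc.textFile("hdfs:///some/input.txt", 8)
    // ...or repartition an existing RDD before the heavy work.
    val counts = lines.repartition(8)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///some/output")
    sc.stop()
  }
}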
Hi All,
I'm not quite clear on whether submitting a Python application to Spark
standalone on EC2 is possible.
Am I reading this correctly:
*A common deployment strategy is to submit your application from a gateway
machine that is physically co-located with your worker machines (e.g.
Master
It's successful without dynamic allocation. I can provide spark log for
that scenario if it can help.
2014-11-14 21:36 GMT+02:00 Sandy Ryza sandy.r...@cloudera.com:
Hi Egor,
Is it successful without dynamic allocation? From your log, it looks like
the job is unable to acquire resources from
Hi,
If I look inside the Algebird Monoid implementation, it uses
java.io.Serializable...
But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't
see a KryoRegistrator for the CMS and HLL monoids...
In these examples, will CMS and HLL run with Kryo serialization, or will
they be Java
I reworked my app using your idea of throwing the data in a map. It looks
like it should work but I'm getting some strange errors and my job gets
terminated. I get a
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered
I need to continuously run multiple calculations concurrently on a cluster.
They do not share RDDs.
Each calculation needs a different number of cores and a different amount of
memory. Also, some of them are long-running calculations and others are
short-running ones. They all need to be run on a regular basis
Jira: https://issues.apache.org/jira/browse/SPARK-4412
PR: https://github.com/apache/spark/pull/3271
While testing Spark SQL on a bunch of Parquet files (which basically used to be
a partition of one of our Hive tables), I encountered this error:
import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
Thanks Cheng, that was helpful. I noticed from the UI that only half of the
memory per executor was being used for caching; is that expected? We have a 2
TB sequence file dataset that we wanted to cache in our cluster with ~5 TB of
memory, but caching still failed, and what it looked like from the UI was that
it
It's not recommended to have multiple Spark contexts in one JVM, but you
could launch a separate JVM per context. How resources get allocated is
probably outside the scope of Spark, and more of a task for the cluster
manager.
On Fri, Nov 14, 2014 at 12:58 PM, Charles charles...@cenx.com wrote:
I
We have a bunch of data in RedShift tables that we'd like to pull in during
job runs to Spark. What is the path/url format one uses to pull data from
there? (This is in reference to using the
https://github.com/mengxr/redshift-input-format)
That would be helpful as well. Can you confirm that when you try it with
dynamic allocation the cluster has free resources?
On Fri, Nov 14, 2014 at 12:17 PM, Egor Pahomov pahomov.e...@gmail.com
wrote:
It's successful without dynamic allocation. I can provide spark log for
that scenario if it
Does this also apply to StreamingContext?
What issue would I have if I had 1000s of StreamingContexts?
Thanks
Tri
From: Daniil Osipov [mailto:daniil.osi...@shazam.com]
Sent: Friday, November 14, 2014 3:47 PM
To: Charles
Cc: u...@spark.incubator.apache.org
Subject: Re: Mulitple Spark Context
The cluster runs Mesos and I can see the tasks in the Mesos UI but most are
not doing much - any hints about that UI
On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
Most of the information you're asking for can be found on the Spark web UI
(see here
Thanks for your reply! Can you be more specific about the JVM? Is the JVM
referring to the driver application?
If I want to create multiple SparkContexts, will I need to start a driver
application instance for each SparkContext?
I had the same need as those documented back to July archived at
http://qnalist.com/questions/5013193/client-application-that-calls-spark-and-receives-an-mllib-model-scala-object-not-just-result
.
I wonder if anyone would like to share any successful stories.
Thanks,
Xiaoyan
Hey Egor,
Have you checked the AM logs? My guess is that it threw an exception or
something such that no executors (not even the initial set) have registered
with your driver. You may already know this, but you can go to the
ResourceManager page at http://<RM address>:8088 and click into the application to access
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
command used to produce the data. Xiangrui can correct me if I'm wrong
though.
On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com wrote:
We have a bunch of data in RedShift tables that we'd like to pull in
Referring to this paper http://dl.acm.org/citation.cfm?id=2670995.
On Fri, Nov 14, 2014 at 10:42 AM, Josh J joshjd...@gmail.com wrote:
Hi,
I was wondering if the adaptive stream processing and dynamic batch
processing was available to use in spark streaming? If someone could help
point me
Hmm, we actually read the CSV data in S3 now and were looking to avoid
that. Unfortunately, we've experienced dreadful performance reading 100GB
of text data for a job directly from S3 - our hope had been connecting
directly to Redshift would provide some boost.
We had been using 12 m3.xlarges,
I'll try this out and follow up with what I find.
On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com wrote:
For each node, if the CSV reader is implemented efficiently, you should be
able to hit at least half of the theoretical network bandwidth, which is
about
Hm... Have you tuned spark.storage.memoryFraction? By default, 60% of
memory is used for caching. You may refer to the details here:
http://spark.apache.org/docs/latest/configuration.html
On 11/15/14 5:43 AM, Sadhan Sood wrote:
Thanks Cheng, that was helpful. I noticed from UI that only half
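For reference, a minimal sketch of bumping that setting (0.8 is just an example value; the default mentioned above is 0.6):

import org.apache.spark.{SparkConf, SparkContext}

object CacheTuning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cache-tuning")
      .set("spark.storage.memoryFraction", "0.8")
    val sc = new SparkContext(conf)
    // ... cache RDDs as before ...
    sc.stop()
  }
}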
Hi all,
I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one
of which holds the text for a sentence from a document.
# Load sentence data table
sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
sentenceRDD.take(3)
Out[20]: [Row(annotID=118,
Hi, did you try using single quotes instead of double quotes around the column
name? I faced a similar situation with Apache Phoenix.
On Saturday, November 15, 2014, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Hi all,
I have a SchemaRDD that Is loaded from a file. Each Row contains 7 fields,
If I use row[6] instead of row["text"] I get what I am looking for.
However, finding the right numeric index could be a pain.
Can I access the fields in a Row of a SchemaRDD by name, so that I can
map, filter, etc. without a trial-and-error process of finding the right
int for the