In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD binds to a
particular SQLContext at runtime; I don't think we can manipulate or share a
SchemaRDD across SQLContext instances.
As Hao already mentioned, using 'hive' (the HiveContext) throughout would
work.
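A minimal sketch of what using a single HiveContext throughout might look like, assuming the Spark 1.0-era API (the table name here is made up):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "single-hive-context")
val hive = new HiveContext(sc)

// Everything goes through the same HiveContext, so the resulting SchemaRDDs
// stay bound to one context instead of being shared across two of them.
val sample = hive.hql("SELECT * FROM src LIMIT 100")   // hql() in Spark 1.0.x
sample.registerAsTable("sample")
hive.hql("SELECT COUNT(*) FROM sample").collect().foreach(println)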
On Monday, July 28, 2014, Cheng, Hao hao.ch...@intel.com wrote:
In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD
binds to a particular SQLContext at runtime; I don't think we can
I'm not sure I understand this, maybe because the context is missing.
An RDD is immutable, so there is no such thing as writing to an RDD.
I'm not sure which aspect is being referred to as single-threaded. Is
this the Spark Streaming driver?
What is the difference between streaming into Spark and
Hi all,
There is a problem we can't resolve. We implemented the OWLQN algorithm in
parallel with Spark.
We don't know why it is so slow in every iteration stage; the CPU and memory
load on each executor is so low that nothing obvious seems to be making every
step slow.
And
Just an update on this,
Looking into the Spark logs, it seems that some partitions are not found and
are recomputed. This gives the impression that they are related to the delayed
updateStateByKey calls.
I'm seeing something like:
log line 1 - Partition rdd_132_1 not found, computing it
log line N -
Hi all,
I can read Avro files into Spark with HadoopRDD and submit the schema in
the jobConf, but with the guidance I've seen so far, I'm left with an Avro
GenericRecord of untyped Java objects. How do I actually use the
schema to have the types inferred?
Example:
scala
Is there any example out there for unit testing a Spark application in Java?
Even a trivial application like word count would be very helpful. I am very
new to this and I am struggling to understand how I can use JavaSparkContext
from JUnit.
I'm not sure if job adverts are allowed on here - please let me know if
not.
Otherwise, if you're interested in using Spark in an R&D machine learning
project, then please get in touch. We are a startup based in London.
Our data sets are on a massive scale: we collect data on over a billion
users
I've been doing some work on building Spark blueprints, and recently tried to
generalize one to make blueprinting Spark apps easy.
https://github.com/jayunit100/SparkBlueprint.git
It runs the spark app's main method in a unit test, and builds in SBT.
You can easily try it out and improve on it.
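As an illustration of the same idea, a minimal ScalaTest sketch that drives a made-up app object through a local SparkContext might look like this (names and data are illustrative, not from the blueprint repo):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.scalatest.FunSuite

// Hypothetical application object: the kind of entry point a blueprint would exercise.
object WordCountApp {
  def run(sc: SparkContext, lines: Seq[String]): Map[String, Long] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .collectAsMap()
      .toMap
}

class WordCountAppSuite extends FunSuite {
  test("word count over a tiny in-memory dataset") {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val counts = WordCountApp.run(sc, Seq("a b a", "b c"))
      assert(counts("a") === 2L)
      assert(counts("b") === 2L)
      assert(counts("c") === 1L)
    } finally {
      sc.stop()  // always stop the context so tests can run back to back
    }
  }
}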
Do you mind sharing more details, for example, specs of nodes and data
size? -Xiangrui
2014-07-29 2:51 GMT-07:00 John Wu j...@zamplus.com:
Hi all,
There is a problem we can't resolve. We implemented the OWLQN algorithm in
parallel with Spark.
We don't know why it is very slow in every
Hi, I'm running a Spark standalone cluster to calculate single-source
shortest paths.
Here is the code; the vertex type is VertexRDD[(String, Long)], where the
String is the path and the Long is the distance.
The code before these lines is related to reading the graph data from a file
and building the graph.
val sssp =
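The snippet is cut off above; for context, a single-source shortest path in GraphX typically follows the Pregel pattern from the GraphX guide, roughly like the sketch below (a tiny hand-built graph and an arbitrary source vertex stand in for the poster's data):

import org.apache.spark.SparkContext
import org.apache.spark.graphx._

val sc = new SparkContext("local", "sssp-sketch")

// A tiny edge list stands in for the graph built from file in the thread.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 4.0)))
val graph: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 0.0)

val sourceId: VertexId = 1L  // arbitrary source vertex

// Distance 0 for the source, infinity for everyone else.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),            // vertex program
  triplet =>                                                  // send messages
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b)                                    // merge messages
)

println(sssp.vertices.collect().mkString("\n"))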
Hi,
try this one
http://simpletoad.blogspot.com/2014/07/runing-spark-unit-test-on-windows-7.html
it's more about fixing a Windows-specific issue, but the code snippet gives the
general idea:
just run the ETL and check the output with assert(s)
On Jul 29, 2014, at 6:29 PM, soumick86 sdasgu...@dstsystems.com
You can take a look at
https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java
and model your JUnit tests on it.
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Tue, Jul 29, 2014 at 10:10 PM,
OK, I did figure this out. I was running the app (avocado) using
spark-submit, when it was actually designed to take command-line arguments
to connect to a Spark cluster. Since I didn't provide any such arguments, it
started a nested local Spark cluster *inside* the YARN Spark executor and so
of
Hey all,
I’m currently trying to run connected components using GraphX on a large graph
(~1.8b vertices and ~3b edges, most of them are self edges where the only edge
that exists for vertex v is v-v) on EMR using 50 m3.xlarge nodes. As the
program runs I’m seeing each iteration take longer and
Development is really rapid here; that's a great thing.
Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?
Thanks for the response... hive-site.xml is in the classpath so that doesn't
seem to be the issue.
IScala itself seems to be a bit dead unfortunately.
I did come across this today: https://github.com/tribbloid/ISpark
On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989
ericjohnston1...@gmail.com wrote:
Hey everyone,
I know this was asked before but I'm wondering if there have since been
I'm looking for something like the ooyala spark-jobserver (
https://github.com/ooyala/spark-jobserver) that basically manages a
SparkContext for use from a REST or web application environment, but for
python jobs instead of scala.
Has anyone written something like this? Looking for a project or
The warehouse and the metastore directories are two different things. The
metastore holds the schema information about the tables and will by default
be a local directory. With javax.jdo.option.ConnectionURL you can
configure it to point at something like MySQL. The warehouse directory is the
default
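As a rough illustration of the distinction, assuming a Spark 1.0-era HiveContext (the paths and table name are made up; depending on the version, these settings usually go into hive-site.xml on the classpath instead):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "metastore-vs-warehouse")
val hive = new HiveContext(sc)

// Warehouse directory: where the *data* of managed tables is written; it can
// point at HDFS.
hive.hql("SET hive.metastore.warehouse.dir=hdfs:///user/hive/warehouse")

// Metastore: where the table *schemas* live. By default it is a local Derby
// directory; javax.jdo.option.ConnectionURL (e.g. a MySQL JDBC URL) is normally
// configured in hive-site.xml rather than from code.
hive.hql("CREATE TABLE IF NOT EXISTS example (key INT, value STRING)")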
Some people started some work on that topic using the notebook (the
original or the n8han one, I cannot remember)... Some issues have been created
already ^^
On 29 Jul 2014 at 19:59, Nick Pentreath nick.pentre...@gmail.com wrote:
IScala itself seems to be a bit dead unfortunately.
I did come
Heya,
I would like to use countApproxDistinct in pyspark, I know that it's an
experimental method and that it is not yet available in PySpark. I started
by porting the countApproxDistinct unit test to Python, see
https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the
results are
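For reference, the Scala RDD method being ported looks roughly like this (0.05 is just an illustrative relative standard deviation):

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "count-approx-distinct")
val rdd = sc.parallelize(1 to 100000).map(_ % 1000)

// Exact answer for comparison.
println(rdd.distinct().count())        // 1000

// Approximate answer; 0.05 is the target relative standard deviation,
// so the result should usually land within a few percent of 1000.
println(rdd.countApproxDistinct(0.05))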
Hi,
I am trying to integrate Spark onto a Flume log sink and avro source. The
sink is on one machine (the application), and the source is on another. Log
events are being sent from the application server to the avro source server
(a log directory sink on the Avro source prints to verify)
The aim
Yifan LI iamyifa...@gmail.com writes:
Maybe you could get the vertex whose id is, for instance, 80 by using:
graph.vertices.filter { case (id, _) => id == 80 }.collect
but I am not sure this is the most efficient way (it will scan the whole
table? if it cannot get any benefit from an index of
I just looked at my dependencies in sbt, and when using the cdh4.5.0
dependencies I see that hadoop-client pulls in JBoss Netty (via ZooKeeper)
and ASM 3.x (via jersey-server). So somehow these exclusion rules are not
working anymore? I will look into sbt-pom-reader a bit to try to understand
what's
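For reference, explicit exclusion rules in build.sbt usually take a shape like the sketch below (the artifact coordinates and version are illustrative, not taken from the thread):

// build.sbt (sketch): keep the CDH hadoop-client but drop the conflicting
// transitive dependencies mentioned above.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.5.0" excludeAll(
  ExclusionRule(organization = "org.jboss.netty"),
  ExclusionRule(organization = "asm")
)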
Denis RP qq378789...@gmail.com writes:
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
stage failure: Task 6.0:4 failed 4 times, most recent failure: Exception
failure in TID 598 on host worker6.local: java.lang.NullPointerException
[error]
I am trying to run an example Spark standalone app with the following code
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
object SparkGensimLDA extends App {
val ssc = new StreamingContext("local", "testApp", Seconds(5))
val
Hi Martin,
Job ads are actually not allowed on the list, but thanks for asking. Just
posting this for others' future reference.
Matei
On July 29, 2014 at 8:34:59 AM, Martin Goodson (mar...@skimlinks.com) wrote:
I'm not sure if job adverts are allowed on here - please let me know if not.
Before torrent, HTTP was the default way of broadcasting. The driver
holds the data and the executors request it via HTTP, making the
driver the bottleneck if the data is large. -Xiangrui
On Tue, Jul 29, 2014 at 10:32 AM, durin m...@simon-schaefer.net wrote:
Development is really rapid
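For context, in the Spark 1.x releases of that era the broadcast implementation was selectable through a configuration property; a small sketch (property and class names as in Spark 1.x, the lookup table is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Torrent broadcast avoids the driver-as-single-server bottleneck described above.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("broadcast-factory-example")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
  // .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")

val sc = new SparkContext(conf)
val lookup = sc.broadcast((1 to 1000).map(i => i -> i * i).toMap)
println(sc.parallelize(1 to 10).map(i => lookup.value(i)).collect().mkString(","))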
Just realized that I was missing JavaSparkContext in the imports; after
adding it, the error is:
Exception in thread main org.apache.spark.SparkException: Job aborted due to
stage failure: Task not serializable: java.io.NotSerializableException:
java.lang.reflect.Method
at
Hi Benjamin,
I think the best bet would be to use the Avro code generation stuff to generate
a SpecificRecord for your schema and then change the reader to use your
specific type rather than GenericRecord.
Trying to read in the generic record and then do type inference and spit out a
tuple
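A rough sketch of that approach (here MyRecord is a hypothetical class generated by the Avro compiler, getName is a made-up field accessor, and the path is illustrative):

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "avro-specific-sketch")

// MyRecord: hypothetical SpecificRecord generated from the .avsc schema.
val job = Job.getInstance()
AvroJob.setInputKeySchema(job, MyRecord.getClassSchema)

val records = sc.newAPIHadoopFile(
  "hdfs:///data/input.avro",
  classOf[AvroKeyInputFormat[MyRecord]],
  classOf[AvroKey[MyRecord]],
  classOf[NullWritable],
  job.getConfiguration)

// Values come back as the generated type, so fields are accessed without casting.
val names = records.map { case (key, _) => key.datum().getName }
names.take(5).foreach(println)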
Hi,
that quoted statement doesn't make too much sense to me, either. Maybe if
you had a link for us that shows the context (Google doesn't reveal
anything but this conversation), we could evaluate that statement better.
Tobias
On Tue, Jul 29, 2014 at 5:53 PM, Sean Owen so...@cloudera.com
sched.cpp:217] New master detected at
master@192.168.3.91:5050
I0729 18:40:50.442570 15035 sched.cpp:225] No credentials provided.
Attempting to register without authentication
I0729 18:40:50.443234 15036 sched.cpp:391] Framework registered with
20140729-174911-1526966464-5050-13758-0006
14/07/29 18:40:50
I’m in the PySpark shell and I’m trying to do this:
a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json',
                minPartitions=sc.defaultParallelism * 3).cache()
a.map(lambda x: len(x)).max()
My job dies with the following:
14/07/30 01:46:28 WARN TaskSetManager: Loss was
Hi all,
I want to run a job on two specific nodes in the cluster. How do I
configure YARN to do that? Does a YARN queue help?
Thanks
Hi, all
We are migrating from MapReduce to Spark, and have encountered a problem.
Our input files are IIS logs with a file header. It's easy to get the header
if we process only one file, e.g.
val lines = sc.textFile("hdfs://*/u_ex14073011.log")
val head = lines.take(4)
Then we can write our map
This is an interesting question. I’m curious to know as well how this
problem can be approached.
Is there a way, perhaps, to ensure that each input file matching the glob
expression gets mapped to exactly one partition? Then you could probably
get what you want using RDD.mapPartitions().
Nick
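A minimal sketch of that idea, assuming each input file lands in exactly one partition (e.g. files smaller than a block, not split) and a 4-line header as in the earlier example (the glob path is made up):

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "iis-header-sketch")
val lines = sc.textFile("hdfs:///logs/u_ex*.log")

// Assuming every partition begins at a file boundary, drop the 4 header lines
// at the start of each partition (4 comes from the earlier lines.take(4) example).
val withoutHeaders = lines.mapPartitions(iter => iter.drop(4))
withoutHeaders.take(10).foreach(println)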
Hi,
For testing you could also just use the Kafka 0.7.2 console consumer and
pipe its output to netcat (nc), then process that as in the example
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
That worked for me.
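For reference, the consuming side of that setup is just the standard socket stream, roughly as below (host, port and batch interval are illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// With the Kafka console consumer piped into `nc -lk 9999` on localhost,
// a plain socket stream is enough for testing.
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaViaNetcat")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()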
Could you share more details about the dataset and the algorithm? For
example, if the dataset has 10M+ features, it may be slow for the driver to
collect the weights from executors (just a blind guess). -Xiangrui
On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim unname...@gmail.com wrote:
Hi, all
It will certainly cause bad performance, since it reads the whole content
of a large file into one value, instead of splitting it into partitions.
Typically one file is 1 GB. Suppose we have 3 large files; in that case
there would be only 3 key-value pairs, and thus 3 tasks at most.
2014-07-30
Actually, it runs okay on my slaves when deployed in standalone mode.
When I switch to Mesos, the error occurs.
Anyway, thanks for your reply; any ideas will help.
I build it with sbt package, I run it with sbt run, and I do use
SparkConf.set for deployment options and external jars. It seems that
spark-submit can't load the extra jars, which leads to a NoClassDefFoundError;
should I pack all the jars into one giant jar and give that a try?
I run it on a cluster of 8
The weight vector is usually dense and if you have many partitions,
the driver may slow down. You can also take a look at the driver
memory inside the Executor tab in WebUI. Another setting to check is
the HDFS block size and whether the input data is evenly distributed
to the executors. Are the
On Mon, Jul 28, 2014 at 12:58 PM, l lishu...@gmail.com wrote:
I have a file in s3 that I want to map each line with an index. Here is my
code:
input_data = sc.textFile('s3n:/myinput',minPartitions=6).cache()
N = input_data.count()
index = sc.parallelize(range(N), 6)
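The snippet is cut off above; as a side note, RDD.zipWithIndex (available since Spark 1.0) pairs each element with its index directly, e.g. this Scala sketch with a made-up path:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local", "zip-with-index-sketch")
val inputData = sc.textFile("s3n://my-bucket/myinput", 6)

// Each line is paired with its global position, avoiding the separate
// count() + parallelize(range(N)) step from the snippet above.
val indexed: RDD[(String, Long)] = inputData.zipWithIndex()
indexed.take(5).foreach(println)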
Maybe Mesos or Spark was not configured correctly; could you check the log
files on the Mesos slaves?
They should log the reason when Mesos cannot launch the executor.
On Tue, Jul 29, 2014 at 10:39 PM, daijia jia_...@intsig.com wrote:
Actually, it runs okay on my slaves when deployed in standalone mode.