Hi all,
Thank you for the reply. Is there any example of Spark running in client
mode with Spray? I think I will choose this approach.
On Tue, Jun 24, 2014 at 4:55 PM, Koert Kuipers ko...@tresata.com wrote:
run your spark app in client mode together with a spray rest service, that
the
Hi all,
I'm trying to use Spark SQL to store data in a Parquet file. I create the
file and insert data into it with the following code:
val conf = new SparkConf().setAppName("MCT").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
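A sketch of how the rest might go, using the Parquet support in Spark 1.0 SQL (the case class, path, and table name below are made up for illustration):

import sqlContext.createSchemaRDD   // implicit conversion RDD[Product] -> SchemaRDD

case class Record(key: Int, value: String)
val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

rdd.saveAsParquetFile("data.parquet")                 // write the Parquet file
val loaded = sqlContext.parquetFile("data.parquet")   // read it back as a SchemaRDD
loaded.registerAsTable("records")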
rdd.coalesce() will take effect:
rdd.coalesce(1, true).saveAsTextFile(save_path)
Yeah I agree with Koert, it would be the lightest solution. I have
used it quite successfully and it just works.
There is nothing much Spark-specific here; you can follow this example,
https://github.com/jacobus/s4, on how to build your Spray service.
Then the easy solution would be to have a
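For illustration, a minimal sketch of a Spray service embedding a SparkContext in client mode (the object name, port, master URL, and example job are assumptions, not taken from this thread):

import akka.actor.ActorSystem
import org.apache.spark.{SparkConf, SparkContext}
import spray.routing.SimpleRoutingApp

object SparkRestService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("spark-rest")

  // One SparkContext, created in client mode when the service starts
  // and shared by all requests (assumed master URL).
  val sc = new SparkContext(
    new SparkConf().setAppName("rest-backend").setMaster("spark://master:7077"))

  startServer(interface = "localhost", port = 8080) {
    path("count") {
      get {
        complete {
          // Placeholder job; replace with your real computation.
          sc.parallelize(1 to 1000).count().toString
        }
      }
    }
  }
}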
Sorry, I just realized that start-slave is for a different task. Please close
this.
I'm running a very small job (16 partitions, 2 stages) on a 2-node cluster,
each with 15G memory, the master page looks all normal:
URL: spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077
Workers: 1
Cores: 2 Total, 2 Used
Memory: 13.9 GB Total, 512.0 MB Used
Applications: 1 Running, 0
Totally agree; also, there is a class 'SparkSubmit' you can call directly to
replace the shell script.
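For example, a rough sketch of calling it directly (the class name, master, and jar path are made-up placeholders):

// Hypothetical arguments, the same ones bin/spark-submit would pass along.
org.apache.spark.deploy.SparkSubmit.main(Array(
  "--class", "com.example.MyApp",
  "--master", "local[2]",
  "/path/to/my-app.jar"
))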
Hi lmk,
I am not aware of any classifier in MLlib that accepts nominal data.
They do accept an RDD of LabeledPoints, which are a label plus a vector of Doubles, so
you'll need to convert the nominal values to doubles.
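For illustration, a minimal sketch of one way to do the conversion, assuming sc is an existing SparkContext and the categories and labels are made up:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Map each nominal value to a Double index (assumed categories).
val colorIndex = Seq("red", "green", "blue").zipWithIndex
  .map { case (c, i) => (c, i.toDouble) }.toMap

// (label, nominal feature) pairs become LabeledPoints with numeric features.
val raw = sc.parallelize(Seq((1.0, "red"), (0.0, "blue")))
val points = raw.map { case (label, color) =>
  LabeledPoint(label, Vectors.dense(colorIndex(color)))
}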
Best regards, Alexander
-Original Message-
From: lmk
According to the „DataStax Brings Spark To Cassandra“ press release:
„DataStax has partnered with Databricks, the company founded by the creators
of Apache Spark, to build a supported, open source integration between the
two platforms. The partners expect to have the integration ready by this
summer.“
Hi, Robert --
I wonder if this is an instance of SPARK-2075:
https://issues.apache.org/jira/browse/SPARK-2075
-- Paul
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
On Wed, Jun 25, 2014 at 6:28 AM, Robert James srobertja...@gmail.com
wrote:
On 6/24/14, Robert James
Hi,
I am writing a standalone Spark program that gets its data from Cassandra.
I followed the examples and created the RDD via the newAPIHadoopRDD() and
the ColumnFamilyInputFormat class.
The RDD is created, but I get a NotSerializableException when I call the
RDD's .groupByKey() method:
public
The behavior you're seeing is by design, and it is VERY IMPORTANT to
understand why this happens because it can cause unexpected behavior in
various ways. I learned that the hard way. :-)
Spark collapses multiple transforms into a single stage wherever possible
(presumably for performance). The
To add Spark to an SBT project, I do:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
How do I make sure that the spark version which will be downloaded
will depend on, and use, Hadoop 2, and not Hadoop 1?
Even with a line:
libraryDependencies += "org.apache.hadoop" %
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % versionSpark % "provided"
    exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided"
)
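where versionSpark and versionHadoop might be defined, for example, like this (the Hadoop version is an assumption; match it to the Hadoop 2 release on your cluster):

val versionSpark  = "1.0.0"
val versionHadoop = "2.2.0"  // any Hadoop 2 release your cluster actually runs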
On Wed, Jun 25, 2014 at 11:26 AM, Robert James srobertja...@gmail.com
wrote:
To add Spark to a SBT
Hi Matei,
Sailthru is also using Spark. Could you please add us to the Powered By
Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page
when you have a chance?
Organization Name: Sailthru
URL: www.sailthru.com
Short Description: Our data science platform uses Spark to
Lately I am seeing a lot of this warning in GraphX:
org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two
VertexPartitions with different indexes is slow.
I am using Graph.outerJoinVertices to join in data from a regular RDD (that
is co-partitioned). I would like this operation to
Yep exactly! I’m not sure how complicated it would be to pull off. If someone
wouldn’t mind helping to get me pointed in the right direction I would be happy
to look into and contribute this functionality. I imagine this would be
implemented in the scheduler codebase and there would be some
Thanks Daniel and Nicholas for the helpful responses. I'll go with
coalesce(shuffle = true) and see how things go.
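For reference, a small sketch of the difference shuffle = true makes (the paths are made up):

val rdd = sc.textFile("hdfs://host/data").map(_.toUpperCase)

// Without a shuffle, coalesce(1) can be collapsed into the upstream stage,
// so the map above may effectively run as a single task.
rdd.coalesce(1).saveAsTextFile("out-narrow")

// With shuffle = true, a stage boundary is forced: the map keeps its
// original parallelism and only the final write runs as one task.
rdd.coalesce(1, shuffle = true).saveAsTextFile("out-shuffled")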
On Wed, Jun 25, 2014 at 8:19 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
The behavior you're seeing is by design, and it is VERY IMPORTANT to
understand why this happens
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23).
I'm trying to execute the following code:
import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val table =
  sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
According to
http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.0
, spark depends on Hadoop 1.0.4. What about the versions of Spark that
work with Hadoop 2? Do they also depend on Hadoop 1.0.4?
How does everyone handle this?
Hi durin,
I just tried this example (nice data, by the way!), *with each JSON
object on one line*, and it worked fine:
scala> rdd.printSchema()
root
 |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 |    |-- friends:
Hi Sophia, did you ever resolve this?
A common cause of the job not being given resources is that the RM cannot
communicate with the workers.
This itself has many possible causes. Do you have a full stack trace from
the logs?
Andrew
2014-06-13 0:46 GMT-07:00 Sophia sln-1...@163.com:
With the
Hi,
(My excuses for the cross-post from SO)
I'm trying to create Cassandra SSTables from the results of a batch
computation in Spark. Ideally, each partition should create the SSTable for
the data it holds in order to parallelize the process as much as possible
(and probably even stream it to
Hi Zongheng Yang,
thanks for your response. Reading your answer, I did some more tests and
realized that analyzing very small parts of the dataset (which is ~130GB in
~4.3M lines) works fine.
The error occurs when I analyze larger parts. Using 5% of the whole data,
the error is the same as
Can you not use a Cassandra OutputFormat? It seems they have BulkOutputFormat.
An example of using it with Hadoop is here:
http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html
Using it with Spark will be similar to the examples:
Is it possible you have blank lines in your input? Not that this should be
an error condition, but it may be what's causing it.
On Wed, Jun 25, 2014 at 11:57 AM, durin m...@simon-schaefer.net wrote:
Hi Zongheng Yang,
thanks for your response. Reading your answer, I did some more tests and
Thanks Anwar.
On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal anriza...@gmail.com wrote:
On Tue, Jun 17, 2014 at 5:39 PM, Chen Song chen.song...@gmail.com wrote:
Hey
I am new to spark streaming and apologize if these questions have been
asked.
* In StreamingContext, reduceByKey() seems
Thanks Nick.
We used the CassandraOutputFormat through Calliope. The Calliope API makes
the CassandraOutputFormat quite accessible and is cool to work with. It
worked fine at prototype level, but we had Hadoop version conflicts when we
put it in our Spark environment (Using our Spark assembly
Interesting question on Stack Overflow:
http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles
Is it possible to read gzipped files using wholeTextFiles()? Alternately,
is it possible to read the source file names using textFile()?
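For what it's worth, a small sketch of the two calls side by side (the paths are made up):

// textFile() decompresses .gz files transparently via the Hadoop codecs,
// but yields only the lines, not the name of the file they came from.
val lines = sc.textFile("hdfs://host/logs/*.gz")

// wholeTextFiles() yields (fileName, fileContents) pairs, so the source
// file names are available.
val byFile = sc.wholeTextFiles("hdfs://host/plain-logs/")
byFile.keys.collect().foreach(println)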
Is there an easy way to do a semi join in Spark Streaming?
Here is my problem, briefly: I have a DStream that will generate a set of
values, and I would like to check for existence in this set in other DStreams.
Is there an easy and standard way to model this problem? If not, can I write
spark streaming
Expanded to 4 nodes and changed the workers to listen on the public DNS, but it
still shows the same error (which is obviously wrong). I can't believe I'm the
first to encounter this issue.
Right, ok.
I can't say I've used the Cassandra OutputFormats before. But perhaps if
you use it directly (instead of via Calliope) you may be able to get it to
work, albeit with less concise code?
Or perhaps you may be able to build Cassandra from source with Hadoop 2 /
CDH4 support:
Is a python binding for LBFGS in the works? My co-worker has written one
and can contribute back if it helps.
On Mon, Jun 16, 2014 at 11:00 AM, DB Tsai dbt...@stanford.edu wrote:
Is your data normalized? Sometimes, GD doesn't work well if the data
has a wide range. If you are willing to write
Hi Durin,
I guess the blank lines caused the problem (as Aaron said). Right now,
jsonFile does not skip faulty lines. Can you first use sc.textFile to load
the file as an RDD[String] and then use filter to filter out those blank lines
(a code snippet can be found below)?
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
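Continuing the snippet, a sketch of the filtering step (the path is an assumption, and jsonRDD is the RDD[String] counterpart of jsonFile):

val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
val nonBlank = raw.filter(line => line.trim.nonEmpty)   // drop blank lines
val table = sqlContext.jsonRDD(nonBlank)                // parse the remaining JSON lines
table.printSchema()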
Hi All,
I see the following error messages on my worker nodes. Are they due to improper
cleanup or wrong configuration? Any help with this would be great!
14/06/25 12:30:55 INFO SecurityManager: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/06/25 12:30:55 INFO
There is no python binding for LBFGS. Feel free to submit a PR.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi mohitja...@gmail.com wrote:
Is a
Hi Yin and Aaron,
thanks for your help, this was indeed the problem. I've counted 1233 blank
lines using grep, and the code snippet below works with those.
From what you said, I guess that skipping faulty lines will be possible in
later versions?
Kind regards,
Simon
After upgrading to Spark 1.0.0, I get this error:
ERROR org.apache.spark.executor.ExecutorUncaughtExceptionHandler -
Uncaught exception in thread Thread[Executor task launch
worker-2,5,main]
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext,
Is there an equivalent of wholeTextFiles for binary files, for example a set
of images?
Cheers,
Jaonary
I am trying to install Spark on Hadoop+YARN.
I have installed spark using sbt (SPARK_HADOOP_VERSION=2.0.5-alpha
SPARK_YARN=true sbt/sbt assembly ). This has worked fine.
After that I am running :
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.5-alpha.jar
Hi guys, thanks for the direction. Now I have some problems/questions:
- in local (test) mode I want to use ElasticClient.local to create the ES
connection, but in production I want to use ElasticClient.remote; to do this I
want to pass ElasticClient to mapPartitions, or what is the best practice?
- my stream
On Wed, Jun 25, 2014 at 4:16 PM, boci boci.b...@gmail.com wrote:
Hi guys, thanks for the direction. Now I have some problems/questions:
- in local (test) mode I want to use ElasticClient.local to create the ES
connection, but in production I want to use ElasticClient.remote; to do this I
want to pass
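A common pattern for the first question is to build the client inside mapPartitions, so it is created on the executors rather than serialized from the driver. A rough sketch (the import, host, port, and flag are assumptions, and the exact ElasticClient.remote signature depends on your elastic4s version):

import com.sksamuel.elastic4s.ElasticClient   // assumed elastic4s import

val useLocalEs = true   // e.g. driven by your test/production config
val docs = sc.parallelize(Seq("doc1", "doc2"))

docs.mapPartitions { partition =>
  // The client is not serializable, so create it here, once per partition,
  // instead of passing it in from the driver.
  val client =
    if (useLocalEs) ElasticClient.local
    else ElasticClient.remote("es-host", 9300)
  partition.map { doc =>
    // index doc with client here ...
    doc
  }
}.count()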
Hi,
When I try requesting a large number of executors - e.g. 242, it doesn't
seem to actually reach that number. E.g., under the executors tab, I only
see an executor ID of up to 234.
This is despite the fact that there is plenty more memory available, as well as
CPU cores, etc., in the system. In
I'm doing coalesce with shuffle, then cache, and then thousands of iterations.
I noticed that sometimes Spark would, for no particular reason, perform a
partial coalesce again after running for a long time, and there was no
exception or failure on the workers' part.
Why is this happening?
Hi all,
I have a 2-machine Spark network I've set up: a master and worker on
machine1, and worker on machine2. When I run 'sbin/start-all.sh',
everything starts up as it should. I see both workers listed on the UI
page. The logs of both workers indicate successful registration with the
Spark
I have a log4j.xml in src/main/resources with
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
[...]
<root>
  <priority value="warn" />
  <appender-ref ref="Console" />
</root>
Hi,
Today Google announced their Cloud Dataflow, which is very similar to Spark
in performing batch processing and stream processing.
How does Spark compare to Google Cloud Dataflow? Are they solutions trying
to solve the same problem?
If you're using the spark-ec2 scripts, you may have to change
/root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
is added to the classpath before Spark's own conf.
On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote:
I have a log4j.xml in
I'm seeing the following message in the log of an executor. Anyone
seen this error? After this, the executor seems to lose the cache, but
besides that the whole thing slows down drastically - i.e. it gets
stuck in a reduce phase for 40+ minutes, whereas before it was
finishing reduces in 2~3
Hi,
I want to know the full list of functions, syntax, and features that Spark SQL
supports. Is there any documentation?
Regards,
Xiaobo Gu
You can find something in the API docs; nothing more than that, I think, for now.
Gianluca
On 25 Jun 2014, at 23:36, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
I want to know the full list of functions, syntax, and features that Spark SQL
supports. Is there any documentation?
Regards,
The API only says this:
public JavaSchemaRDD sql(String sqlQuery)
Executes a query expressed in SQL, returning the result as a JavaSchemaRDD.
But what kind of sqlQuery can we execute? Is there any more documentation?
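For what it's worth, a minimal Scala sketch of the kind of query sql() accepts (the table and column names are made up):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // implicit conversion to SchemaRDD

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 17)))
people.registerAsTable("people")   // registerAsTable in Spark 1.0

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.collect().foreach(println)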
Xiaobo Gu
-- Original --
From: