Re: Using Spark as web app backend

2014-06-25 Thread Jaonary Rabarisoa
Hi all, thank you for the reply. Is there any example of Spark running in client mode with Spray? I think I will choose this approach. On Tue, Jun 24, 2014 at 4:55 PM, Koert Kuipers ko...@tresata.com wrote: run your spark app in client mode together with a spray rest service, that the

Need help to make Spark SQL work in a standalone application

2014-06-25 Thread Jaonary Rabarisoa
Hi all, I'm trying to use Spark SQL to store data in a Parquet file. I create the file and insert data into it with the following code: val conf = new SparkConf().setAppName("MCT").setMaster("local[2]"); val sc = new SparkContext(conf); val sqlContext = new SQLContext(sc)
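A minimal end-to-end sketch of writing Parquet with Spark 1.0's Spark SQL, assuming a simple case-class schema (the Person class and file name here are illustrative, not from the original message):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    object ParquetWriteSketch extends App {
      val conf = new SparkConf().setAppName("MCT").setMaster("local[2]")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD  // implicitly turns an RDD of case classes into a SchemaRDD

      val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
      people.saveAsParquetFile("people.parquet")  // writes the data out as a Parquet file
    }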

Re: how to make saveAsTextFile NOT split output into multiple file?

2014-06-25 Thread randylu
rdd.coalesce() will take effect: rdd.coalesce(1, true).saveAsTextFile(save_path)
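For context, a short sketch (the output path is illustrative); in Spark 1.0, repartition(1) is shorthand for coalesce(1, shuffle = true):

    // one output partition => a single part-00000 file under the save path
    val merged = rdd.coalesce(1, shuffle = true)  // equivalent to rdd.repartition(1)
    merged.saveAsTextFile("/tmp/single-file-output")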

Re: Using Spark as web app backend

2014-06-25 Thread Eugen Cepoi
Yeah I agree with Koert, it would be the lightest solution. I have used it quite successfully and it just works. There is not much spark specifics here, you can follow this example https://github.com/jacobus/s4 on how to build your spray service. Then the easy solution would be to have a
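A minimal sketch of that setup, assuming spray 1.x's SimpleRoutingApp and a single long-lived SparkContext shared by all requests (object, route, and path names are illustrative):

    import akka.actor.ActorSystem
    import org.apache.spark.{SparkConf, SparkContext}
    import spray.routing.SimpleRoutingApp

    object SparkWebBackend extends App with SimpleRoutingApp {
      implicit val system = ActorSystem("spark-web-backend")
      // client mode: the driver, and thus the SparkContext, lives in this JVM
      val sc = new SparkContext(new SparkConf().setAppName("web-backend"))

      startServer(interface = "0.0.0.0", port = 8080) {
        path("count") {
          get {
            // each request runs a Spark job on the shared context
            complete(sc.textFile("hdfs://host/data.txt").count().toString)
          }
        }
      }
    }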

Re: Spark slave fails to start with weird error information

2014-06-25 Thread Peng Cheng
Sorry, I just realized that start-slave is for a different task. Please close this.

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
I'm running a very small job (16 partitions, 2 stages) on a 2-node cluster, each node with 15G memory, and the master page looks all normal: URL: spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077; Workers: 1; Cores: 2 Total, 2 Used; Memory: 13.9 GB Total, 512.0 MB Used; Applications: 1 Running, 0
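This message usually means the application is asking for more memory or cores than any worker currently offers. A hedged sketch of conservative settings that should fit the cluster described above:

    val conf = new org.apache.spark.SparkConf()
      .setMaster("spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077")
      .setAppName("small-job")
      .set("spark.executor.memory", "512m")  // must fit within a single worker's free memory
      .set("spark.cores.max", "2")           // must not exceed the cluster's free cores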

Re: Using Spark as web app backend

2014-06-25 Thread Peng Cheng
Totally agree. Also, there is a class 'SparkSubmit' you can call directly to replace the shell script.

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-25 Thread Ulanov, Alexander
Hi lmk, I am not aware of any classifier in MLlib that accepts nominal data. They accept RDDs of LabeledPoint, which is a label plus a vector of Doubles, so you'll need to convert the nominal values to Double. Best regards, Alexander -Original Message- From: lmk
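A sketch of one way to do the conversion, assuming a toy record type (Row and its fields are illustrative); each nominal value is mapped to a numeric index before building LabeledPoints. A one-hot encoding would avoid implying an ordering between categories.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    case class Row(label: Double, country: String, age: Double)
    val rows = sc.parallelize(Seq(Row(1.0, "US", 34.0), Row(0.0, "DE", 51.0)))

    // build a dictionary from nominal value to index, then encode
    val countryIndex = rows.map(_.country).distinct().collect().zipWithIndex.toMap
    val points = rows.map { r =>
      LabeledPoint(r.label, Vectors.dense(countryIndex(r.country).toDouble, r.age))
    }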

Cassandra and Spark checkpoints

2014-06-25 Thread toivoa
According to the "DataStax Brings Spark To Cassandra" press release: "DataStax has partnered with Databricks, the company founded by the creators of Apache Spark, to build a supported, open source integration between the two platforms. The partners expect to have the integration ready by this summer."

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Paul Brown
Hi, Robert -- I wonder if this is an instance of SPARK-2075: https://issues.apache.org/jira/browse/SPARK-2075 -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Wed, Jun 25, 2014 at 6:28 AM, Robert James srobertja...@gmail.com wrote: On 6/24/14, Robert James

Spark and Cassandra - NotSerializableException

2014-06-25 Thread shaiw75
Hi, I am writing a standalone Spark program that gets its data from Cassandra. I followed the examples and created the RDD via the newAPIHadoopRDD() and the ColumnFamilyInputFormat class. The RDD is created, but I get a NotSerializableException when I call the RDD's .groupByKey() method: public
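One common workaround, sketched below under the assumption that the RDD's keys are ByteBuffers from ColumnFamilyInputFormat (not necessarily the poster's exact situation): map the Cassandra row objects to plainly serializable types before any shuffle such as groupByKey.

    import org.apache.spark.SparkContext._  // brings in PairRDDFunctions (groupByKey etc.)
    import org.apache.cassandra.utils.ByteBufferUtil

    // casRdd: RDD[(java.nio.ByteBuffer, java.util.SortedMap[java.nio.ByteBuffer, IColumn])]
    // convert keys and values to serializable types before shuffling
    val pairs = casRdd.map { case (key, columns) =>
      (ByteBufferUtil.string(key), columns.size)
    }
    pairs.groupByKey()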

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Daniel Siegmann
The behavior you're seeing is by design, and it is VERY IMPORTANT to understand why this happens because it can cause unexpected behavior in various ways. I learned that the hard way. :-) Spark collapses multiple transforms into a single stage wherever possible (presumably for performance). The

Spark's Hadoop Dependency

2014-06-25 Thread Robert James
To add Spark to an SBT project, I do: libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided" How do I make sure that the Spark version which will be downloaded will depend on, and use, Hadoop 2, and not Hadoop 1? Even with a line: libraryDependencies += "org.apache.hadoop" %

Re: Spark's Hadoop Dependency

2014-06-25 Thread Koert Kuipers
libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % versionSpark % "provided" exclude("org.apache.hadoop", "hadoop-client"), "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided" ) On Wed, Jun 25, 2014 at 11:26 AM, Robert James srobertja...@gmail.com wrote: To add Spark to an SBT

Re: Powered by Spark addition

2014-06-25 Thread Alex Gaudio
Hi Matei, Sailthru is also using Spark. Could you please add us to the Powered By Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page when you have a chance? Organization Name: Sailthru URL: www.sailthru.com Short Description: Our data science platform uses Spark to

graphx Joining two VertexPartitions with different indexes is slow.

2014-06-25 Thread Koert Kuipers
Lately I am seeing a lot of this warning in GraphX: org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. I am using Graph.outerJoinVertices to join in data from a regular RDD (that is co-partitioned). I would like this operation to

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
Yep, exactly! I’m not sure how complicated it would be to pull off. If someone wouldn’t mind helping to get me pointed in the right direction, I would be happy to look into it and contribute this functionality. I imagine this would be implemented in the scheduler codebase and there would be some

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Alex Boisvert
Thanks Daniel and Nicholas for the helpful responses. I'll go with coalesce(shuffle = true) and see how things go. On Wed, Jun 25, 2014 at 8:19 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: The behavior you're seeing is by design, and it is VERY IMPORTANT to understand why this happens

jsonFile function in SQLContext does not work

2014-06-25 Thread durin
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23). I'm trying to execute the following code: import org.apache.spark.SparkContext._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) val table = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")

Spark's Maven dependency on Hadoop 1

2014-06-25 Thread Robert James
According to http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.0 , spark depends on Hadoop 1.0.4. What about the versions of Spark that work with Hadoop 2? Do they also depend on Hadoop 1.0.4? How does everyone handle this?

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin, I just tried this example (nice data, by the way!), *with each JSON object on one line*, and it worked fine: scala> rdd.printSchema() root |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef | |-- friends:

Re: Spark 1.0.0 on yarn cluster problem

2014-06-25 Thread Andrew Or
Hi Sophia, did you ever resolve this? A common cause for not giving resources to the job is that the RM cannot communicate with the workers. This itself has many possible causes. Do you have a full stack trace from the logs? Andrew 2014-06-13 0:46 GMT-07:00 Sophia sln-1...@163.com: With the

Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Hi, (my apologies for the cross-post from SO) I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create the SSTable for the data it holds, in order to parallelize the process as much as possible (and probably even stream it to

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Zongheng Yang, thanks for your response. Reading your answer, I did some more tests and realized that analyzing very small parts of the dataset (which is ~130GB in ~4.3M lines) works fine. The error occurs when I analyze larger parts. Using 5% of the whole data, the error is the same as

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
Can you not use a Cassandra OutputFormat? It seems they have BulkOutputFormat. An example of using it with Hadoop is here: http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html Using it with Spark will be similar to the examples:

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Aaron Davidson
Is it possible you have blank lines in your input? Not that this should be an error condition, but it may be what's causing it. On Wed, Jun 25, 2014 at 11:57 AM, durin m...@simon-schaefer.net wrote: Hi Zongheng Yang, thanks for your response. Reading your answer, I did some more tests and

Re: spark streaming questions

2014-06-25 Thread Chen Song
Thanks Anwar. On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal anriza...@gmail.com wrote: On Tue, Jun 17, 2014 at 5:39 PM, Chen Song chen.song...@gmail.com wrote: Hey, I am new to Spark Streaming and apologize if these questions have been asked. * In StreamingContext, reduceByKey() seems

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Thanks Nick. We used the CassandraOutputFormat through Calliope. The Calliope API makes the CassandraOutputFormat quite accessible and is cool to work with. It worked fine at the prototype level, but we had Hadoop version conflicts when we put it into our Spark environment (using our Spark assembly

wholeTextFiles and gzip

2014-06-25 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles Is it possible to read gzipped files using wholeTextFiles()? Alternately, is it possible to read the source file names using textFile()?

semi join spark streaming

2014-06-25 Thread Chen Song
Is there an easy way to do a semi join in Spark Streaming? Here is my problem, briefly: I have a DStream that will generate a set of values, and I would like to check for existence in this set in other DStreams. Is there an easy and standard way to model this problem? If not, can I write spark streaming
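One per-batch approximation, sketched below under the assumption that matching within the same batch interval is acceptable and using illustrative element types: key both streams and use DStream.transformWith with an inner join.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // keep only the events whose key appears in `keys` during the same batch
    def semiJoin(events: DStream[(String, String)],
                 keys: DStream[String]): DStream[(String, String)] =
      events.transformWith(
        keys.map(k => (k, ())),
        (e: RDD[(String, String)], k: RDD[(String, Unit)]) =>
          e.join(k.distinct()).mapValues(_._1))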

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
Expanded to 4 nodes and changed the workers to listen to the public DNS, but it still shows the same error (which is obviously wrong). I can't believe I'm the first to encounter this issue.

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
Right, ok. I can't say I've used the Cassandra OutputFormats before. But perhaps if you use it directly (instead of via Calliope) you may be able to get it to work, albeit with less concise code? Or perhaps you may be able to build Cassandra from source with Hadoop 2 / CDH4 support:

Re: pyspark regression results way off

2014-06-25 Thread Mohit Jaggi
Is a Python binding for LBFGS in the works? My co-worker has written one and can contribute it back if it helps. On Mon, Jun 16, 2014 at 11:00 AM, DB Tsai dbt...@stanford.edu wrote: Is your data normalized? Sometimes, GD doesn't work well if the data has a wide range. If you are willing to write

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Yin Huai
Hi Durin, I guess that blank lines caused the problem (as Aaron said). Right now, jsonFile does not skip faulty lines. Can you first use sc.textFile to load the file as an RDD[String] and then use filter to filter out those blank lines (a code snippet can be found below)? val sqlContext = new
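The snippet is cut off above; it presumably continued along these lines (a reconstruction, assuming jsonRDD is available in the same snapshot as jsonFile):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
    val table = sqlContext.jsonRDD(raw.filter(_.trim.nonEmpty))  // drop blank lines first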

Worker nodes: Error messages

2014-06-25 Thread Sameer Tilak
Hi All, I see the following error messages on my worker nodes. Are they due to improper cleanup or wrong configuration? Any help with this would be great! 14/06/25 12:30:55 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/25 12:30:55 INFO

Re: pyspark regression results way off

2014-06-25 Thread DB Tsai
There is no python binding for LBFGS. Feel free to submit a PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is a

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Yin and Aaron, thanks for your help, this was indeed the problem. I've counted 1233 blank lines using grep, and the code snippet below works with those. From what you said, I guess that skipping faulty lines will be possible in later versions? Kind regards, Simon

Hadoop interface vs class

2014-06-25 Thread Robert James
After upgrading to Spark 1.0.0, I get this error: ERROR org.apache.spark.executor.ExecutorUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,

wholeTextFiles like for binary files ?

2014-06-25 Thread Jaonary Rabarisoa
Is there an equivalent of wholeTextFiles for binary files, for example a set of images? Cheers, Jaonary

trouble: Launching Spark on Hadoop + YARN

2014-06-25 Thread sdeb
I am trying to install Spark on Hadoop + YARN. I have installed Spark using sbt (SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly). This has worked fine. After that I am running: SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.5-alpha.jar

Re: ElasticSearch enrich

2014-06-25 Thread boci
Hi guys, thanks for the direction. Now I have some problems/questions: - in local (test) mode I want to use ElasticClient.local to create the ES connection, but in production I want to use ElasticClient.remote; to do this I want to pass the ElasticClient to mapPartitions, or what is the best practice? - my stream

Re: ElasticSearch enrich

2014-06-25 Thread Holden Karau
On Wed, Jun 25, 2014 at 4:16 PM, boci boci.b...@gmail.com wrote: Hi guys, thanks for the direction. Now I have some problems/questions: - in local (test) mode I want to use ElasticClient.local to create the ES connection, but in production I want to use ElasticClient.remote; to do this I want to pass
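A sketch of the per-partition pattern under discussion, using the elastic4s-style names from the thread (the testMode flag and the enrichment step are illustrative): create the client inside mapPartitions so it is constructed on the executor and never shipped in a closure.

    rdd.mapPartitions { docs =>
      // one client per partition, created on the executor itself
      val client =
        if (testMode) ElasticClient.local              // local/test mode
        else ElasticClient.remote("es-host", 9300)     // production
      val enriched = docs.map { doc =>
        // ... enrich or index `doc` using `client` here ...
        doc
      }.toList                                         // force evaluation before closing
      client.close()
      enriched.iterator
    }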

Number of executors smaller than requested in YARN.

2014-06-25 Thread Sung Hwan Chung
Hi, When I try requesting a large number of executors, e.g. 242, it doesn't seem to actually reach that number. E.g., under the executors tab, I only see executor IDs of up to 234. This despite the fact that there's plenty more memory available, as well as CPU cores, etc., in the system. In

Does Spark restart cached workers even without failures?

2014-06-25 Thread Sung Hwan Chung
I'm doing coalesce with shuffle, then cache, and then run thousands of iterations. I noticed that sometimes Spark would, for no particular reason, perform a partial coalesce again after running for a long time, and there was no exception or failure on the workers' part. Why is this happening?

Spark standalone network configuration problems

2014-06-25 Thread Shannon Quinn
Hi all, I have a 2-machine Spark network I've set up: a master and worker on machine1, and a worker on machine2. When I run 'sbin/start-all.sh', everything starts up as it should. I see both workers listed on the UI page. The logs of both workers indicate successful registration with the Spark

Re: Changing log level of spark

2014-06-25 Thread Tobias Pfeiffer
I have a log4j.xml in src/main/resources with <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd"> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/"> [...] <root> <priority value="warn" /> <appender-ref ref="Console" /> </root>

Spark vs Google cloud dataflow

2014-06-25 Thread Aureliano Buendia
Hi, Today Google announced their Cloud Dataflow, which is very similar to Spark in performing batch processing and stream processing. How does Spark compare to Google Cloud Dataflow? Are they solutions aiming at the same problem?

Re: Changing log level of spark

2014-06-25 Thread Aaron Davidson
If you're using the spark-ec2 scripts, you may have to change /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that is added to the classpath before Spark's own conf. On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: I have a log4j.xml in
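For example, raising the root level in whichever log4j.properties ends up first on the classpath; the line below follows the form of Spark's conf/log4j.properties.template:

    # log everything to the console at WARN instead of the default INFO
    log4j.rootCategory=WARN, console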

Spark executor error

2014-06-25 Thread Sung Hwan Chung
I'm seeing the following message in the log of an executor. Has anyone seen this error? After this, the executor seems to lose the cache, but besides that the whole thing slows down drastically, i.e., it gets stuck in a reduce phase for 40+ minutes, whereas before it was finishing reduces in 2~3

Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread guxiaobo1982
Hi, I want to know the full list of functions, syntax, and features that Spark SQL supports. Is there any documentation for this? Regards, Xiaobo Gu

Re: Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread Gianluca Privitera
You can find something in the API docs; nothing more than that for now, I think. Gianluca On 25 Jun 2014, at 23:36, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, I want to know the full list of functions, syntax, and features that Spark SQL supports. Is there any documentation for this? Regards,

Re: Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread guxiaobo1982
The API only says this: public JavaSchemaRDD sql(String sqlQuery) "Executes a query expressed in SQL, returning the result as a JavaSchemaRDD". But what kind of sqlQuery can we execute? Is there any more documentation? Xiaobo Gu