Re: wholeTextFiles like for binary files ?

2014-06-25 Thread Akhil Das
You cannot read image files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable (source proving it): override def createRecordReader( split: InputSplit, contex
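For a wholeTextFiles-style way to read whole binary files (e.g. images), a minimal sketch, assuming a Spark version that provides SparkContext.binaryFiles (it was added after the 1.0 release this thread is about, and the path below is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("binary-files"))
// One (filename, PortableDataStream) pair per file, analogous to wholeTextFiles.
val images = sc.binaryFiles("hdfs:///data/images")
// toArray() reads each file fully into memory as Array[Byte].
val sizes = images.mapValues(stream => stream.toArray().length)
sizes.collect().foreach { case (name, bytes) => println(s"$name: $bytes bytes") }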

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Akhil Das
Try deleting the .ivy2 directory in your home directory and then doing an sbt clean assembly; that should solve this issue, I guess. Thanks Best Regards On Thu, Jun 26, 2014 at 3:10 AM, Robert James wrote: > In case anyone else is having this problem, deleting all ivy's cache, > then doing a sbt clean, then recompil

Re: Worker nodes: Error messages

2014-06-25 Thread Akhil Das
Can you paste the stderr from the worker logs? (Found in the work/app-20140625133031-0002/ directory.) Most likely you might need to set SPARK_MASTER_IP in your spark-env.sh file. (Not sure why I'm seeing akka.tcp://spark@localhost:56569 instead of akka.tcp://spark@*serverip*:56569) Thanks Best Regard
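For reference, a minimal sketch of the spark-env.sh settings being suggested (the addresses are placeholders, not taken from this thread):

# conf/spark-env.sh on the master (hypothetical address)
export SPARK_MASTER_IP=192.168.1.10
# conf/spark-env.sh on each worker, so it does not bind/advertise as "localhost"
export SPARK_LOCAL_IP=192.168.1.11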

Re: Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread guxiaobo1982
The API only says this: public JavaSchemaRDD sql(String sqlQuery) - Executes a query expressed in SQL, returning the result as a JavaSchemaRDD. But what kind of sqlQuery can we execute? Is there any more documentation? Xiaobo Gu -- Original -- From: "Gia

Re: Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread Gianluca Privitera
You can find something in the API, nothing more than that I think for now. Gianluca On 25 Jun 2014, at 23:36, guxiaobo1982 wrote: > Hi, > > I want to know the full list of functions, syntax, features that Spark SQL > supports, is there some documentations. > > > Regards, > > Xiaobo Gu

Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread guxiaobo1982
Hi, I want to know the full list of functions, syntax, and features that Spark SQL supports. Is there any documentation? Regards, Xiaobo Gu

Spark executor error

2014-06-25 Thread Sung Hwan Chung
I'm seeing the following message in the log of an executor. Anyone seen this error? After this, the executor seems to lose the cache, and besides that the whole thing slows down drastically, i.e. it gets stuck in a reduce phase for 40+ minutes, whereas before it was finishing reduces in 2~3 se

Re: Changing log level of spark

2014-06-25 Thread Aaron Davidson
If you're using the spark-ec2 scripts, you may have to change /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that is added to the classpath before Spark's own conf. On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer wrote: > I have a log4j.xml in src/main/resources with > >

Spark vs Google cloud dataflow

2014-06-25 Thread Aureliano Buendia
Hi, Today Google announced their cloud dataflow, which is very similar to Spark in performing batch processing and stream processing. How does Spark compare to Google cloud dataflow? Are they solutions aimed at the same problem?

Re: Changing log level of spark

2014-06-25 Thread Tobias Pfeiffer
I have a log4j.xml in src/main/resources with <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/"> [...] and that is included in the jar I package with `sbt assembly`. That works fine for me, at least on the driver. Tobias On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck wrote
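For readers following along, a minimal sketch of such a src/main/resources/log4j.xml (the levels and pattern are placeholders, not the configuration from this thread):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"/>
    </layout>
  </appender>
  <root>
    <priority value="WARN"/>
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>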

Spark standalone network configuration problems

2014-06-25 Thread Shannon Quinn
Hi all, I have a 2-machine Spark network I've set up: a master and a worker on machine1, and a worker on machine2. When I run 'sbin/start-all.sh', everything starts up as it should. I see both workers listed on the UI page. The logs of both workers indicate successful registration with the Spark

Does Spark restart cached workers even without failures?

2014-06-25 Thread Sung Hwan Chung
I'm doing coalesce with shuffle, cache and then do thousands of iterations. I noticed that sometimes Spark would for no particular reason perform partial coalesce again after running for a long time - and there was no exception or failure on the worker's part. Why is this happening?

Number of executors smaller than requested in YARN.

2014-06-25 Thread Sung Hwan Chung
Hi, When I try requesting a large number of executors - e.g. 242, it doesn't seem to actually reach that number. E.g., under the executors tab, I only see executor IDs of up to 234. This is despite the fact that there is plenty more memory available, as well as CPU cores, etc., in the system. In fact,

Re: ElasticSearch enrich

2014-06-25 Thread Holden Karau
On Wed, Jun 25, 2014 at 4:16 PM, boci wrote: > Hi guys, thanks the direction now I have some problem/question: > - in local (test) mode I want to use ElasticClient.local to create es > connection, but in prodution I want to use ElasticClient.remote, to this I > want to pass ElasticClient to mapPa

Re: ElasticSearch enrich

2014-06-25 Thread boci
Hi guys, thanks for the direction. Now I have some problems/questions: - in local (test) mode I want to use ElasticClient.local to create the es connection, but in production I want to use ElasticClient.remote; for this I want to pass the ElasticClient to mapPartitions, or what is the best practice? - my stream ou
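One common pattern for the mapPartitions question, as a hedged sketch: build the (non-serializable) client inside each partition instead of shipping it from the driver. The host, port, enrichDoc function, and the exact elastic4s method names (remote/close) are assumptions to be checked against the library version in use:

val enriched = docs.mapPartitions { iter =>
  val client = ElasticClient.remote("es-host", 9300)   // created on the executor, never serialized
  val out = iter.map(doc => enrichDoc(client, doc)).toList   // force evaluation before closing
  client.close()
  out.iterator
}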

trouble: Launching spark on hadoop + yarn.

2014-06-25 Thread sdeb
I am trying to install Spark on Hadoop+Yarn. I have installed Spark using sbt (SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly ). This has worked fine. After that I am running : SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.5-alpha.jar ./bin/spark-cl

wholeTextFiles like for binary files ?

2014-06-25 Thread Jaonary Rabarisoa
Is there an equivalent of wholeTextFiles for binary files, for example a set of images? Cheers, Jaonary

Hadoop interface vs class

2014-06-25 Thread Robert James
After upgrading to Spark 1.0.0, I get this error: ERROR org.apache.spark.executor.ExecutorUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, bu

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Robert James
In case anyone else is having this problem, deleting all of Ivy's cache, then doing an sbt clean, then recompiling everything, repackaging, and reassembling, seems to have solved the problem. (From the sbt docs, it seems that having to delete Ivy's cache means a bug in sbt.) On 6/25/14, Robert James

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Yin and Aaron, thanks for your help, this was indeed the problem. I've counted 1233 blank lines using grep, and the code snippet below works with those. From what you said, I guess that skipping faulty lines will be possible in later versions? Kind regards, Simon -- View this message in c

Re: pyspark regression results way off

2014-06-25 Thread DB Tsai
There is no python binding for LBFGS. Feel free to submit a PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi wrote: > Is a python binding for

Worker nodes: Error messages

2014-06-25 Thread Sameer Tilak
Hi All, I see the following error messages on my worker nodes. Are they due to improper cleanup or wrong configuration? Any help with this would be great! 14/06/25 12:30:55 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/25 12:30:55 INFO

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Yin Huai
Hi Durin, I guess that blank lines caused the problem (like Aaron said). Right now, jsonFile does not skip faulty lines. Can you first use sc.textFile to load the file as an RDD[String] and then use filter to filter out those blank lines (code snippet can be found below)? val sqlContext = new org.ap
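A minimal sketch of that suggestion (the path is a placeholder, and jsonRDD is assumed to be available in the Spark SQL build being used alongside jsonFile):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load as plain text first, drop blank lines, then parse the remaining JSON lines.
val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
val nonBlank = raw.filter(line => line.trim.nonEmpty)
val table = sqlContext.jsonRDD(nonBlank)
table.printSchema()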

Re: pyspark regression results way off

2014-06-25 Thread Mohit Jaggi
Is a python binding for LBFGS in the works? My co-worker has written one and can contribute back if it helps. On Mon, Jun 16, 2014 at 11:00 AM, DB Tsai wrote: > Is your data normalized? Sometimes, GD doesn't work well if the data > has wide range. If you are willing to write scala code, you can

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
Right, ok. I can't say I've used the Cassandra OutputFormats before. But perhaps if you use it directly (instead of via Calliope) you may be able to get it to work, albeit with less concise code? Or perhaps you may be able to build Cassandra from source with Hadoop 2 / CDH4 support: https://group

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
Expanded to 4 nodes and changed the workers to listen on the public DNS, but it still shows the same error (which is obviously wrong). I can't believe I'm the first to encounter this issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TaskSchedulerImpl-Initial

semi join spark streaming

2014-06-25 Thread Chen Song
Is there an easy way to do a semi join in Spark Streaming? Here is my problem briefly: I have a DStream that will generate a set of values. I would like to check for existence in this set in other DStreams. Is there an easy and standard way to model this problem? If not, can I write spark streaming j
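One way to model a per-batch semi join, as a hedged sketch (keyStream, events, and the String types are hypothetical; transformWith pairs up the two streams' RDDs batch by batch):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// keyStream produces the keys to keep; events is the (key, value) stream being filtered.
val keySet: DStream[(String, Unit)] = keyStream.map(k => (k, ()))
val semiJoined: DStream[(String, String)] =
  events.transformWith(keySet, (left: RDD[(String, String)], right: RDD[(String, Unit)]) =>
    left.join(right).mapValues { case (v, _) => v })   // keep only left records whose key exists on the right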

wholeTextFiles and gzip

2014-06-25 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles Is it possible to read gzipped files using wholeTextFiles()? Alternately, is it possible to read the source file names using textFile()? -- View this message

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Thanks Nick. We used the CassandraOutputFormat through Calliope. The Calliope API makes the CassandraOutputFormat quite accessible and is cool to work with. It worked fine at prototype level, but we had Hadoop version conflicts when we put it in our Spark environment (Using our Spark assembly co

Re: spark streaming questions

2014-06-25 Thread Chen Song
Thanks Anwar. On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal wrote: > > On Tue, Jun 17, 2014 at 5:39 PM, Chen Song wrote: > >> Hey >> >> I am new to spark streaming and apologize if these questions have been >> asked. >> >> * In StreamingContext, reduceByKey() seems to only work on the RDDs of

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Aaron Davidson
Is it possible you have blank lines in your input? Not that this should be an error condition, but it may be what's causing it. On Wed, Jun 25, 2014 at 11:57 AM, durin wrote: > Hi Zongheng Yang, > > thanks for your response. Reading your answer, I did some more tests and > realized that analyzi

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
Can you not use a Cassandra OutputFormat? Seems they have BulkOutputFormat. An example of using it with Hadoop is here: http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html Using it with Spark will be similar to the examples: https://github.com/apache/spark/blob/maste

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Zongheng Yang, thanks for your response. Reading your answer, I did some more tests and realized that analyzing very small parts of the dataset (which is ~130GB in ~4.3M lines) works fine. The error occurs when I analyze larger parts. Using 5% of the whole data, the error is the same as posted

Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Hi, (my apologies for the cross-post from SO) I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create the SSTable for the data it holds in order to parallelize the process as much as possible (and probably even stream it to the
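A hedged sketch of that per-partition idea (the keyspace, table schema, output directory, and (String, Int) record type are made up for illustration; CQLSSTableWriter's builder options should be checked against the Cassandra version in use, and the resulting SSTables would still need to be streamed in, e.g. with sstableloader):

import java.io.File
import org.apache.cassandra.io.sstable.CQLSSTableWriter

rdd.foreachPartition { rows =>
  val dir = new File(s"/tmp/sstables/${java.util.UUID.randomUUID()}")
  dir.mkdirs()
  val writer = CQLSSTableWriter.builder()
    .inDirectory(dir)
    .forTable("CREATE TABLE ks.events (id text PRIMARY KEY, value int)")   // hypothetical schema
    .using("INSERT INTO ks.events (id, value) VALUES (?, ?)")
    .build()
  // Each partition writes its own SSTable files locally, in parallel with the others.
  rows.foreach { case (id: String, value: Int) => writer.addRow(id, Int.box(value)) }
  writer.close()
}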

Re: Spark 1.0.0 on yarn cluster problem

2014-06-25 Thread Andrew Or
Hi Sophia, did you ever resolve this? A common cause for not giving resources to the job is that the RM cannot communicate with the workers. This itself has many possible causes. Do you have a full stack trace from the logs? Andrew 2014-06-13 0:46 GMT-07:00 Sophia : > With the yarn-client mode

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin, I just tried this example (nice data, by the way!), *with each JSON object on one line*, and it worked fine: scala> rdd.printSchema() root |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef ||-- friends: ArrayType[org.apache.spark.sql.catalyst.types.StructType$
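For completeness, a tiny hedged example of the input layout that worked here (one complete JSON object per line; the path and fields are placeholders):

// data.json, one full JSON object per line, no pretty-printing across lines:
//   {"name": "alice", "age": 1}
//   {"name": "bob", "age": 2}
val people = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
people.printSchema()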

Spark's Maven dependency on Hadoop 1

2014-06-25 Thread Robert James
According to http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.0 , spark depends on Hadoop 1.0.4. What about the versions of Spark that work with Hadoop 2? Do they also depend on Hadoop 1.0.4? How does everyone handle this?

jsonFile function in SQLContext does not work

2014-06-25 Thread durin
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23). I'm trying to execute the following code: import org.apache.spark.SparkContext._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) val table = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Alex Boisvert
Thanks Daniel and Nicholas for the helpful responses. I'll go with coalesce(shuffle = true) and see how things go. On Wed, Jun 25, 2014 at 8:19 AM, Daniel Siegmann wrote: > The behavior you're seeing is by design, and it is VERY IMPORTANT to > understand why this happens because it can cause u

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
Yep exactly! I’m not sure how complicated it would be to pull off. If someone wouldn’t mind helping to get me pointed in the right direction I would be happy to look into and contribute this functionality. I imagine this would be implemented in the scheduler codebase and there would be some s

graphx Joining two VertexPartitions with different indexes is slow.

2014-06-25 Thread Koert Kuipers
Lately I am seeing a lot of this warning in graphx: org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. I am using Graph.outerJoinVertices to join in data from a regular RDD (that is co-partitioned). I would like this operation to

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-25 Thread Debasish Das
Libsvm dataset converters are data dependent, since your input data can be in any serialization format and not necessarily csv... We have flows that convert HDFS data to a libsvm/sparse vector RDD which is sent to MLlib. I am not sure if it will be easy to standardize a libsvm converter on data tha

Re: Powered by Spark addition

2014-06-25 Thread Alex Gaudio
Hi Matei, Sailthru is also using Spark. Could you please add us to the Powered By Spark page when you have a chance? Organization Name: Sailthru URL: www.sailthru.com Short Description: Our data science platform uses Spark to

Re: Spark's Hadooop Dependency

2014-06-25 Thread Koert Kuipers
libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % versionSpark % "provided" exclude("org.apache.hadoop", "hadoop-client"), "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided" ) On Wed, Jun 25, 2014 at 11:26 AM, Robert James wrote: > To add Spark to a SBT projec

Spark's Hadooop Dependency

2014-06-25 Thread Robert James
To add Spark to a SBT project, I do: libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided" How do I make sure that the spark version which will be downloaded will depend on, and use, Hadoop 2, and not Hadoop 1? Even with a line: libraryDependencies += "org.apache.h

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Daniel Siegmann
The behavior you're seeing is by design, and it is VERY IMPORTANT to understand why this happens because it can cause unexpected behavior in various ways. I learned that the hard way. :-) Spark collapses multiple transforms into a single "stage" wherever possible (presumably for performance). The
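A small sketch of the behaviour being described (the path and the expensiveParse function are placeholders):

val parsed = sc.textFile("hdfs:///input", 100).map(expensiveParse)

// Without a shuffle, coalesce is pipelined into the same stage as the map,
// so expensiveParse effectively runs in only 4 tasks.
val narrow = parsed.coalesce(4)

// With shuffle = true, a stage boundary is inserted: the map keeps its ~100 tasks
// and only the result is repartitioned down to 4.
val shuffled = parsed.coalesce(4, shuffle = true)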

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Robert James
Thanks Paul. I'm unable to follow the discussion on SPARK-2075. But how would you recommend I test or follow up on that? Is there a workaround? On 6/25/14, Paul Brown wrote: > Hi, Robert -- > > I wonder if this is an instance of SPARK-2075: > https://issues.apache.org/jira/browse/SPARK-2075 > >

Spark and Cassandra - NotSerializableException

2014-06-25 Thread shaiw75
Hi, I am writing a standalone Spark program that gets its data from Cassandra. I followed the examples and created the RDD via the newAPIHadoopRDD() and the ColumnFamilyInputFormat class. The RDD is created, but I get a NotSerializableException when I call the RDD's .groupByKey() method: public s
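A hedged sketch of one common workaround, modeled on the CassandraTest example that ships with Spark: convert the Hadoop-format key/column objects into plain serializable values before the shuffle (the 'job' is assumed to be a Hadoop Job configured with ConfigHelper as in the original program):

import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.spark.SparkContext._

val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])

// Turn the ByteBuffer keys into Strings (and drop the column objects) before shuffling,
// so groupByKey only has to move plain serializable data.
val plain = casRdd.map { case (key, columns) => (ByteBufferUtil.string(key), columns.size) }
val grouped = plain.groupByKey()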

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Paul Brown
Hi, Robert -- I wonder if this is an instance of SPARK-2075: https://issues.apache.org/jira/browse/SPARK-2075 -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Wed, Jun 25, 2014 at 6:28 AM, Robert James wrote: > On 6/24/14, Robert James wrote: > > My app works f

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Robert James
On 6/24/14, Robert James wrote: > My app works fine under Spark 0.9. I just tried upgrading to Spark > 1.0, by downloading the Spark distro to a dir, changing the sbt file, > and running sbt assembly, but I get now NoSuchMethodErrors when trying > to use spark-submit. > > I copied in the SimpleAp

RE: Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-25 Thread Aaron
Thank you, Mayur. Could you provide some pseudo code for what the direct lookup would be like? I have struggled to implement that. I ended up doing a Cartesian product of (key, values) to itself. Something like this… mappedToLines = input.map(lambda line: line.split()) items = mappedToLines.

Cassandra and Spark checkpoints

2014-06-25 Thread toivoa
According to the "DataStax Brings Spark To Cassandra" press release: "DataStax has partnered with Databricks, the company founded by the creators of Apache Spark, to build a supported, open source integration between the two platforms. The partners expect to have the integration ready by this summer."

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-25 Thread Ulanov, Alexander
Hi lmk, I am not aware of any classifier in MLLib that accepts nominal data. They do accept RDDs of LabeledPoints, which are a label plus a vector of Doubles. So, you'll need to convert nominal values to doubles. Best regards, Alexander -Original Message- From: lmk [mailto:lakshmi.muralikrish...
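A hedged sketch of that conversion (the Record fields label/country/age/income and the 'data' RDD are invented for illustration): index each nominal value, then one-hot encode it into the Double feature vector of a LabeledPoint.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

case class Record(label: Double, country: String, age: Double, income: Double)

// Build an index for the nominal column, then one-hot encode it alongside the numeric features.
val countryIndex = data.map(_.country).distinct().collect().zipWithIndex.toMap
val points = data.map { r =>
  val oneHot = Array.fill(countryIndex.size)(0.0)
  oneHot(countryIndex(r.country)) = 1.0
  LabeledPoint(r.label, Vectors.dense(oneHot ++ Array(r.age, r.income)))
}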

Re: Is there anyone who can explain why the function of ALS.train give different shuffle results when execute the same transformation flatMap

2014-06-25 Thread Nick Pentreath
How many users and items do you have? Each iteration will first iterate through users and then items, so each iteration of ALS actually ends up having 2 flatMap operations. I'd assume that you have many more users than items (or vice versa), which is why one of the operations generates more data.

Is there anyone who can explain why the function of ALS.train give different shuffle results when execute the same transformation flatMap

2014-06-25 Thread Lizhengbing (bing, BIPA)
Sometimes the shuffle write of flatMap is 14.8G and sometimes it is 647.9M. Why does this happen? The size of the training data is about 1.5G, and the feature number is 200. Stage Id Description Submitted Duration Tasks: Succeeded/Total Shuffle Read Shuffle Write 114 flatMap at ALS.scala:434 2014/

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-25 Thread lmk
Hi Alexander, Just one more question on a related note. Should I be following the same procedure even if my data is nominal (categorical), but having a lot of combinations? (In Weka I used to have it as nominal data) Regards, -lmk -- View this message in context: http://apache-spark-user-list.

Re: Using Spark as web app backend

2014-06-25 Thread Peng Cheng
Totally agree; also, there is a class 'SparkSubmit' you can call directly to replace the shell script. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-as-web-app-backend-tp8163p8248.html Sent from the Apache Spark User List mailing list archive at Nabbl

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
I'm running a very small job (16 partitions, 2 stages) on a 2-node cluster, each with 15G memory, the master page looks all normal: URL: spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077 Workers: 1 Cores: 2 Total, 2 Used Memory: 13.9 GB Total, 512.0 MB Used Applications: 1 Running, 0 Completed

Re: Spark slave fail to start with wierd error information

2014-06-25 Thread Peng Cheng
Sorry, I just realized that start-slave is for a different task. Please close this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8246.html Sent from the Apache Spark User List mailing list archive

Re: Using Spark as web app backend

2014-06-25 Thread Eugen Cepoi
Yeah, I agree with Koert, it would be the lightest solution. I have used it quite successfully and it just works. There is not much Spark-specific here; you can follow this example https://github.com/jacobus/s4 on how to build your spray service. Then the easy solution would be to have a SparkCont
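A hedged, minimal sketch of that arrangement (the master URL, HDFS path, port, and route are placeholders; SimpleRoutingApp is the spray convenience trait for standing up a route quickly):

import akka.actor.ActorSystem
import org.apache.spark.{SparkConf, SparkContext}
import spray.routing.SimpleRoutingApp

object WebBackend extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("web-backend")

  // One long-lived SparkContext in client mode, shared across HTTP requests.
  val sc = new SparkContext(
    new SparkConf().setAppName("web-backend").setMaster("spark://master:7077"))
  val events = sc.textFile("hdfs:///data/events").cache()

  startServer(interface = "0.0.0.0", port = 8080) {
    path("count") {
      get {
        complete(events.count().toString)   // each request triggers a Spark job
      }
    }
  }
}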

Re: how to make saveAsTextFile NOT split output into multiple file?

2014-06-25 Thread randylu
rdd.coalesce() will take effect: rdd.coalesce(1, true).saveAsTextFile(save_path) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-make-saveAsTextFile-NOT-split-output-into-multiple-file-tp8129p8244.html Sent from the Apache Spark User List mailing

Need help to make spark sql works in stand alone application

2014-06-25 Thread Jaonary Rabarisoa
Hi all, I'm trying to use Spark SQL to store data in a Parquet file. I create the file and insert data into it with the following code: val conf = new SparkConf().setAppName("MCT").setMaster("local[2]") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc)
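For context, a hedged sketch of the kind of flow the truncated snippet is starting (the Point class, paths, and query are made up; the Spark SQL 1.0 API names saveAsParquetFile / parquetFile / registerAsTable are assumed):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Point(x: Double, y: Double)

val conf = new SparkConf().setAppName("MCT").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit conversion from an RDD of case classes to a SchemaRDD

val points = sc.parallelize(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))
points.saveAsParquetFile("points.parquet")

val loaded = sqlContext.parquetFile("points.parquet")
loaded.registerAsTable("points")
sqlContext.sql("SELECT x FROM points WHERE y > 1").collect().foreach(println)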

Re: Using Spark as web app backend

2014-06-25 Thread Jaonary Rabarisoa
Hi all, Thank you for the reply. Is there any example of Spark running in client mode with spray? I think I will choose this approach. On Tue, Jun 24, 2014 at 4:55 PM, Koert Kuipers wrote: > run your spark app in client mode together with a spray rest service, that > the front end can talk t