spark-submit and logging

2014-11-20 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application jar to a YARN cluster. I want to deliver a single jar file to my users, so I would like to avoid also having to tell them: please put that log4j.xml file somewhere and add that path to the spark-submit command. I thought it would be sufficient that

SparkSQL exception handling

2014-11-20 Thread Daniel Haviv
Hi, I'm loading a bunch of json files and there seems to be problems with specific files (either schema changes or incomplete files). I'd like to catch the inconsistent files but I'm not sure how to do it. This is the exception I get: 14/11/20 00:13:49 INFO cluster.YarnClientClusterScheduler:

RDD Action require data from Another RDD

2014-11-20 Thread nsareen
Hi, We have a requirement where we have two data sets represented by RDDs, RDDA and RDDB. For performing an aggregation operation on RDDA, the action would need RDDB's subset of data. Wanted to understand if there is a best practice for doing this? Don't even know how this will be possible as of
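A common way to handle this (a minimal sketch with made-up data and names, assuming a SparkContext sc as in the shell and that the subset of RDDB the aggregation needs is small enough to collect to the driver) is to materialize that subset and broadcast it, so each task working on RDDA can read it locally:

// Hypothetical data: rddA holds (key, value) pairs to aggregate,
// rddB holds the reference values the aggregation has to consult.
val rddA = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
val rddB = sc.parallelize(Seq(("a", 10.0), ("b", 0.5)))

// Collect the (small) subset of rddB on the driver and broadcast it.
val weights = sc.broadcast(rddB.collectAsMap())

val aggregated = rddA
  .map { case (k, v) => (k, v * weights.value.getOrElse(k, 1.0)) }
  .reduceByKey(_ + _)

aggregated.collect().foreach(println)

If the needed subset of RDDB is too large to collect, a join between the two RDDs (keyed appropriately) is the usual alternative.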

Re: Optimizing text file parsing, many small files versus few big files

2014-11-20 Thread rzykov
You could use combineTextFile from https://github.com/RetailRocket/SparkMultiTool It combines input files before mappers by means of Hadoop CombineFileInputFormat. In our case it reduced the number of mappers from 10 to approx 3000 and made job significantly faster. Example: import

Re: tableau spark sql cassandra

2014-11-20 Thread jererc
Hi! The hive table is an external table, which I created like this: CREATE EXTERNAL TABLE MyHiveTable ( id int, data string ) STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' TBLPROPERTIES ("cassandra.host" = "10.194.30.2", "cassandra.ks.name" = "test",

Re: Naive Baye's classification confidence

2014-11-20 Thread Sean Owen
I assume that all examples do actually fall into exactly one of the classes. If you always have to make a prediction then you always take the most probable class. If you can choose to make no classification for lack of confidence, yes you want to pick a per-class threshold and take the most
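A minimal sketch of that per-class thresholding, assuming you already have posterior probabilities per class for a given example (how you extract them from your model is a separate question); the threshold values here are made up:

// Hypothetical posteriors for one example: class label -> probability.
val posteriors = Map("C1" -> 0.85, "C2" -> 0.15)

// Per-class thresholds, e.g. tuned on a validation set (assumed values).
val thresholds = Map("C1" -> 0.70, "C2" -> 0.60)

val (bestClass, bestProb) = posteriors.maxBy(_._2)

// Take the most probable class only if it clears its own threshold,
// otherwise abstain from classifying this example.
val prediction: Option[String] =
  if (bestProb >= thresholds.getOrElse(bestClass, 1.0)) Some(bestClass) else None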

Re: Naive Baye's classification confidence

2014-11-20 Thread jatinpreet
Thanks a lot Sean. You are correct in assuming that my examples fall under a single category. It is interesting to see that the posterior probability can actually be treated as something that is stable enough to have a constant threshold value on per class basis. It would, I assume, keep changing

Re: SparkSQL exception handling

2014-11-20 Thread Daniel Haviv
Update: I tried surrounding the problematic code with try and catch but that does not do the trick: try { val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ val jsonFiles = sqlContext.jsonFile("/requests.loading") } catch { case _: Throwable => // Catching all exceptions and

Re: Is it possible to save the streams to one single file?

2014-11-20 Thread Akhil Das
Can you not use the *hadoop getmerge* option to merge the files afterwards? In older versions of HDFS files are immutable, meaning once you close a file you won't be able to modify it. In the newer versions they have support for it. Code below writes the output to one directory (deletes the previous
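If you would rather merge from code than shell out to hadoop fs -getmerge, Hadoop's FileUtil.copyMerge does the equivalent; a sketch with placeholder paths (this is the Hadoop 2.x signature; the method was removed in Hadoop 3):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Merge all part-* files in the Spark output directory into one file.
FileUtil.copyMerge(
  fs, new Path("/output/batch-dir"),   // directory written by Spark
  fs, new Path("/output/merged.txt"),  // single destination file
  false,                               // don't delete the source directory
  conf, null)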

Re: spark-submit and logging

2014-11-20 Thread Sean Owen
I think the standard practice is to include your log config file among the files uploaded to YARN containers, and then set -Dlog4j.configuration=yourfile.xml in spark.{executor,driver}.extraJavaOptions ? http://spark.apache.org/docs/latest/running-on-yarn.html On Thu, Nov 20, 2014 at 9:20 AM,

Re: Naive Baye's classification confidence

2014-11-20 Thread Sean Owen
Yes, certainly you need to consider the problem of how and when you update the model with new info. The principle is the same. Low or high posteriors aren't wrong per se. It seems normal in fact that one class is more probable than others, maybe a lot more. On Thu, Nov 20, 2014 at 10:31 AM,

Re: Is it possible to save the streams to one single file?

2014-11-20 Thread Sean Owen
You can repartition to 1 partition to generate 1 output file, but, that has other potentially bad implications for your processing. You would be using 1 partition and 1 worker only for the final stages of processing. On Thu, Nov 20, 2014 at 11:32 AM, jishnu.prat...@wipro.com wrote: Hi

Error while submitting spark streaming job in YARN

2014-11-20 Thread Tapas Swain
Hi All, I am working on a spark streaming job. The job is supposed to read streaming data from Kafka, but after submitting the job it shows org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! and the following lines are getting printed

Re: Naive Baye's classification confidence

2014-11-20 Thread jatinpreet
Sean, My last sentence didn't come out right. Let me try to explain my question again. For instance, I have two categories, C1 and C2. I have trained 100 samples for C1 and 10 samples for C2. Now, I predict two samples one each of C1 and C2, namely S1 and S2 respectively. I get the following

RDD memory and storage level option

2014-11-20 Thread Tsai Li Ming
Hi, This is on version 1.1.0. I did a simple test on the MEMORY_AND_DISK storage level. var file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK) file.count() The file is 1.5GB and there is only 1 worker. I have requested 1GB of worker memory per node:

Re: Naive Baye's classification confidence

2014-11-20 Thread Sean Owen
Up-sampling one class would change the result, yes. The prior for C2 would increase and so does its posterior across the board. But that is dramatically changing your input: C2 isn't as prevalent as C1, but you are pretending it is. Their priors aren't the same. If you want to assume a uniform

Re: Optimizing text file parsing, many small files versus few big files

2014-11-20 Thread tvas
Hello Rzykov, I tried your approach but unfortunately I'm getting an error. What I get is: [info] Caused by: java.lang.reflect.InvocationTargetException [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [info] at

Re: Naive Baye's classification confidence

2014-11-20 Thread jatinpreet
I believe assuming uniform priors is the way to go for my use case. I am not sure about how to 'drop the prior term' with Mllib. I am just providing the samples as they come after creating term vectors for each sample. But I guess I can Google that information. I appreciate all the help. Spark

Slow performance in spark streaming

2014-11-20 Thread Blackeye
I am using spark streaming 1.1.0 locally (not in a cluster). I created a simple app that parses the data (about 10.000 entries), stores it in a stream and then makes some transformations on it. Here is the code: def main(args: Array[String]) { val master = "local[8]" val conf = new

MLIB KMeans Exception

2014-11-20 Thread Alan Prando
Hi Folks! I'm running a Python Spark job on a cluster with 1 master and 10 slaves (64G RAM and 32 cores each machine). This job reads a file with 1.2 terabytes and 1128201847 lines on HDFS and call Kmeans method as following: # SLAVE CODE - Reading features from HDFS def

Please help me get started on Apache Spark

2014-11-20 Thread Saurabh Agrawal
Friends, I am pretty new to Spark as much as to Scala, MLib and the entire Hadoop stack!! It would be so much help if I could be pointed to some good books on Spark and MLib? Further, does MLib support any algorithms for B2B cross sell/ upsell or customer retention (out of the box

Re: Please help me get started on Apache Spark

2014-11-20 Thread Darin McBeath
Take a look at the O'Reilly Learning Spark (Early Release) book.  I've found this very useful. Darin. From: Saurabh Agrawal saurabh.agra...@markit.com To: user@spark.apache.org user@spark.apache.org Sent: Thursday, November 20, 2014 9:04 AM Subject: Please help me get started on Apache

Re: Please help me get started on Apache Spark

2014-11-20 Thread Guibert. J Tchinde
For Spark, You can start with a new book like : https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch01.html I think the paper book is out now, You can also have a look on tutorials documentation guide available on : https://spark.apache.org/docs/1.1.0/mllib-guide.html

Re: 1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-20 Thread Chitturi Padma
Hi, I tried with try-catch blocks. In fact, inside mapPartitionsWithIndex a method is invoked which does the operation. I put the operations inside the function in a try...catch block but that's of no use; the error still persists. Even when I commented out all the operations and left a simple print statement

Spark doesn't kill worker process after failing on Yarn

2014-11-20 Thread rzykov
Dear all, We encountered a problem with failed Spark jobs. We have a Spark/Hadoop cluster - CDH 5.1.2 + Spark 1.1 After launching a spark job with command: ~/soft/spark-1.1.0-bin-hadoop2.3/bin/spark-submit --master yarn-cluster --executor-memory 4G --driver-memory 4G --class

Re: tableau spark sql cassandra

2014-11-20 Thread jererc
I finally solved this problem. The org.apache.hadoop.mapreduce.JobContext is a class in hadoop < 2.0 and is an interface in hadoop >= 2.0. I have to use a spark build for hadoop v1. So spark-sql seems fine. But, the thriftserver does not work with my config! Here is my spark-env.sh:

Re: NEW to spark and sparksql

2014-11-20 Thread Sam Flint
So you are saying to query an entire day of data I would need to create one RDD for every hour and then union them into one RDD. After I have the one RDD I would be able to query for a=2 throughout the entire day. Please correct me if I am wrong. Thanks On Wed, Nov 19, 2014 at 5:53 PM,

Spark SQL Exception handling

2014-11-20 Thread Daniel Haviv
Hi Guys, I really need your help with this: I'm loading a bunch of json files uploaded via webhdfs; some of them have some inconsistencies (the json ends mid-line for example) and that causes my whole application to fail. How can I continue processing valid json records without failing the

Best way to store RDD data?

2014-11-20 Thread RJ Nowling
Hi all, I'm working on an application that has several tables (RDDs of tuples) of data. Some of the types are complex-ish (e.g., date time objects). I'd like to use something like case classes for each entry. What is the best way to store the data to disk in a text format without writing custom

Windows+Pyspark+YARN

2014-11-20 Thread Ángel Álvarez Pascua
https://issues.apache.org/jira/browse/SPARK-1825 I've had the following problems to make Windows+Pyspark+YARN work properly: 1. net.ScriptBasedMapping: Exception running /etc/hadoop/conf.cloudera.yarn/topology.py FIX? Comment the net.topology.script.file.name property configuration in the file

Re: spark-submit and logging

2014-11-20 Thread Matt Narrell
How do I configure the files to be uploaded to YARN containers? So far, I’ve only seen --conf spark.yarn.jar=hdfs://… which allows me to specify the HDFS location of the Spark JAR, but I’m not sure how to prescribe other files for uploading (e.g., spark-env.sh) mn On Nov 20, 2014, at 4:08

Re: spark-shell giving me error of unread block data

2014-11-20 Thread Anson Abraham
Didn't really edit the configs as much .. but here's what the spark-env.sh is: #!/usr/bin/env bash ## # Generated by Cloudera Manager and should not be modified directly ## export SPARK_HOME=/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark export

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
Also attaching the parquet file if anyone wants to take a further look. On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood sadhan.s...@gmail.com wrote: So, I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when

Re: Spark Streaming not working in YARN mode

2014-11-20 Thread Akhil Das
Cool On 20 Nov 2014 22:01, kam lee cloudher...@gmail.com wrote: Yes, fixed by setting --executor-cores to 2 or higher. Thanks a lot! Really appreciate it! cloud On Wed, Nov 19, 2014 at 10:48 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Make sure the executor cores are set to a value

Spark S3 Performance

2014-11-20 Thread Nitay Joffe
I have a simple S3 job to read a text file and do a line count. Specifically I'm doing sc.textFile("s3n://mybucket/myfile").count. The file is about 1.2GB. My setup is a standalone spark cluster with 4 workers, each with 2 cores / 16GB ram. I'm using branch-1.2 code built against hadoop 2.4 (though

Incremental loading data slows performance

2014-11-20 Thread Gordon Benjamin
Hi, We are seeing bad performance as we incrementally load data. Here is the config Spark standalone cluster spark01 (spark master, shark, hadoop namenode): 15GB RAM, 4vCPU's spark02 (spark worker, hadoop datanode): 15GB RAM, 8vCPU's spark03 (spark worker): 15GB RAM, 8vCPU's spark04 (spark

ClassNotFoundException in standalone mode

2014-11-20 Thread Benoit Pasquereau
Hi Guys, I'm having an issue in standalone mode (Spark 1.1, Hadoop 2.4, Windows Server 2008). A very simple program runs fine in local mode but fails in standalone mode. Here is the error: 14/11/20 17:01:53 INFO DAGScheduler: Failed to run count at SimpleApp.scala:22 Exception in thread main

Re: tableau spark sql cassandra

2014-11-20 Thread jererc
Well, after many attempts I can now successfully run the thrift server using root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 --hiveconf hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port 1 (the command was failing because of the

Re: Getting spark job progress programmatically

2014-11-20 Thread andy petrella
Awesome! And Patrick just gave his LGTM ;-) On Wed Nov 19 2014 at 5:13:17 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Thanks for pointing this out Mark. Had totally missed the existing JIRA items On Wed Nov 19 2014 at 21:42:19 Mark Hamstra m...@clearstorydata.com wrote: This is

Re: Incremental loading data slows performance

2014-11-20 Thread Gordon Benjamin
To follow on: I asked the developer how we incrementally load data and the response was no. union only for updated records (every night) For every minutes export algorithm next: 1. upload file to hadoop. 2. load data inpath... overwrite into table _incremental; 3. insert into table ..._cached

Re: How to incrementally compile spark examples using mvn

2014-11-20 Thread Marcelo Vanzin
Hi Yiming, On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the spark project as a whole, how

Re: MLIB KMeans Exception

2014-11-20 Thread Xiangrui Meng
How many features and how many partitions? You set kmeans_clusters to 1. If the feature dimension is large, it would be really expensive. You can check the WebUI and see task failures there. The stack trace you posted is from the driver. Btw, the total memory you have is 64GB * 10, so you can

RE: tableau spark sql cassandra

2014-11-20 Thread Ashic Mahtab
Hi Jerome, I've been trying to get this working as well... Where are you specifying cassandra parameters (i.e. seed nodes, consistency levels, etc.)? -Ashic. Date: Thu, 20 Nov 2014 10:34:58 -0700 From: jer...@gmail.com To: u...@spark.incubator.apache.org Subject: Re: tableau spark sql

How can I read this avro file using spark scala?

2014-11-20 Thread al b
I've read several posts of people struggling to read avro in spark. The examples I've tried don't work. When I try this solution ( https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files) I get errors: spark

Re: How can I read this avro file using spark scala?

2014-11-20 Thread Michael Armbrust
One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. This is very new, but we would love to get feedback: https://github.com/databricks/spark-avro On Thu, Nov 20, 2014 at 10:19 AM, al b beanb...@googlemail.com wrote: I've read several

Adding partitions to parquet data

2014-11-20 Thread Sadhan Sood
We are loading parquet data as temp tables but wondering if there is a way to add a partition to the data without going through hive (we still want to use spark's parquet serde as compared to hive). The data looks like - /date1/file1, /date1/file2 ... , /date2/file1, /date2/file2,/daten/filem

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Michael Armbrust
Which version are you running on again? On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood sadhan.s...@gmail.com wrote: Also attaching the parquet file if anyone wants to take a further look. On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood sadhan.s...@gmail.com wrote: So, I am seeing this issue

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
I am running on master, pulled yesterday I believe but saw the same issue with 1.2.0 On Thu, Nov 20, 2014 at 1:37 PM, Michael Armbrust mich...@databricks.com wrote: Which version are you running on again? On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood sadhan.s...@gmail.com wrote: Also

Re: Spark SQL Exception handling

2014-11-20 Thread Michael Armbrust
If you run master or the 1.2 preview release then it should automatically skip lines that fail to parse. The corrupted text will be in the column _corrupted_record and the other columns will be null. On Thu, Nov 20, 2014 at 7:34 AM, Daniel Haviv danielru...@gmail.com wrote: Hi Guys, I really
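A sketch of how that can be used to keep only the clean rows, assuming the column in your build is called _corrupt_record (the exact name has varied, so check the inferred schema with printSchema first):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val requests = sqlContext.jsonFile("/requests.loading")
requests.printSchema()                  // confirm the corrupt-record column name
requests.registerTempTable("requests")

// Corrupt lines have all real columns null and the raw text in _corrupt_record.
val valid = sqlContext.sql("SELECT * FROM requests WHERE _corrupt_record IS NULL")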

Re: NEW to spark and sparksql

2014-11-20 Thread Michael Armbrust
I believe functions like sc.textFile will also accept paths with globs for example /data/*/ which would read all the directories into a single RDD. Under the covers I think it is just using Hadoop's FileInputFormat, in case you want to google for the full list of supported syntax. On Thu, Nov 20,
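For example, with a hypothetical one-directory-per-hour layout, the whole day can be loaded as a single RDD:

// All hours of 2014-11-20 in one RDD, no explicit union needed.
val day = sc.textFile("hdfs:///logs/2014-11-20/*")

// Globs can span several path components too, e.g. a whole month.
val month = sc.textFile("hdfs:///logs/2014-11-*/*")

println(day.filter(_.contains("a=2")).count())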

Re: Spark SQL Exception handling

2014-11-20 Thread Daniel Haviv
Thanks I'll try it out! On Thu, Nov 20, 2014 at 8:39 PM, Michael Armbrust mich...@databricks.com wrote: If you run master or the 1.2 preview release then it should automatically skip lines that fail to parse. The corrupted text will be in the column _corrupted_record and the other columns

RE: tableau spark sql cassandra

2014-11-20 Thread Mohammed Guller
Hi Jerome, This is cool. It would be great if you could share more details about how you finally got your setup to work. For example, what additional libraries/jars you are using. How are you configuring the ThriftServer to use the additional jars to communicate with Cassandra? In addition, how

Re: spark-submit and logging

2014-11-20 Thread Marcelo Vanzin
Check the --files argument in the output of spark-submit -h. On Thu, Nov 20, 2014 at 7:51 AM, Matt Narrell matt.narr...@gmail.com wrote: How do I configure the files to be uploaded to YARN containers? So far, I’ve only seen --conf spark.yarn.jar=hdfs://… which allows me to specify the HDFS

[GraphX] Mining GeoData (OSM)

2014-11-20 Thread andy petrella
Guys, After talking with Ankur, it turned out that sharing the talk we gave at ScalaIO (France) would be worthwhile. So there you go, and don't hesitate to share your thoughts ;-) http://www.slideshare.net/noootsab/machine-learning-and-graphx Greetz, andy

Re: spark-submit and logging

2014-11-20 Thread Marcelo Vanzin
Hi Tobias, With the current Yarn code, packaging the configuration in your app's jar and adding the -Dlog4j.configuration=log4jConf.xml argument to the extraJavaOptions configs should work. That's not the recommended way for get it to work, though, since this behavior may change in the future.

Re: Adding partitions to parquet data

2014-11-20 Thread Sadhan Sood
Ah awesome, thanks!! On Thu, Nov 20, 2014 at 3:01 PM, Michael Armbrust mich...@databricks.com wrote: In 1.2 by default we use Spark parquet support instead of Hive when the SerDe contains the word Parquet. This should work with hive partitioning. On Thu, Nov 20, 2014 at 10:33 AM, Sadhan

Re: Best way to store RDD data?

2014-11-20 Thread Michael Armbrust
With SchemaRDD you can save out case classes to parquet (or JSON in Spark 1.2) automatically, and when you read it back in the structure will be preserved. However, you won't get case classes when it's loaded back; instead you'll get rows that you can query. There is some experimental support for
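A minimal sketch of that round trip on Spark 1.1/1.2, assuming a simple case class (complex fields such as date-times would first need to be mapped to supported types, e.g. epoch millis):

import org.apache.spark.sql.SQLContext

case class Record(id: Long, name: String, timestampMillis: Long)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

val records = sc.parallelize(Seq(Record(1L, "a", 1416441600000L)))

// The schema is derived from the case class and stored with the data.
records.saveAsParquetFile("/data/records.parquet")

// Reading it back yields Rows (not Records), with the structure preserved.
val loaded = sqlContext.parquetFile("/data/records.parquet")
loaded.registerTempTable("records")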

How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Blind Faith
Say I have two RDDs with the following values x = [(1, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)]. How can I achieve this? I know you can use outerJoin followed by a map to achieve this, but is there a more direct way?

Debug Sql execution

2014-11-20 Thread Gordon Benjamin
Hey, can anyone tell me how to debug a SQL execution? Perhaps so it can show what the query is doing and how long it takes at each point?

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
You want to use RDD.union (or SparkContext.union for many RDDs). These don't join on a key. Union doesn't really do anything itself, so it is low overhead. Note that the combined RDD will have all the partitions of the original RDDs, so you may want to coalesce after the union. val x =
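Completing that as a short sketch:

val x = sc.parallelize(Seq((1, 3), (2, 4)))
val y = sc.parallelize(Seq((3, 5), (4, 7)))

// union just concatenates partitions; nothing is joined or shuffled.
val z = x.union(y)              // or sc.union(Seq(x, y)) for many RDDs

// Optionally reduce the partition count afterwards.
val zCompact = z.coalesce(2)

println(z.collect().toList)     // List((1,3), (2,4), (3,5), (4,7))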

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
Thanks Michael, opened this https://issues.apache.org/jira/browse/SPARK-4520 On Thu, Nov 20, 2014 at 2:59 PM, Michael Armbrust mich...@databricks.com wrote: Can you open a JIRA? On Thu, Nov 20, 2014 at 10:39 AM, Sadhan Sood sadhan.s...@gmail.com wrote: I am running on master, pulled

Spark Streaming Metrics

2014-11-20 Thread Gerard Maas
As the Spark Streaming tuning guide indicates, the key indicators of a healthy streaming job are: - Processing Time - Total Delay The Spark UI page for the Streaming job [1] shows these two indicators but the metrics source for Spark Streaming (StreamingSource.scala) [2] does not. Any reasons

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Harihar Nahak
I've a similar type of issue; I want to join two different types of RDD into one RDD. file1.txt content (ID, counts): val x: RDD[(Long, Int)] = sc.textFile("file1.txt").map(line => line.split(",")).map(row => (row(0).toLong, row(1).toInt)) [(4407 ,40), (2064, 38), (7815 ,10), (5736,17), (8031,3)] Second RDD

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Haviv
There is probably a better way to do it but I would register both as temp tables and then join them via SQL. BR, Daniel On 20 Nov 2014, at 23:53, Harihar Nahak hna...@wynyardgroup.com wrote: I've similar type of issue, want to join two different type of RDD in one RDD file1.txt

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
Harihar, your question is the opposite of what was asked. In the future, please start a new thread for new questions. You want to do a join in your case. The join function does an inner join, which I think is what you want because you stated your IDs are common in both RDDs. For other cases you

Re: Lost executors

2014-11-20 Thread Pala M Muthaia
Just to close the loop, it seems no issues pop up when i submit the job using 'spark submit' so that the driver process also runs on a container in the YARN cluster. In the above, the driver was running on the gateway machine through which the job was submitted, which led to quite a few issues.

Re: k-means clustering

2014-11-20 Thread Jun Yang
Guys, As to the questions of pre-processing, you could just migrate your logic to Spark before using K-means. I only used Scala on Spark, and haven't used Python binding on Spark, but I think the basic steps must be the same. BTW, if your data set is big with huge sparse dimension feature

Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all, I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and idf RDD[Vector]s, using the code below. But how do I get this back to words and their counts and tf-idf values for presentation? val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Harihar Nahak
Thanks Daniel, Applied Join from PairedRDD val countByUsername = file1.join(file2).map { case (id, (username, count)) => (id, username, count) } - --Harihar -- View this message in context:

logging in workers for pyspark

2014-11-20 Thread freedafeng
Hi, I am wondering how to write logging info in a worker when running a pyspark app. I saw the thread http://apache-spark-user-list.1001560.n3.nabble.com/logging-in-pyspark-td5458.html but did not see a solution. Anybody know a solution? Thanks! -- View this message in context:

Re: Using TF-IDF from MLlib

2014-11-20 Thread andy petrella
Someone will correct me if I'm wrong. Actually, TF-IDF scores terms for a given document, and specifically TF. Internally, these things are holding a Vector (hopefully sparse) representing all the possible words (up to 2²⁰) per document. So each document, after applying TF, will be transformed in
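For reference, a sketch of the MLlib 1.1 calls involved, assuming the documents are already tokenized; since HashingTF maps terms to vector indices by hashing, getting back from scores to words means hashing the words you care about with the same HashingTF:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical input: one Seq of tokens per sentence/document.
val docs: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "mllib", "tfidf"),
  Seq("spark", "streaming")))

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)

tf.cache()
val idfModel = new IDF().fit(tf)
val tfidf: RDD[Vector] = idfModel.transform(tf)

// To read off the score of a specific word, hash it to its index first.
val sparkIdx = hashingTF.indexOf("spark")
tfidf.take(2).foreach(v => println(v(sparkIdx)))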

RE: How to view log on yarn-client mode?

2014-11-20 Thread innowireless TaeYun Kim
Thank you. And setting yarn.log-aggregation-enable in yarn-site.xml to true was the key. It’s somewhat inconvenient that I must use ‘yarn logs’ rather than using YARN resource manager web UI after the app has completed (that is, it seems that the history server is not usable for Spark jobs),

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Michael Armbrust
Looks like IntelliJ might be trying to load the wrong version of Spark? On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys I am at AmpCamp 2014 at UCB right now :-) Funny Issue... This code works in Spark-Shell but throws a funny

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Jay Vyas
This seems pretty standard: your IntelliJ classpath isn't matched to the correct ones that are used in spark shell Are you using the SBT plugin? If not how are you putting deps into IntelliJ? On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote:

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Sanjay Subramanian
Awesome, that was it... Hit me with a hockey stick :-) Unmatched Spark Core (1.0.0) and SparkSql (1.1.1) versions. Corrected that to 1.1.0 on both: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.0.0</version> </dependency> <dependency>

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Sanjay Subramanian
Not using SBT... I have been creating and adapting various Spark Scala examples and put them here; all you have to do is git clone and import as a maven project into IntelliJ: https://github.com/sanjaysubramanian/msfx_scala.git Sidenote, IMHO, IDEs encourage the new to Spark/Scala developers to

Re: How to view log on yarn-client mode?

2014-11-20 Thread Sandy Ryza
I agree that using yarn logs is cumbersome. We're working to improve this in future releases. On Thu, Nov 20, 2014 at 4:31 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: Thank you. And setting yarn.log-aggregation-enable in yarn-site.xml to true was the key. It’s

Spark failing when loading large amount of data

2014-11-20 Thread SK
Hi, I am using sc.textFile("shared_dir/*") to load all the files in a directory on a shared partition. The total size of the files in this directory is 1.2 TB. We have a 16 node cluster with 3 TB memory (1 node is driver, 15 nodes are workers). But the loading fails after around 1 TB of data is

beeline via spark thrift doesn't retain cache

2014-11-20 Thread Judy Nash
Hi friends, I have successfully set up the thrift server and can execute beeline on top. Beeline can handle select queries just fine, but it cannot seem to do any kind of caching/RDD operations. i.e. 1) The "cache table" command doesn't work. See error: Error: Error while processing statement: FAILED:

Re: ClassNotFoundException in standalone mode

2014-11-20 Thread angel2014
Can you make sure the class SimpleApp$$anonfun$1 is included in your app jar? 2014-11-20 18:19 GMT+01:00 Benoit Pasquereau [via Apache Spark User List] ml-node+s1001560n19391...@n3.nabble.com: Hi Guys, I’m having an issue in standalone mode (Spark 1.1, Hadoop 2.4, Windows Server 2008).

Re: logging in workers for pyspark

2014-11-20 Thread angel2014
I've tried starting a Log Server on the driver according with this doc https://docs.python.org/2/howto/logging-cookbook.html so the workers could connect and send their logs to this remote server. If I remember correctly ... I didn't get any error messages and the job worked fine, but ... the

Re: MongoDB Bulk Inserts

2014-11-20 Thread Soumya Simanta
On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com wrote: I'm trying to use MongoDB as a destination for an ETL I'm writing in Spark. It appears I'm gaining a lot of overhead in my system databases (and possibly in the primary documents themselves); I can only assume

Another accumulator question

2014-11-20 Thread Nathan Kronenfeld
I think I understand what is going on here, but I was hoping someone could confirm (or explain reality if I don't) what I'm seeing. We are collecting data using a rather sizable accumulator - essentially, an array of tens of thousands of entries. All told, about 1.3m of data. If I understand
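For reference, a sketch of the kind of accumulator being described, with hypothetical types and sizes (the original job's are not shown); each task ships its merged copy back to the driver, which is where the per-task serialization cost comes from:

import org.apache.spark.AccumulatorParam

// Element-wise sum of fixed-length Long arrays.
object ArraySumParam extends AccumulatorParam[Array[Long]] {
  def zero(initial: Array[Long]): Array[Long] = new Array[Long](initial.length)
  def addInPlace(a: Array[Long], b: Array[Long]): Array[Long] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }
}

val counts = sc.accumulator(new Array[Long](50000))(ArraySumParam)

sc.parallelize(0 until 1000000).foreach { n =>
  val contribution = new Array[Long](50000)
  contribution(n % 50000) += 1L
  counts += contribution
}

println(counts.value.sum)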

Re: Spark failing when loading large amount of data

2014-11-20 Thread Preeti Khurana
The disk size will not correctly reflect the size of the memory needed to store that data. Memory depends a lot on the data structures you are using. The objects that hold the data shall have their own overheads. Try with reduced input size or more memory. On 21/11/14 8:26 am, SK

Re: Persist streams to text files

2014-11-20 Thread Akhil Das
To have a single text file output for each batch you can repartition it to 1 and then call saveAsTextFiles: stream.repartition(1).saveAsTextFiles("location") On 21 Nov 2014 11:28, jishnu.prat...@wipro.com wrote: Hi I am also having similar problem.. any fix suggested.. *Originally Posted

RE: Persist streams to text files

2014-11-20 Thread jishnu.prathap
Hi Akhil, Thanks for the reply, but it creates different directories... I tried using FileWriter but it shows a non-serializable error... val stream = TwitterUtils.createStream(ssc, None) //, filters) val statuses = stream.map(status => sentimentAnalyzer.findSentiment({

Re: ClassNotFoundException in standalone mode

2014-11-20 Thread Yanbo Liang
Looks like it cannot find the class or jar on your driver machine. Are you sure that the corresponding jar file exists on the driver machine rather than your development machine? 2014-11-21 11:16 GMT+08:00 angel2014 angel.alvarez.pas...@gmail.com: Can you make sure the class SimpleApp$$anonfun$1 is

Spark serialization issues with third-party libraries

2014-11-20 Thread jatinpreet
Hi, I am planning to use UIMA library to process data in my RDDs. I have had bad experiences while using third party libraries inside worker tasks. The system gets plagued with Serialization issues. But as UIMA classes are not necessarily Serializable, I am not sure if it will work. Please

Re: Persist streams to text files

2014-11-20 Thread Akhil Das
Here's a quick version to store (append) on your local machine val tweets = TwitterUtils.createStream(ssc, None) val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#"))) hashTags.foreachRDD(rdds => { rdds.foreach(rdd => { val fw = new
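A slightly more complete sketch of the same idea, pulling each batch to the driver and appending to one local file (only sensible for small batches on a single machine; the FileWriter is created inside the closure so nothing non-serializable is shipped around):

import java.io.FileWriter

hashTags.foreachRDD { rdd =>
  val fw = new FileWriter("/tmp/hashtags.txt", true)   // true = append
  try rdd.collect().foreach(tag => fw.write(tag + "\n"))
  finally fw.close()
}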

processing files

2014-11-20 Thread Philippe de Rochambeau
Hello, I need to develop an application which: - reads xml files in thousands of directories, two levels down, from year x to year y - extracts data from image tags in those files and stores them in a Sql or NoSql database - generates ImageMagick commands based on the extracted data to
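For the reading step, one option (a sketch, assuming a hypothetical /archive/<year>/<subdir>/*.xml layout) is sc.wholeTextFiles, which gives the full content of each file so the XML can be parsed as a unit:

// Each element is (path, full file content).
val xmlFiles = sc.wholeTextFiles("/archive/{2005,2006,2007}/*/*.xml")

val imageTags = xmlFiles.flatMap { case (path, content) =>
  // Hypothetical extraction; use a real XML parser in practice,
  // a regex only keeps this sketch short.
  "<image[^>]*/>".r.findAllIn(content).map(tag => (path, tag))
}

imageTags.take(5).foreach(println)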