Re: Persist streams to text files

2014-11-21 Thread Prannoy
Hi, you can use the FileUtil.copyMerge API and specify the path to the folder where saveAsTextFile saves the part text files. Suppose your directory is /a/b/c/; use FileUtil.copyMerge(FileSystem of source, a/b/c, FileSystem of destination, path to the merged file, say a/b/c.txt, true (to delete
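A minimal sketch of that call, assuming Hadoop 1.x/2.x where FileUtil.copyMerge is still available; the paths are the placeholders from the example above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val hadoopConf = new Configuration()
    val fs = FileSystem.get(hadoopConf)

    // Merge all part-* files under /a/b/c into a single /a/b/c.txt,
    // deleting the source directory afterwards (the `true` flag).
    FileUtil.copyMerge(
      fs, new Path("/a/b/c"),
      fs, new Path("/a/b/c.txt"),
      true, hadoopConf, null)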

Re: beeline via spark thrift doesn't retain cache

2014-11-21 Thread Yanbo Liang
1) Make sure your beeline client is connected to the HiveServer2 of Spark SQL. You can find the execution logs of HiveServer2 in the environment where start-thriftserver.sh was run. 2) What is the scale of your data? If you cache a small dataset, it will take more time to schedule the workload between different executors. Look

RE: Persist streams to text files

2014-11-21 Thread jishnu.prathap
Hi, thank you ☺ Akhil, it worked like a charm… I had created the file writer outside rdd.foreach, which might be the reason for the non-serializable exception… Thanks and regards, Jishnu Menath Prathap From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Friday, November 21, 2014 1:15 PM To: Jishnu Menath
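For reference, a sketch of the pattern that avoids this serialization problem: create the writer inside foreachPartition, so it is constructed on the executor rather than captured from the driver. `stream` stands for the DStream from the thread and the output path is a placeholder:

    import java.io.{BufferedWriter, FileWriter}
    import java.util.UUID

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // The writer is created on the executor, so it is never serialized with the closure.
        val writer = new BufferedWriter(
          new FileWriter(s"/tmp/stream-output-${UUID.randomUUID()}.txt"))
        try {
          records.foreach { r => writer.write(r.toString); writer.newLine() }
        } finally {
          writer.close()
        }
      }
    }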

Re: processing files

2014-11-21 Thread phiroc
Hi Simon, no, I don't need to run the tasks on multiple machines for now. I will therefore stick to Makefile + shell or Java programs, as Spark appears not to be the right tool for the tasks I am trying to accomplish. Thank you for your input. Philippe - Original Message - From: Simon

Re: Spark serialization issues with third-party libraries

2014-11-21 Thread Sean Owen
You are probably casually sending UIMA objects from the driver to executors in a closure. You'll have to design your program so that you do not need to ship these objects to or from the remote task workers. On Fri, Nov 21, 2014 at 8:39 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am
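A common shape for this fix, sketched with hypothetical names since the actual UIMA setup isn't shown in the thread — `rdd` stands for the RDD of documents, and createAnalysisEngine()/process() are stand-ins for however the pipeline is built and invoked:

    rdd.mapPartitions { docs =>
      // Built on the executor, once per partition, so nothing non-serializable
      // ever crosses the wire in the task closure.
      val engine = createAnalysisEngine()
      docs.map(doc => engine.process(doc))
    }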

How to get applicationId for yarn mode(both yarn-client and yarn-cluster mode)

2014-11-21 Thread Earthson
Is there any way to get the yarn application_id inside the program? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-applicationId-for-yarn-mode-both-yarn-client-and-yarn-cluster-mode-tp19462.html Sent from the Apache Spark User List mailing

spark code style

2014-11-21 Thread Kevin Jung
Hi all. Here are two code snippets. And they will produce the same result. 1. rdd.map( function ) 2. rdd.map( function1 ).map( function2 ).map( function3 ) What are the pros and cons of these two methods? Regards Kevin -- View this message in context:

Is there a way to turn on spark eventLog on the worker node?

2014-11-21 Thread Xuelin Cao
Hi, I'm going to debug some spark applications on our testing platform. And it would be helpful if we can see the eventLog on the *worker *node. I've tried to turn on *spark.eventLog.enabled* and set *spark.eventLog.dir* parameters on the worker node. However, it doesn't work. I

Re: Determine number of running executors

2014-11-21 Thread Yanbo Liang
You can get parameters such as spark.executor.memory, but you cannot get executor or core numbers, because executors and cores are parameters of the Spark deploy environment, not the Spark context. val conf = new SparkConf().set("spark.executor.memory", "2g") val sc = new SparkContext(conf)

short-circuit local reads cannot be used

2014-11-21 Thread Daniel Haviv
Hi, Every time I start the spark-shell I encounter this message: 14/11/18 00:27:43 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. Any idea how to overcome it? The short-circuit feature is a big performance boost I don't want to

Re: spark code style

2014-11-21 Thread Gerard Maas
I suppose that here function(x) = function3(function2(function1(x))) In that case, the difference will be modularity and readability of your program. If function{1,2,3} are logically different steps and potentially reusable somewhere else, I'd keep them separate. A sequence of map
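On the performance side, Spark pipelines consecutive narrow transformations within a single stage, so the chained form does not add extra passes over the data; if you still prefer a single map, you can compose the functions explicitly. A small self-contained sketch:

    val rdd = sc.parallelize(1 to 10)

    val f1: Int => Int = _ + 1
    val f2: Int => Int = _ * 2
    val f3: Int => Int = _ - 3

    // Both produce the same result; the three maps are pipelined into one stage.
    val chained  = rdd.map(f1).map(f2).map(f3)
    val composed = rdd.map(f1 andThen f2 andThen f3)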

Re: How to get applicationId for yarn mode(both yarn-client and yarn-cluster mode)

2014-11-21 Thread Earthson
Finally, I've found two ways: 1. Search the output for something like "Submitted application application_1416319392519_0115". 2. Use a specific AppName, so we can query the ApplicationId from YARN. -- View this message in context:
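If going the unique-AppName route, a sketch of the lookup with the Hadoop YarnClient API; the application name is a placeholder and the client picks up the cluster settings from the YARN configuration on the classpath:

    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import scala.collection.JavaConverters._

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Find the application id of the app submitted with a known, unique name.
    val appId = yarnClient.getApplications().asScala
      .find(_.getName == "my-unique-app-name")
      .map(_.getApplicationId)

    yarnClient.stop()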

Re: Another accumulator question

2014-11-21 Thread Sean Owen
This sounds more like a use case for reduce? or fold? it sounds like you're kind of cobbling together the same function on accumulators, when reduce/fold are simpler and have the behavior you suggest. On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: I think I

How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
Hi, I got an error during rdd.registerTempTable(...) saying scala.MatchError: scala.BigInt Looks like BigInt cannot be used in SchemaRDD, is that correct? So what would you recommend to deal with it? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog:

Re: Spark Streaming Metrics

2014-11-21 Thread Gerard Maas
Looks like metrics are not a hot topic to discuss - yet so important to sleep well when jobs are running in production. I've created Spark-4537 https://issues.apache.org/jira/browse/SPARK-4537 to track this issue. -kr, Gerard. On Thu, Nov 20, 2014 at 9:25 PM, Gerard Maas gerard.m...@gmail.com

Re: Why is ALS class serializable ?

2014-11-21 Thread Hao Ren
It makes sense. Thx. =) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-ALS-class-serializable-tp19262p19472.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark: Simple local test failed depending on memory settings

2014-11-21 Thread rzykov
Dear all, We encountered problems of failed jobs with a huge amount of data. A simple local test was prepared for this question at https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb It generates 2 sets of key-value pairs, joins them, selects distinct values and finally counts the data. object

Re: Cores on Master

2014-11-21 Thread Prannoy
Hi, You can also set the cores in the spark application itself . http://spark.apache.org/docs/1.0.1/spark-standalone.html On Wed, Nov 19, 2014 at 6:11 AM, Pat Ferrel-2 [via Apache Spark User List] ml-node+s1001560n19238...@n3.nabble.com wrote: OK hacking the start-slave.sh did it On Nov

Re: Slow performance in spark streaming

2014-11-21 Thread Prannoy
Hi, Spark running in local mode is slower than running on a cluster. Cluster machines usually have a higher configuration, and the tasks are also distributed among workers in order to get a faster result. So you will always find a difference in speed between running locally and running on a cluster. Try running

Re: Parsing a large XML file using Spark

2014-11-21 Thread Prannoy
Hi, Parallel processing of XML files may be an issue due to the tags in the XML file. The XML file has to be intact, because parsing matches start and end entities; if the file is distributed in parts to workers, a worker may or may not find matching start and end tags within its own part, which will

Re: How can I read this avro file using spark scala?

2014-11-21 Thread thomas j
Thanks for the pointer Michael. I've downloaded Spark 1.2.0 from https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and built the spark-avro repo you linked to. When I run it against the example avro file linked to in the documentation it works. However, when I try to load my

Re: How can I read this avro file using spark scala?

2014-11-21 Thread thomas j
I've been able to load a different avro file based on GenericRecord with: val person = sqlContext.avroFile("/tmp/person.avro") When I try to call `first()` on it, I get NotSerializableException exceptions again: person.first() ... 14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage

RE: tableau spark sql cassandra

2014-11-21 Thread jererc
Hi! Sure, I'll post the info I grabbed once the cassandra tables values appear in Tableau. Best, Jerome -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282p19480.html Sent from the Apache Spark User List mailing list

Setup Remote HDFS for Spark

2014-11-21 Thread EH
Hi, Is there any way that I can set up a remote HDFS for Spark (more specifically, for Spark Streaming checkpoints)? The reason I'm asking is that our Spark and HDFS do not run on the same machines. I've looked around but still have no clue so far. Thanks, EH -- View this message in context:

Re: Execute Spark programs from local machine on Yarn-hadoop cluster

2014-11-21 Thread Prannoy
Hi Naveen, I don't think this is possible. If you are setting the master with your cluster details you cannot execute any job from your local machine. You have to execute the jobs inside your YARN machine so that SparkConf is able to connect with all the provided details. If this is not the case

Re: Setup Remote HDFS for Spark

2014-11-21 Thread EH
Unfortunately whether it is possible to have both Spark and HDFS running on the same machine is not under our control. :( Right now we have Spark and HDFS running in different machines. In this case, is it still possible to hook up a remote HDFS with Spark so that we can use Spark Streaming
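One thing worth trying: Spark only needs network access to the NameNode and DataNodes, not co-located daemons, so pointing the checkpoint directory at a fully qualified HDFS URI is generally enough. A sketch with placeholder host and port:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("remote-hdfs-checkpoint")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Fully qualified URI of the remote NameNode; host and port are placeholders.
    ssc.checkpoint("hdfs://remote-namenode:8020/spark/checkpoints")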

Re: RDD data checkpoint cleaning

2014-11-21 Thread Luis Ángel Vicente Sánchez
I have seen the same behaviour while testing the latest spark 1.2.0 snapshot. I'm trying the ReliableKafkaReceiver and it works quite well but the checkpoints folder is always increasing in size. The receivedMetaData folder remains almost constant in size but the receivedData folder is always

Re: How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Yin Huai
Hello Jianshi, The reason for that error is that we do not have a Spark SQL data type for Scala BigInt. You can use Decimal for your case. Thanks, Yin On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got an error during rdd.registerTempTable(...) saying
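A small sketch of that workaround, converting the BigInt field to scala.BigDecimal (which Spark SQL 1.1/1.2 maps to its Decimal type) before registering the table; the case class and field names are made up for illustration:

    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

    case class Order(id: String, amount: BigDecimal)   // BigDecimal instead of BigInt

    val raw = sc.parallelize(Seq(("a", BigInt(100)), ("b", BigInt(250))))
    val orders = raw.map { case (id, amount) => Order(id, BigDecimal(amount)) }

    orders.registerTempTable("orders")
    sqlContext.sql("SELECT id, amount FROM orders").collect().foreach(println)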

Lots of small input files

2014-11-21 Thread Pat Ferrel
I have a job that searches for input recursively and creates a string of pathnames to treat as one input. The files are part-x files and they are fairly small. The job seems to take a long time to complete considering the size of the total data (150m) and only runs on the master machine.
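If the overhead comes from scheduling one tiny task per part file, one thing worth trying (a sketch, not a diagnosis of why it only runs on the master) is coalescing right after reading; the glob path and partition count are placeholders:

    // A glob or comma-separated path list works with textFile; coalesce shrinks
    // the partition count so ~150 MB of data isn't spread over thousands of tasks.
    val input = sc.textFile("hdfs:///data/output/*/part-*").coalesce(8)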

Re: How can I read this avro file using spark scala?

2014-11-21 Thread Simone Franzini
I have also been struggling with reading avro. Very glad to hear that there is a new avro library coming in Spark 1.2 (which, by the way, seems to have a lot of other very useful improvements). In the meantime, I have been able to piece together several snippets/tips that I found from various
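One pattern that comes up repeatedly for pre-1.2 versions (a sketch, assuming avro-mapred is on the classpath; the "name" field is a guess based on the example file) is reading through AvroKeyInputFormat and copying the values out immediately, since Avro reuses the record object:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val names = sc.newAPIHadoopFile(
        "/tmp/person.avro",
        classOf[AvroKeyInputFormat[GenericRecord]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable])
      // Extract the fields you need right away; the GenericRecord instance is
      // reused by the input format, so don't cache or shuffle it as-is.
      .map { case (key, _) => key.datum().get("name").toString }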

Re: Nightly releases

2014-11-21 Thread Arun Ahuja
Great - what can we do to make this happen? So should I file a JIRA to track? Thanks, Arun On Tue, Nov 18, 2014 at 11:46 AM, Andrew Ash and...@andrewash.com wrote: I can see this being valuable for users wanting to live on the cutting edge without building CI infrastructure themselves,

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld
We've done this with reduce - that definitely works. I've reworked the logic to use accumulators because, when it works, it's 5-10x faster On Fri, Nov 21, 2014 at 4:44 AM, Sean Owen so...@cloudera.com wrote: This sounds more like a use case for reduce? or fold? it sounds like you're kind of

Many retries for Python job

2014-11-21 Thread Brett Meyer
I'm running a Python script with spark-submit on top of YARN on an EMR cluster with 30 nodes. The script reads in approximately 3.9 TB of data from S3, and then does some transformations and filtering, followed by some aggregate counts. During Stage 2 of the job, everything looks to complete

Re: Nightly releases

2014-11-21 Thread Arun Ahuja
Great - posted here https://issues.apache.org/jira/browse/SPARK-4542 On Fri, Nov 21, 2014 at 1:03 PM, Andrew Ash and...@andrewash.com wrote: Yes you should file a Jira and echo it out here so others can follow and comment on it. Thanks Arun! On Fri, Nov 21, 2014 at 12:02 PM, Arun Ahuja

SparkSQL Timestamp query failure

2014-11-21 Thread whitebread
Hi all, I put some log files into sql tables through Spark and my schema looks like this: |-- timestamp: timestamp (nullable = true) |-- c_ip: string (nullable = true) |-- cs_username: string (nullable = true) |-- s_ip: string (nullable = true) |-- s_port: string (nullable = true) |--
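Not part of the original message, but the usual suggestion for filtering on a timestamp column in Spark SQL of this era is to cast the string literal explicitly; a sketch, assuming the table was registered as "logs":

    val recent = sqlContext.sql(
      """SELECT c_ip, cs_username
        |FROM logs
        |WHERE timestamp >= CAST('2014-11-20 00:00:00' AS timestamp)""".stripMargin)
    recent.collect().foreach(println)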

Re: Many retries for Python job

2014-11-21 Thread Sandy Ryza
Hi Brett, Are you noticing executors dying? Are you able to check the YARN NodeManager logs and see whether YARN is killing them for exceeding memory limits? -Sandy On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer brett.me...@crowdstrike.com wrote: I’m running a Python script with spark-submit

Re: Parsing a large XML file using Spark

2014-11-21 Thread Paul Brown
Unfortunately, unless you impose restrictions on the XML file (e.g., where namespaces are declared, whether entity replacement is used, etc.), you really can't parse only a piece of it even if you have start/end elements grouped together. If you want to deal effectively (and scalably) with large

Re: Determine number of running executors

2014-11-21 Thread Sandy Ryza
Hi Tobias, One way to find out the number of executors is through SparkContext#getExecutorMemoryStatus. You can find out the number of cores per executor by asking the SparkConf for the spark.executor.cores property, which, if not set, means 1 for YARN. -Sandy On Fri, Nov 21, 2014 at 1:30 AM, Yanbo Liang
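A sketch of both lookups; note that getExecutorMemoryStatus also includes the driver's own block manager, so the executor count is roughly the map size minus one when running on a cluster:

    // Number of executors (block managers minus the driver's own entry).
    val numExecutors = sc.getExecutorMemoryStatus.size - 1

    // Cores per executor; assume 1 on YARN if the property is not set.
    val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)

    val totalCores = numExecutors * coresPerExecutor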

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
Actually, it's a real On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote: Hi, see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one solution. One issue with those XML files is that they cannot be processed line by line in parallel; plus you

Re: Extracting values from a Collecion

2014-11-21 Thread Sanjay Subramanian
I am sorry, the last line in the code is file1Rdd.join(file2RddGrp.mapValues(names => names.toSet)).collect().foreach(println) So my code is: val file1Rdd = sc.textFile("/Users/sansub01/mycode/data/songs/names.txt").map(x => (x.split(",")(0), x.split(",")(1))) val file2Rdd =

Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
(sorry about the previous spam... Google Inbox didn't allow me to cancel the miserable send action :-/) So what I was about to say: it's a real pain in the ass to parse the wikipedia articles in the dump due to these multiline articles... However, there is a way to manage that quite easily,
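The usual trick (a sketch, assuming an XmlInputFormat implementation such as the one that ships with Mahout is on the classpath — adjust the import to wherever you take it from) is to let the record reader split on the page start/end tags so each task receives whole articles:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}

    val hadoopConf = new Configuration()
    hadoopConf.set("xmlinput.start", "<page>")
    hadoopConf.set("xmlinput.end", "</page>")

    // XmlInputFormat hands each <page>...</page> block to a task as one record.
    val pages = sc.newAPIHadoopFile(
        "hdfs:///data/enwiki-latest-pages-articles.xml",
        classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
      .map { case (_, page) => page.toString }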

Re: JVM Memory Woes

2014-11-21 Thread Peter Thai
Quick update: It is a filter job that creates the error above, not the reduceByKey. Why would a filter cause an out of memory? Here is my code: val inputgsup = "hdfs://" + sparkmasterip + "/user/sense/datasets/gsup/binary/30/2014/11/0[1-9]/part*"; val gsupfile =

Re: Many retries for Python job

2014-11-21 Thread Brett Meyer
According to the web UI I don't see any executors dying during Stage 2. I looked over the YARN logs and didn't see anything suspicious, but I may not have been looking closely enough. Stage 2 seems to complete just fine, it's just when it enters Stage 3 that the results from the previous stage

RE: tableau spark sql cassandra

2014-11-21 Thread Mohammed Guller
Thanks, Jerome. BTW, have you tried the CalliopeServer2 from tuplejump? I was able to quickly connect from beeline/Squirrel to my Cassandra cluster using CalliopeServer2, which extends the Spark SQL Thrift Server. It was very straightforward. Next step is to connect from Tableau, but I can't find

Re: MLLib: LinearRegressionWithSGD performance

2014-11-21 Thread Jayant Shekhar
Hi Sameer, You can try increasing the number of executor-cores. -Jayant On Fri, Nov 21, 2014 at 11:18 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I have been using MLLib's linear regression and I have some question regarding the performance. We have a cluster of 10 nodes -- each

Re: MLLib: LinearRegressionWithSGD performance

2014-11-21 Thread Jayant Shekhar
Hi Sameer, You can also use repartition to create a higher number of tasks. -Jayant On Fri, Nov 21, 2014 at 12:02 PM, Jayant Shekhar jay...@cloudera.com wrote: Hi Sameer, You can try increasing the number of executor-cores. -Jayant On Fri, Nov 21, 2014 at 11:18 AM, Sameer Tilak
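A sketch of both suggestions together — more partitions than the default input split count, cached before the iterative training; `labeledPoints` stands for the RDD[LabeledPoint] training set and the partition count is arbitrary:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // Spread the training data over more tasks, and cache it since SGD
    // makes many passes over the same RDD.
    val training = labeledPoints.repartition(80).cache()

    val model = LinearRegressionWithSGD.train(training, 100)  // 100 iterations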

Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Alaa Ali
I want to run queries on Apache Phoenix which has a JDBC driver. The query that I want to run is: select ts,ename from random_data_date limit 10 But I'm having issues with the JdbcRDD upper and lowerBound parameters (that I don't actually understand). Here's what I have so far: import

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Josh Mahonin
Hi Alaa Ali, In order for Spark to split the JDBC query in parallel, it expects an upper and lower bound for your input data, as well as a number of partitions so that it can split the query across multiple tasks. For example, depending on your data distribution, you could set an upper and lower
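A sketch of what that looks like with JdbcRDD against the Phoenix JDBC driver. The bound column (rowkey_id), the ZooKeeper host and the bounds themselves are placeholders; the query must contain exactly two '?' markers, which Spark fills in per partition:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(sc,
      () => DriverManager.getConnection("jdbc:phoenix:zk-host"),
      "SELECT ts, ename FROM random_data_date WHERE rowkey_id >= ? AND rowkey_id <= ?",
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 10,
      mapRow = r => (r.getTimestamp("ts"), r.getString("ename")))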

Re: MongoDB Bulk Inserts

2014-11-21 Thread Benny Thompson
I tried using RDD#mapPartitions but my job completes prematurely and without error, as if nothing gets done. What I have is fairly simple: sc.textFile(inputFile).map(parser.parse).mapPartitions(bulkLoad) But the Iterator[T] of mapPartitions
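One common cause of "nothing happens" here is that mapPartitions is lazy: unless something consumes the iterator it returns, the bulk load never runs. A sketch of the same flow as an action instead, assuming the old synchronous Java driver and a hypothetical toDBObject converter; inputFile and parser are from the original code:

    import com.mongodb.MongoClient
    import scala.collection.JavaConverters._

    sc.textFile(inputFile)
      .map(parser.parse)
      .foreachPartition { records =>
        // One client per partition; foreachPartition is an action, so it runs eagerly.
        val client = new MongoClient("mongo-host")
        val coll = client.getDB("mydb").getCollection("events")
        try {
          records.grouped(1000).foreach { batch =>
            coll.insert(batch.map(toDBObject).toList.asJava)  // bulk insert up to 1000 docs
          }
        } finally {
          client.close()
        }
      }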

Running Spark application from Tomcat

2014-11-21 Thread Andreas Koch
I have a Spark java application that I run in local-mode. As such it runs without any issues. Now, I would like to run it as a webservice from Tomcat. The first issue I had with this was that the spark-assembly jar contains javax.servlet, which Tomcat does not allow. Therefore I removed

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Alaa Ali
Awesome, thanks Josh, I missed that previous post of yours! But your code snippet shows a select statement, so what I can do is just run a simple select with a where clause if I want to, and then run my data processing on the RDD to mimic the aggregation I want to do with SQL, right? Also, another

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Alex Kamil
Ali, just create a BIGINT column with numeric values in Phoenix and use sequences http://phoenix.apache.org/sequences.html to populate it automatically. I included the setup below in case someone starts from scratch. Prerequisites: - export JAVA_HOME, SCALA_HOME and install sbt - install hbase in

Persist kafka streams to text file

2014-11-21 Thread Joanne Contact
Hello, I am trying to read a kafka stream into a text file by running Spark from my IDE (IntelliJ IDEA). The code is similar to a previous thread on persisting a stream to a text file. I am new to Spark and Scala. I believe Spark is in local mode, as the console shows 14/11/21 14:17:11 INFO
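For reference, a minimal local-mode sketch of that pipeline; the ZooKeeper host, consumer group, topic name and output prefix are all placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaToText")
    val ssc = new StreamingContext(conf, Seconds(10))

    // (topic -> number of receiver threads); the stream yields (key, message) pairs.
    val lines = KafkaUtils
      .createStream(ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))
      .map(_._2)

    // Writes one directory of part files per batch, prefixed with the given path.
    lines.saveAsTextFiles("file:///tmp/kafka-out/batch")

    ssc.start()
    ssc.awaitTermination()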

Persist kafka streams to text file, tachyon error?

2014-11-21 Thread Joanne Contact
use the right email list. -- Forwarded message -- From: Joanne Contact joannenetw...@gmail.com Date: Fri, Nov 21, 2014 at 2:32 PM Subject: Persist kafka streams to text file To: u...@spark.incubator.apache.org Hello I am trying to read kafka stream to a text file by running spark

Re: SparkSQL - can we add new column(s) to parquet files

2014-11-21 Thread Evan Chan
I would expect an SQL query on c would fail because c would not be known in the schema of the older Parquet file. What I'd be very interested in is how to add a new column as an incremental new parquet file, and be able to somehow join the existing and new file, in an efficient way. IE, somehow

Re: Another accumulator question

2014-11-21 Thread Andrew Ash
Hi Nathan, It sounds like what you're asking for has already been filed as https://issues.apache.org/jira/browse/SPARK-664 Does that ticket match what you're proposing? Andrew On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We've done this with reduce -

RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info Andy. A big help. One thing - I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of [doc_id, doc_string] for each document and split that into one RDD for each column. The values in the doc_string RDD
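A sketch of that idea with MLlib's HashingTF/IDF, keeping a (doc_id, tokens) pair RDD so the TF vectors stay keyed by document. The final zip relies on both sides being derived from the same parent RDD by map-only operations — which is exactly the ordering question raised in the next message:

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val docs = sc.parallelize(Seq(
      ("doc1", Seq("spark", "mllib", "tfidf")),
      ("doc2", Seq("spark", "streaming"))))

    val hashingTF = new HashingTF()
    val tfByDoc = docs.mapValues(tokens => hashingTF.transform(tokens))  // keys kept

    val idfModel = new IDF().fit(tfByDoc.values)
    // Zip doc ids back onto the transformed vectors (same parent, map-only lineage).
    val tfidfByDoc = tfByDoc.keys.zip(idfModel.transform(tfByDoc.values))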

Re: Using TF-IDF from MLlib

2014-11-21 Thread andy petrella
Yeah, I initially used zip but I was wondering how reliable it is. I mean, is the order guaranteed? What if some node fails, and the data is pulled out from different nodes? And even if it can work, I found this implicit semantic quite uncomfortable, don't you? My 0.2c On Fri, Nov 21, 2014 at 15:26,

Book: Data Analysis with SparkR

2014-11-21 Thread Emaasit
Is there a book on SparkR for the absolutely terrified beginner? I use R for my daily analysis and I am interested in a detailed guide to using SparkR for data analytics: like a book or online tutorials. If there are any, please direct me to them. Thanks, Daniel -- View this message in

Re: How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
Ah yes. I found it too in the manual. Thanks for the help anyway! Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to Decimal. I created a ticket: https://issues.apache.org/jira/browse/SPARK-4549 Jianshi On Fri, Nov 21, 2014 at 11:30 PM, Yin Huai

Missing parents for stage (Spark Streaming)

2014-11-21 Thread YaoPau
When I submit a Spark Streaming job, I see these INFO logs printing frequently: 14/11/21 18:53:17 INFO DAGScheduler: waiting: Set(Stage 216) 14/11/21 18:53:17 INFO DAGScheduler: failed: Set() 14/11/21 18:53:17 INFO DAGScheduler: Missing parents for Stage 216: List() 14/11/21 18:53:17 INFO

Re: Book: Data Analysis with SparkR

2014-11-21 Thread Zongheng Yang
Hi Daniel, Thanks for your email! We don't have a book (yet?) specifically on SparkR, but here's a list of helpful tutorials / links you can check out (I am listing them in roughly basic-to-advanced order): - AMPCamp5 SparkR exercises http://ampcamp.berkeley.edu/5/exercises/sparkr.html. This

Re: MongoDB Bulk Inserts

2014-11-21 Thread Soumya Simanta
Does bulkLoad hold the connection to MongoDB? On Fri, Nov 21, 2014 at 4:34 PM, Benny Thompson ben.d.tho...@gmail.com wrote: I tried using RDD#mapPartitions but my job completes prematurely and without error as if nothing gets done. What I have is fairly simple sc

allocating different memory to different executor for same application

2014-11-21 Thread tridib
Hello Experts, I have 5 worker machines with different sizes of RAM. Is there a way to configure them with different executor memory? Currently I see that every worker spins up 1 executor with the same amount of memory. Thanks and regards, Tridib -- View this message in context:

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld
I'm not sure if it's an exact match, or just very close :-) I don't think our problem is the workload on the driver, I think it's just memory - so while the solution proposed there would work, it would also be sufficient for our purposes, I believe, simply to clear each block as soon as it's added

spark-sql broken

2014-11-21 Thread tridib
After taking today's build from the master branch I started getting this error when running spark-sql: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. I used the following command for building: ./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.4.0 -Phadoop-2.4

latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-21 Thread Judy Nash
Hi, Thrift server is failing to start for me on the latest spark 1.2 branch. I got the error below when I start the thrift server. Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions at org.apache.hadoop.conf.Configuration$DeprecationDelta.init(Configur

Spark streaming job failing after some time.

2014-11-21 Thread pankaj channe
I have seen similar posts on this issue but could not find a solution. Apologies if this has been discussed here before. I am running a spark streaming job with yarn on a 5 node cluster. I am using the following command to submit my streaming job. spark-submit --class class_name --master yarn-cluster

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-21 Thread Cheng Lian
Hi Judy, could you please provide the commit SHA1 of the version you're using? Thanks! On 11/22/14 11:05 AM, Judy Nash wrote: Hi, Thrift server is failing to start for me on latest spark 1.2 branch. I got the error below when I start thrift server. Exception in thread main