Re: Why are there different parts in my CSV?

2015-02-14 Thread Akhil Das
You can directly write to hbase with Spark. Here's an example for doing that https://issues.apache.org/jira/browse/SPARK-944 Thanks Best Regards On Sat, Feb 14, 2015 at 2:55 PM, Su She suhsheka...@gmail.com wrote: Hello Akhil, thank you for your continued help! 1) So, if I can write it in
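
A minimal Scala sketch of one common way to write an RDD to HBase, using TableOutputFormat with saveAsNewAPIHadoopDataset. The table name, column family, and sample records below are made up for illustration and may differ from the example linked above.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object HBaseWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-write-sketch"))

    // Hypothetical table/column names -- adjust to your schema.
    val conf = HBaseConfiguration.create()
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Each record becomes a Put keyed by its row key.
    val records = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
    val puts = records.map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      // Put.add(...) in HBase 0.94/0.98; use addColumn(...) on HBase 1.x+.
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }

    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}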

Configuration Problem? (need help to get Spark job executed)

2015-02-14 Thread NORD SC
Hi all, I am new to spark and seem to have hit a common newbie obstacle. I have a pretty simple setup and job but I am unable to get past this error when executing a job: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark Just read this...seems like it should be easily readable. Thanks! On Sat, Feb 14, 2015 at 1:36 AM, Su She suhsheka...@gmail.com wrote: Thanks Akhil for the link. Is there a reason why there is a new directory created

Re: Why are there different parts in my CSV?

2015-02-14 Thread Akhil Das
Simplest way would be to merge the output files at the end of your job like: hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt If you want to do it programmatically, then you can use the FileUtil.copyMerge API, like: FileUtil.copyMerge(FileSystem of source(hdfs),
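
A minimal sketch of the programmatic route, assuming a Hadoop 1.x/2.x client where FileUtil.copyMerge is still available (it was removed in Hadoop 3); the source and destination paths are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merges all part files under an HDFS output directory into a single file.
object MergeParts {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    FileUtil.copyMerge(
      fs, new Path("/output/dir/on/hdfs"),        // source: directory of part-* files
      fs, new Path("/merged/on/hdfs/file.csv"),   // destination: single merged file
      false,                                      // deleteSource: keep the part files
      conf, null)                                 // addString: no separator between parts
  }
}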

Strategy to automatically configure spark workers env params in standalone mode

2015-02-14 Thread Mike Sam
We are planning to use varying server specs (32 GB, 64GB, 244GB RAM or even higher and varying cores) for a standalone deployment of spark but we do not know the spec of the server ahead of time and we need to script up some logic that will run on the server on boot and automatically set the

Re: SQLContext.applySchema strictness

2015-02-14 Thread nitin
AFAIK, this is the expected behavior. You have to make sure that the schema matches the row. It won't give any error when you apply the schema as it doesn't validate the nature of data. -- View this message in context:

Re: failing GraphX application ('GC overhead limit exceeded', 'Lost executor', 'Connection refused', etc.)

2015-02-14 Thread Matthew Cornell
Oops! I forgot to excerpt the errors and warnings from that file: 15/02/12 08:02:03 ERROR TaskSchedulerImpl: Lost executor 4 on compute-0-3.wright: remote Akka client disassociated 15/02/12 08:03:00 WARN TaskSetManager: Lost task 1.0 in stage 28.0 (TID 37, compute-0-1.wright):

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Hello Akhil, thank you for your continued help! 1) So, if I can write it programmatically after every batch, then technically I should be able to have just the csv files in one directory. However, can the /desired/output/file.txt be in hdfs? If it is only local, I am not sure if it will help me

Spark Web UI Doesn't Open in Yarn-Client Mode

2015-02-14 Thread Puneet Kumar Ojha
Hi, I am running a 3 node spark cluster on EMR. While running a job I see 1 executor running. Does that mean only 1 of the nodes is being used? (Seems so from the Spark Documentation on the default mode (LOCAL).) When I switch to yarn-client mode the Spark Web UI doesn't open. How to view the job running

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Thanks Akhil for the link. Is there a reason why there is a new directory created for each batch? Is this a format that is easily readable by other applications such as hive/impala? On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can directly write to hbase

Re: Why are there different parts in my CSV?

2015-02-14 Thread Sean Owen
Keep in mind that if you repartition to 1 partition, you are only using 1 task to write the output, and potentially only 1 task to compute some parent RDDs. You lose parallelism. The files-in-a-directory output scheme is standard for Hadoop and for a reason. Therefore I would consider separating
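
A sketch of the trade-off Sean describes, with placeholder paths and synthetic data: the default write keeps one task per partition and produces a directory of part files, while coalescing to one partition yields a single file at the cost of parallelism.

import org.apache.spark.{SparkConf, SparkContext}

object PartFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("part-files-sketch"))
    val rdd = sc.parallelize(1 to 1000000).map(i => s"$i,value-$i")

    // Default: each partition writes its own part-* file, in parallel.
    rdd.saveAsTextFile("hdfs:///output/csv-parallel")

    // Forcing one partition funnels everything through a single task:
    // one output file, but no write parallelism (and possibly less upstream parallelism).
    rdd.coalesce(1).saveAsTextFile("hdfs:///output/csv-single-file")

    sc.stop()
  }
}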

SparkStreaming Low Performance

2015-02-14 Thread Akhil Das
I'm getting low performance while parsing json data. My cluster setup is 1.2.0 version of spark with 10 Nodes each having 15Gb of memory and 4 cores. I tried both scala.util.parsing.json.JSON and fasterxml's Jackson parser. This is what I basically do: *//Approach 1:* val jsonStream =

Re: Is Ubuntu server or desktop better for spark cluster

2015-02-14 Thread Sean Owen
I don't think this is very specific to Spark. Spark is not a desktop user application. While it is perfectly possible to run it on such a distro, I would think a server distro is more appropriate. On Sat, Feb 14, 2015 at 3:05 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi gurus, I am

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
I see. I'd really benchmark how the parsing performs outside Spark (in a tight loop or something). If *that* is slow, you know it's the parsing. If not, it's not the parsing. Another thing you want to look at is CPU usage. If the actual parsing really is the bottleneck, you should see very high
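
A sketch of such a tight-loop benchmark with Jackson, entirely outside Spark; the sample JSON and record count are invented for illustration.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Times Jackson parsing alone over a synthetic batch of records.
object ParseBenchmark {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)

    val sample = """{"user":"alice","score":42,"tags":["a","b","c"]}"""
    val records = Array.fill(8000)(sample)

    val start = System.nanoTime()
    var parsed = 0
    records.foreach { json =>
      val node = mapper.readTree(json)   // parse only; no Spark involved
      if (node.has("user")) parsed += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Parsed $parsed records in $elapsedMs ms " +
      f"(${elapsedMs / records.length}%.3f ms/record)")
  }
}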

Re: Spark Web UI Doesn't Open in Yarn-Client Mode

2015-02-14 Thread Silvio Fiorito
On YARN you need to first go to the resource manager UI, find your job, and click the link for the UI there. From: Puneet Kumar Ojha puneet.ku...@pubmatic.com Sent: Saturday, February 14, 2015 5:25 AM To: user@spark.apache.org Hi, I am running 3

Re: SparkException: Task not serializable - Jackson Json

2015-02-14 Thread mickdelaney
To get past this you can move the mapper creation code down into the closure. It's then created on the worker node so it doesn't need to be serialized. // Parse it into a specific case class. We use flatMap to handle errors // by returning an empty list (None) if we encounter an issue and a //
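
A hedged sketch of that idea using mapPartitions, so the mapper is built once per partition on the executor rather than captured from the driver; the Event case class and its fields are hypothetical.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import org.apache.spark.rdd.RDD

// Hypothetical record type for illustration.
case class Event(user: String, score: Int)

object ParseInsideClosure {
  def parse(lines: RDD[String]): RDD[Event] =
    lines.mapPartitions { iter =>
      // Created here, on the executor, once per partition --
      // nothing non-serializable is captured from the driver.
      val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)

      iter.flatMap { json =>
        // Return None on bad input instead of failing the task.
        try Some(mapper.readValue[Event](json))
        catch { case _: Exception => None }
      }
    }
}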

Re: SparkStreaming Low Performance

2015-02-14 Thread Akhil Das
Ah my bad, it works without the serializable exception. But there isn't much of a performance difference though. Thanks Best Regards On Sat, Feb 14, 2015 at 7:45 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Thanks for the suggestion, but doing that gives me this exception:

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
Huh, that would come to 6.5ms per JSON record. That does feel like a lot, but if your JSON documents are big enough, I guess you could get that sort of processing time. Jackson is more or less the most efficient JSON parser out there, so unless the Scala API is somehow affecting it, I don't see any better

Is Ubuntu server or desktop better for spark cluster

2015-02-14 Thread Joanne Contact
Hi gurus, I am trying to install a real linux machine (not a VM) where I will install Spark and also Hadoop. I plan on learning the clusters. I found Ubuntu has desktop and server versions. Does it matter? Thanks!! J

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
(adding back user) Fair enough. Regarding serialization exception, the hack I use is to have a object with a transient lazy field, like so: object Holder extends Serializable { @transient lazy val mapper = new ObjectMapper() } This way, the ObjectMapper will be instantiated at the
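
A sketch of that holder pattern, with the Scala module registered inside the lazy initializer (an addition beyond the quoted snippet) and an example of using it inside a map; the field name pulled out is hypothetical.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.rdd.RDD

// The mapper is @transient lazy, so it is never serialized with the closure;
// each executor JVM builds its own instance on first use.
object Holder extends Serializable {
  @transient lazy val mapper: ObjectMapper = {
    val m = new ObjectMapper()
    m.registerModule(DefaultScalaModule)   // register inside the initializer
    m
  }
}

object HolderUsage {
  // Pulls a single (hypothetical) field out of each JSON line.
  def userNames(lines: RDD[String]): RDD[String] =
    lines.map(json => Holder.mapper.readTree(json).path("user").asText())
}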

Re: SparkStreaming Low Performance

2015-02-14 Thread Akhil Das
Thanks for the suggestion, but doing that gives me this exception: http://pastebin.com/ni80NqKn Over this piece of code: object Holder extends Serializable { @transient lazy val mapper = new ObjectMapper() with ScalaObjectMapper mapper.registerModule(DefaultScalaModule) }

Re: SparkStreaming Low Performance

2015-02-14 Thread Akhil Das
Thanks again! It's with the parser only, just tried the parser https://gist.github.com/akhld/3948a5d91d218eaf809d without Spark, and it took me 52 sec to process 8k json records. Not sure if there's an efficient way to do this in Spark, I know if I use sparkSQL with schemaRDD and all it will be
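
Since SparkSQL comes up here: a hedged sketch of handing the JSON to Spark SQL's 1.2-era jsonRDD API instead of parsing by hand. The input path and the queried field names are made up.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Lets Spark SQL parse the JSON lines and infer a schema.
object JsonWithSparkSql {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-sql-sketch"))
    val sqlContext = new SQLContext(sc)

    val jsonLines = sc.textFile("hdfs:///input/events.json")
    val events = sqlContext.jsonRDD(jsonLines)   // SchemaRDD in 1.2, DataFrame from 1.3
    events.registerTempTable("events")

    sqlContext.sql("SELECT user, COUNT(*) FROM events GROUP BY user")
      .collect()
      .foreach(println)

    sc.stop()
  }
}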

Re: Is Ubuntu server or desktop better for spark cluster

2015-02-14 Thread Deepak Vohra
For a beginner Ubuntu Desktop is recommended as it includes a GUI and is easier to install. Also refer to ServerFaq - Community Help Wiki: Frequently Asked Questions about the Ubuntu Server Edition. This Frequently Asked Questions document

Re: SparkSQL and star schema

2015-02-14 Thread Michael Armbrust
Yes. Though for good performance it is usually important to make sure that you have statistics for the smaller dimension tables. Today that can be done by creating them in the hive metastore and running ANALYZE TABLE table COMPUTE STATISTICS noscan. In Spark 1.3 this will happen automatically
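
A hedged sketch of issuing that statement from a HiveContext before running a star-schema join; the table and column names are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Collect size statistics for the small dimension tables so Spark SQL can
// consider broadcast joins against the fact table.
object AnalyzeDimensions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("analyze-dims"))
    val hiveContext = new HiveContext(sc)

    Seq("dim_customer", "dim_product").foreach { table =>
      hiveContext.sql(s"ANALYZE TABLE $table COMPUTE STATISTICS noscan")
    }

    // With stats in place, the small side of the join can be broadcast
    // (subject to spark.sql.autoBroadcastJoinThreshold).
    hiveContext.sql(
      "SELECT c.region, SUM(f.amount) " +
      "FROM fact_sales f JOIN dim_customer c ON f.customer_id = c.id " +
      "GROUP BY c.region").collect().foreach(println)

    sc.stop()
  }
}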

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
I double checked the 1.2 feature list and found out that the new sort-based shuffle manager has nothing to do with HashPartitioner :- Sorry for the misinformation. On the other hand, this may explain the increase in shuffle spill as a side effect of the new shuffle manager, let me revert
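
For comparison purposes, a sketch of explicitly switching back to the pre-1.2 hash-based shuffle via the spark.shuffle.manager setting, to see whether the shuffle-write growth tracks the new default.

import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.2 made sort-based shuffle the default; setting "hash" restores
// the old implementation so the two can be compared on the same job.
object HashShuffleComparison {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-comparison")
      .set("spark.shuffle.manager", "hash")   // default in 1.2 is "sort"
    val sc = new SparkContext(conf)

    // ... run the job whose shuffle write you are measuring ...

    sc.stop()
  }
}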

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
Same problem here, shuffle write increased from 10G to over 64G; since I'm running on Amazon EC2 this always causes the temporary folder to consume all the disk space. Still looking for a solution. BTW, the 64G shuffle write is encountered on shuffling a pairRDD with HashPartitioner, so it's not

Re: SQLContext.applySchema strictness

2015-02-14 Thread Michael Armbrust
Doing runtime type checking is very expensive, so we only do it when necessary (i.e. you perform an operation like adding two columns together) On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote: AFAIK, this is the expected behavior. You have to make sure that the schema
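
A sketch illustrating the point: applySchema accepts a mismatched row without complaint, and the error only surfaces when an operation touches the data. Imports follow the Spark 1.3 package layout (in 1.2 the same types live directly under org.apache.spark.sql); the sample schema and row are invented.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ApplySchemaNoValidation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("apply-schema-sketch"))
    val sqlContext = new SQLContext(sc)

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    // Second field is a String, not an Int -- the row does not match the schema.
    val rows = sc.parallelize(Seq(Row("alice", "not-a-number")))

    val people = sqlContext.applySchema(rows, schema)   // no error here
    people.registerTempTable("people")

    // The mismatch only shows up when the data is actually evaluated,
    // e.g. as a cast failure at execution time:
    sqlContext.sql("SELECT age + 1 FROM people").collect()
  }
}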

Re: Build spark failed with maven

2015-02-14 Thread Olivier Girardot
Hi, I could not reproduce this; what kind of JDK are you using for the zinc server? Regards, Olivier. 2015-02-11 5:08 GMT+01:00 Yi Tian tianyi.asiai...@gmail.com: Hi, all I got an ERROR when building the spark master branch with maven (commit: 2d1e916730492f5d61b97da6c483d3223ca44315)

Re: SQLContext.applySchema strictness

2015-02-14 Thread Nicholas Chammas
Would it make sense to add an optional validate parameter to applySchema() which defaults to False, both to give users the option to check the schema immediately and to make the default behavior clearer? ​ On Sat Feb 14 2015 at 9:18:59 AM Michael Armbrust mich...@databricks.com wrote: Doing

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Thanks Sean and Akhil! I will take out the repartition(1). Please let me know if I understood this correctly, Spark Streaming writes data like this: foo-1001.csv/part-x, part-x foo-1002.csv/part-x, part-x When I see this on Hue, the csv's appear to me as *directories*,

Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Okay, got it, thanks for the help Sean! On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen so...@cloudera.com wrote: No, they appear as directories + files to everything. Lots of tools are used to taking an input that is a directory of part files though. You can certainly point MR, Hive, etc at a

Re: Why are there different parts in my CSV?

2015-02-14 Thread Sean Owen
No, they appear as directories + files to everything. Lots of tools are used to taking an input that is a directory of part files though. You can certainly point MR, Hive, etc at a directory of these files. On Sat, Feb 14, 2015 at 9:05 PM, Su She suhsheka...@gmail.com wrote: Thanks Sean and
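
A small sketch of that last point: pointing textFile at the output directory (or a glob over several batch directories) reads all the part files inside it, so downstream jobs do not need a single merged file. The paths are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object ReadPartFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-part-files"))

    // The directory written by one batch, not an individual part file.
    val oneBatch = sc.textFile("hdfs:///user/me/foo-1001.csv")
    println(s"Lines in one batch directory: ${oneBatch.count()}")

    // A glob picks up the part files of several batch directories at once.
    val allBatches = sc.textFile("hdfs:///user/me/foo-*.csv")
    println(s"Lines across all batches: ${allBatches.count()}")

    sc.stop()
  }
}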