Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
I have a Spark cluster with 3 worker nodes: *Workers:* 3, *Cores:* 48 Total (48 Used), *Memory:* 469.8 GB Total (72.0 GB Used). I want to process a single compressed (*.gz) file on HDFS. The file is 1.5 GB compressed and 11 GB uncompressed. When I try to read the compressed file from HDFS
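The slowness comes from gzip not being a splittable format: one header, one stream, so HDFS cannot hand independent chunks to separate tasks and Spark reads the whole file through a single task. A small stand-alone Python sketch (no Spark needed) of why a reader cannot start mid-stream:

```python
import gzip

payload = b"hello world " * 1000
data = gzip.compress(payload)

# Reading from the start works: one header, one deflate stream.
assert gzip.decompress(data) == payload

# Reading from an arbitrary offset does not: there is no header or
# block boundary mid-stream for a second reader to resume from,
# which is why a .gz split cannot be assigned to a parallel task.
try:
    gzip.decompress(data[len(data) // 2:])
    mid_stream_ok = True
except Exception:  # BadGzipFile / EOFError depending on the offset
    mid_stream_ok = False

assert not mid_stream_ok
```

Formats like bzip2 (or lzo with an index) mark block boundaries, which is what makes them splittable across tasks.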

Re: How to read a multipart s3 file?

2014-05-11 Thread Nicholas Chammas
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote: I was using s3n:// but I got frustrated by how slow it is at writing files. I'm curious: How slow is slow? How long does it take you, for example, to save a 1GB file to S3 using s3n vs s3?

Spark LIBLINEAR

2014-05-11 Thread Chieh-Yen
Dear all, Recently we released a distributed extension of LIBLINEAR at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/ Currently, TRON for logistic regression and L2-loss SVM are supported. We provide both MPI and Spark implementations. This is very preliminary, so your

Re: Is there anything that I need to modify?

2014-05-11 Thread Arpit Tak
Try adding a hostname-to-IP mapping in /etc/hosts. It's not able to resolve the IP to a hostname. Try this ... localhost 192.168.10.220 CHBM220 On Wed, May 7, 2014 at 12:50 PM, Sophia sln-1...@163.com wrote: [root@CHBM220 spark-0.9.1]#
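For reference, /etc/hosts expects the IP address first and the hostname(s) after it; using the names from the quoted thread, a correct pair of entries would look like:

```
# /etc/hosts -- IP first, then hostname(s)
127.0.0.1       localhost
192.168.10.220  CHBM220
```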

Re: Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-11 Thread Francis . Hu
I have just resolved the problem by running the master and worker daemons individually on their own hosts. If I launch them with the script sbin/start-all.sh, the problem always occurs. From: Francis.Hu [mailto:francis...@reachjunction.com] Sent: Tuesday, May 06, 2014 10:31 To: user@spark.apache.org

Re: How can adding a random count() change the behavior of my program?

2014-05-11 Thread Walrus theCat
Nick, I have encountered strange things like this before (usually when programming with mutable structures and side effects), and for me, the answer was that, until .count() (or .first(), or similar) is called, your variable 'a' refers to a set of instructions that only get executed to form the
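The laziness described above is easy to reproduce outside Spark. A minimal Python analogue (map() in Python 3 is lazy, like an RDD transformation, and list() plays the role of .count() forcing evaluation):

```python
calls = []

def transform(x):
    # side effect: record that the function actually ran
    calls.append(x)
    return x * 2

# Building the pipeline executes nothing -- like defining an RDD
# transformation, it only records instructions.
pipeline = map(transform, range(5))
assert calls == []

# Forcing evaluation (the analogue of .count()) runs the closures.
result = list(pipeline)
assert result == [0, 2, 4, 6, 8]
assert calls == [0, 1, 2, 3, 4]
```

Any mutable state the closures touch is therefore only updated at the point an action runs, not where the transformation is written.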

Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
Yep. I figured that out. I uncompressed the file and it loads much faster now. Thanks. On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: .gz files are not splittable, hence harder to process. Easiest is to move to a splittable compression like lzo and break the file

Re: Test

2014-05-11 Thread Azuryy
Got it. But that doesn't mean everyone can receive this test; the mailing list has been unstable recently. Sent from my iPhone 5s On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote: This message has no content.

Re: Is there any problem on the spark mailing list?

2014-05-11 Thread lukas nalezenec
There was an outage: https://blogs.apache.org/infra/entry/mail_outage On Fri, May 9, 2014 at 1:27 PM, wxhsdp wxh...@gmail.com wrote: I think so; fewer questions and answers these three days -- View this message in context:

Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?

2014-05-11 Thread Madhu
Svend, I built it on my iMac and it was about the same speed as Windows 7, RHEL 6 VM on Windows 7, and Linux on EC2. Spark is pleasantly easy to build on all of these platforms, which is wonderful. How long does it take to start spark-shell? Maybe it's a JVM memory setting problem on your
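One quick thing to check is which JVM the shell picks up and what heap it is given. A sketch assuming Spark 0.9.x conventions (SPARK_MEM was that era's knob for the shell's JVM heap; the 2g value is an arbitrary example):

```
# confirm which JVM is on the PATH
java -version

# Spark 0.9.x sizes the spark-shell JVM heap from SPARK_MEM
export SPARK_MEM=2g
./bin/spark-shell
```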

Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
Resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote: In writing my own RDD I ran into a few issues with respect to stuff being private in Spark. In compute I would like to return an iterator that respects task

Re: is Mesos falling out of favor?

2014-05-11 Thread Gary Malouf
For what it is worth, our team here at MediaCrossing (http://mediacrossing.com) has been using the Spark/Mesos combination since last summer with much success (low operations overhead, high developer performance). IMO, Hadoop is overcomplicated from both a development and operations perspective so I

Re: Spark LIBLINEAR

2014-05-11 Thread Debasish Das
Hello Prof. Lin, Awesome news! I am curious whether you have any benchmarks comparing the C++ MPI and Scala Spark LIBLINEAR implementations... Is Spark LIBLINEAR Apache licensed, or are there any specific restrictions on using it? Except for the native BLAS libraries (which each user has to manage by

Re: How to use spark-submit

2014-05-11 Thread Soumya Simanta
Will sbt-pack and the Maven solution work for the Scala REPL? I need the REPL because it saves a lot of time when I'm playing with large data sets: I load them once, cache them, and then try things out interactively before putting them in a standalone driver. I've got sbt working for my own

Re: Comprehensive Port Configuration reference?

2014-05-11 Thread Mark Baker
On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger jeis...@us.ibm.com wrote: In a nutshell, Spark opens up a couple of well-known ports, and then the workers and the shell open up dynamic ports for each job. These dynamic ports make securing the Spark network difficult. Indeed. Judging by
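Of the well-known ports, at least the driver and web UI ports can be pinned through configuration; a sketch assuming Spark property names from the 0.9/1.0 era (the values are arbitrary examples, and the per-job worker/shell ports remained ephemeral at the time, which is exactly the hardening difficulty raised above):

```
# set via SparkConf or system properties -- values are examples
spark.driver.port   7078
spark.ui.port       4040
```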

Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
will do On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote: You got a good point there, those APIs should probably be marked as @DeveloperAPI. Would you mind filing a JIRA for that ( https://issues.apache.org/jira/browse/SPARK)? On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers

Re: How to use spark-submit

2014-05-11 Thread Stephen Boesch
Hi Sonal, Yes, I am working towards that same idea. How did you go about creating the non-Spark-jar dependencies? The way I am doing it is a separate straw-man project that does not include Spark but has the external third-party jars included. Then running sbt compile:managedClasspath and
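An alternative to a straw-man project, for what it's worth, is marking Spark itself as provided so that only the third-party jars end up in the packaged classpath; a minimal build.sbt sketch (sbt 0.13-era syntax assumed, version numbers are examples):

```scala
name := "my-spark-app"

scalaVersion := "2.10.4"

// compile against Spark, but leave it out of the packaged jar --
// the cluster supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"
```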

Re: Test

2014-05-11 Thread Aaron Davidson
I didn't get the original message, only the reply. Ruh-roh. On Sun, May 11, 2014 at 8:09 AM, Azuryy azury...@gmail.com wrote: Got it. But that doesn't mean everyone can receive this test; the mailing list has been unstable recently. Sent from my iPhone 5s On May 10, 2014, at 13:31, Matei Zaharia

Streaming on HDFS can detect all new files, but the sum of all the rdd.count() values does not equal the number detected

2014-05-11 Thread zzzzzqf12345
When I put 200 PNG files into HDFS, I found Spark Streaming could detect 200 files, but the sum of rdd.count() is less than 200, always between 130 and 170. I don't know why... Is this a bug? PS: When I put the 200 files into HDFS before the streaming job runs, it gets the correct count and the right result. Here is

Build Shark (Hadoop CDH5) on Hadoop 2.0.0 CDH4

2014-05-11 Thread Sophia
I have built Shark with sbt, but this sbt exception turns up: [error] sbt.ResolveException: unresolved dependency: org.apache.hadoop#hadoop-client;2.0.0: not found. What can I do to build it successfully? -- View this message in context:
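A plain hadoop-client 2.0.0 artifact does not exist in Maven Central; CDH builds carry a -cdh4.x.x suffix and are served from Cloudera's repository. A hedged sbt sketch (the exact version string is a placeholder and must match your cluster's CDH release):

```scala
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

// placeholder version -- substitute your cluster's CDH4 release
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.7.0"
```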

Re: Is there any problem on the spark mailing list?

2014-05-11 Thread ankurdave
I haven't been getting mail either. This was the last message I received: http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5491.html -- View this message in context:

about spark interactive shell

2014-05-11 Thread fengshen
Hi all, I am now using Spark in production, but I notice the Spark driver includes the RDD and DAG state... and the executors will try to register with the driver. I think the driver should run on the cluster and the client should run on the gateway. Something like: