No space left on device

2014-08-09 Thread kmatzen
I need some configuration / debugging recommendations to work around a "no space left on device" error. I am completely new to Spark, but I have some experience with Hadoop. I have a task where I read images stored in sequence files from s3://, process them with a map in Scala, and write the result back

set SPARK_LOCAL_DIRS issue

2014-08-09 Thread Baoqiang Cao
Hi, I'm trying to use a specific dir as Spark's working directory since I have limited space at /tmp. I tried: 1) export SPARK_LOCAL_DIRS="/mnt/data/tmp", or 2) SPARK_LOCAL_DIRS="/mnt/data/tmp" in spark-env.sh. But neither worked; the output of Spark still says ERROR
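An alternative to the environment variable, sketched below, is setting spark.local.dir on the SparkConf. Note that on the standalone cluster manager a worker-side SPARK_LOCAL_DIRS overrides the application's value, so spark-env.sh must actually be sourced on every node for route 2) to work. The app name here is illustrative; /mnt/data/tmp must exist and be writable on all workers.

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark's scratch space at the big partition instead of /tmp.
    val conf = new SparkConf()
      .setAppName("local-dir-example")
      .set("spark.local.dir", "/mnt/data/tmp")
    val sc = new SparkContext(conf)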

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-09 Thread Fengyun RAO
Although nobody has answered the two questions, in my practice it seems the answer to both is yes. 2014-08-04 19:50 GMT+08:00 Fengyun RAO raofeng...@gmail.com: object LogParserWrapper { private val logParser = { val settings = new ... val builders = new
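For reference, a fleshed-out sketch of the singleton-object pattern this thread describes; LogParser and Settings are hypothetical stand-ins for the non-serializable class:

    // Sharing a non-serializable object among tasks in the same worker JVM.
    // LogParser and Settings are hypothetical stand-ins.
    object LogParserWrapper {
      // Built at most once per JVM (i.e., once per executor);
      // never serialized into the task closure.
      lazy val logParser: LogParser = {
        val settings = new Settings()
        new LogParser(settings)
      }
    }

    // Each executor JVM builds its own parser; tasks just reference it.
    rdd.map(line => LogParserWrapper.logParser.parse(line))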

Re: KMeans Input Format

2014-08-09 Thread AlexanderRiggers
Thank you for your help. After restructuring my code per Sean's input, it worked without changing the Spark context. I then took the same file format, just a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge instances and Spark 1.0.2. Unluckily, my task freezes again after a short time. I tried it
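For context, a minimal MLlib KMeans pipeline of the shape under discussion; the S3 path and the feature format (whitespace-separated doubles per line) are assumptions, since the thread doesn't show the actual file:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical path; the thread's real input is a 2.7 GB file on S3.
    val data = sc.textFile("s3n://my-bucket/kmeans-input")
    val points = data
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache() // KMeans is iterative; caching avoids re-reading from S3

    val model = KMeans.train(points, 10, 20) // k = 10, maxIterations = 20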

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-09 Thread Kevin James Matzen
I have a related question. With Hadoop, I would do the same thing for non-serializable objects and setup(). I also had a use case where it was so expensive to initialize the non-serializable object that I would make it a static member of the mapper, turn on JVM reuse across tasks, and then

How to read zip files from HDFS into spark-shell using scala

2014-08-09 Thread Alton Alexander
I've tried uploading a zip file that contains a CSV to HDFS and then reading it into Spark using spark-shell, and the first line is all messed up. However, when I upload a gzip to HDFS and then read it into Spark, it does just fine. See output below. Is there a way to read a zip file as-is from HDFS in
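Hadoop's codec layer decompresses .gz transparently, but .zip is an archive format it treats as opaque bytes, which explains the garbage first line. One way to unpack zip entries manually is sketched below; it assumes Spark 1.2+ (sc.binaryFiles postdates the 1.0.x era of this thread), and the path is hypothetical:

    import java.util.zip.ZipInputStream
    import scala.io.Source

    val lines = sc.binaryFiles("hdfs:///data/archives/*.zip").flatMap {
      case (path, stream) =>
        val zis = new ZipInputStream(stream.open())
        // ZipInputStream.read() returns -1 at the end of the current entry,
        // so each entry can be consumed as its own stream of lines.
        Iterator.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap(_ => Source.fromInputStream(zis).getLines().toList)
    }
    lines.take(5).foreach(println)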

Re: No space left on device

2014-08-09 Thread Matei Zaharia
Your map-only job should not be shuffling, but if you want to see what's running, look at the web UI at http://driver:4040. In fact the job should not even write stuff to disk except inasmuch as the Hadoop S3 library might build up blocks locally before sending them on. My guess is that it's

feature space search

2014-08-09 Thread filipus
I am wondering if I can use Spark to search for interesting features/attributes for modelling. In fact, I just came from some introductory pages about Vowpal Wabbit. I somehow like the idea of out-of-core modelling. Well, I have transactional data where customers purchased

Re: No space left on device

2014-08-09 Thread Jim Donahue
Root partitions on AWS instances tend to be small (for example, an m1.large instance has two 420 GB drives, but only a 10 GB root partition). Matei's probably right about this - just need to be careful where things like the logs get stored.

Overriding dstream window definition

2014-08-09 Thread Ruchir Jha
Hi, I intend to use the same Spark Streaming program for both real-time and batch processing of my time-stamped data. However, with batch processing, all window-based operations would be meaningless because (I assume) the window is defined by the arrival times of the data, and it is not possible to
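For reference, a minimal sketch showing why DStream windows are arrival-time based: window() groups whatever arrived during the last N batch intervals, regardless of any timestamp inside the records. Host, port, and interval sizes are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10)) // 10 s batch interval
    val events = ssc.socketTextStream("localhost", 9999)
    // Window = the last 60 s of *arrival* time, sliding every 20 s.
    events.window(Seconds(60), Seconds(20)).count().print()
    ssc.start()
    ssc.awaitTermination()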

Spark SQL JSON dataset query nested datastructures

2014-08-09 Thread Sathish Kumaran Vairavelu
I have a simple JSON dataset as below. How do I query all parts.lock for id = 1? JSON: { "id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"], "parts": [ { "lock": "One lock", "key": "single key" }, { "lock": "2 lock", "key": "2 key" } ] } Query: select id, name, price, parts.lock from product where
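One way to reach fields inside an array of structs is to explode the array first. A sketch, assuming Spark 1.1+ (SQLContext.jsonRDD) and a HiveContext so that LATERAL VIEW explode is available:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    val json = sc.parallelize(Seq(
      """{"id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"], "parts": [{"lock": "One lock", "key": "single key"}, {"lock": "2 lock", "key": "2 key"}]}"""))
    hiveCtx.jsonRDD(json).registerTempTable("product")

    // explode turns each element of parts into its own row, exposing p.lock.
    hiveCtx.sql("""SELECT id, name, price, p.lock
                   FROM product LATERAL VIEW explode(parts) t AS p
                   WHERE id = 1""").collect().foreach(println)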