Re: Rule Engine for Spark

2015-11-04 Thread Daniel Mahler
I am not familiar with any rule engines on Spark Streaming, or even plain Spark. Conceptually, the closest things I am aware of are Datomic and Bloom-lang. Neither of them is Spark based, but they implement Datalog-like languages over distributed stores. - http://www.datomic.com/ -

Error using json4s with Apache Spark in spark-shell

2015-06-11 Thread Daniel Mahler
the extraction to work when distributed to the workers? Wesley Miao has reproduced the problem and found that it is specific to spark-shell. He reports that this code works as a standalone application. thanks Daniel Mahler. I am trying to use the case class extraction feature
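
A minimal sketch of the kind of case class extraction being discussed, assuming a hypothetical Person schema and an inpath placeholder; defining the implicit Formats inside the closure is one commonly suggested way to sidestep closure-serialization quirks in spark-shell:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    case class Person(name: String, age: Int)    // hypothetical schema, not from the thread

    val lines = sc.textFile(inpath)              // inpath is a placeholder
    val people = lines.map { line =>
      implicit val formats = DefaultFormats      // defined inside the closure, not at top level
      parse(line).extract[Person]
    }
    people.take(5).foreach(println)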

tachyon on machines launched with spark-ec2 scripts

2015-04-24 Thread Daniel Mahler
I have a cluster launched with spark-ec2. I can see a TachyonMaster process running, but I do not seem to be able to use Tachyon from the spark-shell. If I try rdd.saveAsTextFile("tachyon://localhost:19998/path") I get 15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID
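
A sketch of the write being attempted, assuming the Tachyon Hadoop filesystem class of that era and that the TachyonMaster's hostname (rather than localhost) belongs in the URI; the class name and hostname below are assumptions:

    // Register the Tachyon filesystem implementation with the Hadoop configuration (assumed class name)
    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

    // Use the TachyonMaster's address from the spark-ec2 launch instead of localhost
    val tachyonMaster = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"   // placeholder
    rdd.saveAsTextFile(s"tachyon://${tachyonMaster}:19998/path")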

Re: problem writing to s3

2015-04-23 Thread Daniel Mahler
, Apr 23, 2015 at 12:11 AM, Daniel Mahler dmah...@gmail.com wrote: Hi Akhil, It works fine when outprefix is an hdfs:///localhost/... URL. It looks to me as if there is something about Spark writing to the same s3 bucket it is reading from. That is the only real difference between the 2
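
A sketch of the isolation test implied by this observation, assuming hypothetical bucket names: read from one S3 bucket and write to a different one, so the read and the write never touch the same bucket:

    val inpath    = "s3n://source-bucket/logs/*"            // hypothetical paths
    val outprefix = "s3n://separate-target-bucket/out"

    val txt = sc.textFile(inpath)
    txt.saveAsTextFile(s"${outprefix}-json-raw")            // write lands in a different bucket than the read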

spark-ec2 s3a filesystem support and hadoop versions

2015-04-22 Thread Daniel Mahler
I would like to easily launch a cluster that supports s3a file systems. If I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what determines the minor version of Hadoop? Does it depend on the Spark version being launched? Are there other allowed values for --hadoop-major-version
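
A sketch of the s3a setup that typically also has to be in place once the cluster is running, assuming a Hadoop 2.6+ build with the hadoop-aws jar on the classpath; the credentials and path are placeholders:

    // Point the Hadoop configuration at the S3A filesystem implementation and credentials
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val rdd = sc.textFile("s3a://some-bucket/some/path/*")  // hypothetical path
    rdd.count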

problem writing to s3

2015-04-21 Thread Daniel Mahler
I am having a strange problem writing to s3 that I have distilled to this minimal example:
def jsonRaw = s"${outprefix}-json-raw"
def jsonClean = s"${outprefix}-json-clean"
val txt = sc.textFile(inpath) //.coalesce(shards, false)
txt.count
val res = txt.saveAsTextFile(jsonRaw)
val txt2 =
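
Reconstructed below as a sketch only, since the archived message is truncated: the shape of the pipeline appears to be a raw copy followed by a cleaned re-save, with the actual cleaning step not visible here, so an identity pass stands in for it:

    def jsonRaw   = s"${outprefix}-json-raw"
    def jsonClean = s"${outprefix}-json-clean"

    val txt = sc.textFile(inpath)          // optionally .coalesce(shards, false)
    txt.count
    txt.saveAsTextFile(jsonRaw)

    // Read the raw copy back and write it out again; the real cleaning logic is not in the snippet
    val txt2 = sc.textFile(jsonRaw)
    txt2.saveAsTextFile(jsonClean)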

HiveContext vs SQLContext

2015-04-20 Thread Daniel Mahler
Is HiveContext still preferred over SQLContext? What are the current (1.3.1) differences between them? thanks Daniel
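
For reference, a short sketch of how the two contexts are constructed in 1.3.x; HiveContext extends SQLContext, so it accepts everything SQLContext does while adding the HiveQL parser, Hive UDFs, and metastore access (general guidance from the Spark documentation of that era, not a quote from this thread):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext  = new SQLContext(sc)    // plain SQL dialect, no Hive dependency
    val hiveContext = new HiveContext(sc)   // superset: HiveQL, Hive UDFs, metastore-backed tables

    // Both expose the same DataFrame API in 1.3.1, e.g. (path is a placeholder):
    val df = hiveContext.jsonFile("path/to/logs.json")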

Re: Problem getting program to run on 15TB input

2015-04-13 Thread Daniel Mahler
Sometimes a large number of partitions leads to memory problems. Something like
val rdd1 = sc.textFile(file1).coalesce(500). ...
val rdd2 = sc.textFile(file2).coalesce(500). ...
may help. On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra arun.lut...@gmail.com wrote: Everything works smoothly if I

Cleaning/transforming json befor converting to SchemaRDD

2014-11-03 Thread Daniel Mahler
I am trying to convert terabytes of json log files into parquet files, but I need to clean them a little first. I end up doing the following:
val txt = sc.textFile(inpath).coalesce(800)
val json = (for {
  line <- txt
  JObject(child) = parse(line)
  child2 = (for {
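
A minimal sketch of the overall approach, assuming json4s for the per-line cleanup and re-serializing the cleaned records to JSON strings before handing them to jsonRDD; the null-dropping step is a hypothetical stand-in for the real transformation, and inpath/outpath are placeholders:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    val txt = sc.textFile(inpath).coalesce(800)

    val cleaned = txt.map { line =>
      val json = parse(line)
      // hypothetical cleanup: drop null fields; the real cleaning logic is more involved
      val fixed = json.removeField { case (_, JNull) => true; case _ => false }
      compact(render(fixed))
    }

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val schemaRdd = sqlContext.jsonRDD(cleaned)
    schemaRdd.saveAsParquetFile(outpath)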

union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
I would like to combine 2 parquet tables I have created. I tried: sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB)) but that just returns RDD[Row]. How do I combine them to get a SchemaRDD[Row]? thanks Daniel

Re: union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
that keeps the schema on the results. Matei. On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote: I would like to combine 2 parquet tables I have created. I tried: sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB)) but that just returns RDD[Row]. How do
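
The method being referred to appears to be unionAll on SchemaRDD, which returns another SchemaRDD rather than a plain RDD[Row]; a sketch, keeping the fileA/fileB placeholders from the question:

    val a = sqx.parquetFile(fileA)
    val b = sqx.parquetFile(fileB)

    // unionAll is defined on SchemaRDD itself, so the schema is preserved
    val combined = a.unionAll(b)
    combined.registerTempTable("combined")   // optional, if the result needs to be queried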

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-10-30 Thread Daniel Mahler
or hdfs-site.xml file under /root/ephemeral-hdfs/etc/hadoop/, where you can see the data node dir property, which will be a comma-separated list of volumes. Thanks Best Regards On Thu, Oct 30, 2014 at 5:21 AM, Daniel Mahler dmah...@gmail.com wrote: I started my ec2 spark cluster with ./ec2

use additional ebs volumes for hdfs storage with spark-ec2

2014-10-29 Thread Daniel Mahler
I started my ec2 spark cluster with ./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10 launch mycluster. I see the additional volumes attached but they do not seem to be set up for hdfs. How can I check if they are being utilized on all workers, and how can I get all workers

Fwd: Saving very large data sets as Parquet on S3

2014-10-24 Thread Daniel Mahler
I am trying to convert some json logs to Parquet and save them on S3. In principle this is just:
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path")
This

Saving very large data sets as Parquet on S3

2014-10-20 Thread Daniel Mahler
I am trying to convert some json logs to Parquet and save them on S3. In principle this is just:
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path")
This
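
A sketch of one commonly suggested variation, assuming the slowness comes from writing many small files straight to s3n: fix the partition count before the Parquet write, or stage the output on HDFS and copy it to S3 afterwards; the partition count and paths are placeholders, and exact SchemaRDD method availability depends on the Spark version:

    import org.apache.spark._
    val sqlContext = new sql.SQLContext(sc)

    val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
    data.registerAsTable("data")

    // Fewer, larger output files tend to behave better on S3 than thousands of tiny ones
    val repartitioned = data.repartition(256)              // placeholder partition count

    // Option 1: write directly to S3
    repartitioned.saveAsParquetFile("s3n://target/path")

    // Option 2: write to HDFS first, then copy to S3 outside Spark (e.g. hadoop distcp)
    // repartitioned.saveAsParquetFile("hdfs:///tmp/target-parquet")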

Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures spark to use the available resources. I can see that spark will use the available memory on larger instance types. However I have never seen spark running at more than 400% (using 100% on 4 cores) on

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
you report on how many partitions your RDDs have? On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures spark to use the available resources. I can see that spark will use

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I launch the cluster using vanilla spark-ec2 scripts. I just specify the number of slaves and instance type. On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote: I usually run interactively from the spark-shell. My data definitely has more than enough partitions to keep all

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
by the input format. Nick On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote: Hi Nicholas, Gzipping is an impressive guess! Yes, they are. My data sets are too large to make repartitioning viable, but I could try it on a subset. I generally have many more partitions
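
A sketch of the gzip point, with hypothetical paths and counts: a .gz file is not splittable, so each file arrives as a single partition no matter how big the cluster is, and an explicit repartition after reading is the usual way to spread the work across all cores:

    // Each gzipped file yields exactly one partition, so 4 files => 4 tasks => 4 busy cores
    val raw = sc.textFile("s3n://bucket/logs/*.gz")        // hypothetical path
    println(raw.partitions.length)

    // Redistribute across the cluster; 2-3x the total core count is a common starting point
    val spread = raw.repartition(8 * 4 * 3)                // placeholder: slaves * cores-per-slave * 3
    spread.count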

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
-release-engineering/ On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.com wrote: I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that did not need the tree diff. So it seemed like having the slaves do the copying among

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote: I agree that for updating, rsync is probably preferable, and it seems like for that purpose it would also parallelize well, since most of the time is spent computing checksums, so the process is not constrained by the total

sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am launching a rather large cluster on ec2. It seems like the launch is taking forever on: Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there is a lot of not

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
, but we do want to minimize the complexity of our standard ec2 launch scripts to reduce the chance of something breaking. On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching a rather large cluster on ec2. It seems like the launch is taking forever

Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Daniel Mahler
I have had a lot of success with Spark on large datasets, both in terms of performance and flexibility. However, I hit a wall with reduceByKey when the RDD contains billions of items. I am reducing with simple functions like addition for building histograms, so the reduction process should be
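
A minimal sketch of the histogram-style reduction described, with a hypothetical key extractor; reduceByKey accepts an explicit partition count, which is one of the first knobs to turn when the default number of reduce-side partitions is too small for billions of items:

    // Hypothetical: count occurrences per key over a very large text input
    val records = sc.textFile(inpath)                      // inpath is a placeholder
    val counts = records
      .map(line => (line.split("\t")(0), 1L))              // placeholder key extraction
      .reduceByKey(_ + _, 2048)                            // explicit number of reduce partitions

    counts.saveAsTextFile(outpath)                         // outpath is a placeholder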