I am not familiar with any rule engines on Spark Streaming, or even plain
Spark.
The conceptually closest things I am aware of are Datomic and Bloom-lang.
Neither of them is Spark-based, but they implement Datalog-like languages
over distributed stores.
- http://www.datomic.com/
-
the extraction to work when distributed to the workers?
Wesley Miao has reproduced the problem and found that it is specific to
spark-shell. He reports that this code works as a standalone application.
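For reference, a minimal sketch of the standalone-app workaround, assuming
the thread is about json4s case class extraction (the class, field, and app
names here are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Defining the case class at the top level of a compiled application
// avoids the spark-shell-specific failure described above.
case class Event(id: String, value: Double)

object ExtractApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("extract"))
    val events = sc.textFile(args(0)).map { line =>
      // create Formats inside the closure so nothing non-serializable is captured
      implicit val formats: Formats = DefaultFormats
      parse(line).extract[Event]
    }
    println(events.count())
    sc.stop()
  }
}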
thanks
Daniel Mahler
I am trying to use the case class extraction feature
I have a cluster launched with spark-ec2.
I can see a TachyonMaster process running,
but I do not seem to be able to use tachyon from the spark-shell.
if I try
rdd.saveAsTextFile("tachyon://localhost:19998/path")
I get
15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID
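One guess, sketched below: workers resolve localhost to themselves, so the
Tachyon URI may need to point at the host running TachyonMaster instead (the
hostname here is a placeholder):

rdd.saveAsTextFile("tachyon://<tachyon-master-host>:19998/path")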
On Apr 23, 2015 at 12:11 AM, Daniel Mahler dmah...@gmail.com wrote:
Hi Akhil,
It works fine when outprefix is an hdfs:///localhost/... URL.
It looks to me as if there is something about Spark writing to the same
S3 bucket it is reading from.
That is the only real difference between the two.
I would like to easily launch a cluster that supports s3a file systems.
if I launch a cluster with `spark-ec2 --hadoop-major-version=2`,
what determines the minor version of hadoop?
Does it depend on the spark version being launched?
Are there other allowed values for --hadoop-major-version?
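For what it is worth, once a cluster whose Hadoop libraries include s3a
support (hadoop-aws 2.6+) is up, usage is roughly the sketch below; the
property names are the standard s3a credential keys, and the bucket path is
illustrative:

sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
val rdd = sc.textFile("s3a://some-bucket/path")  // s3a:// instead of s3n://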
I am having a strange problem writing to s3 that I have distilled to this
minimal example:
def jsonRaw = s"${outprefix}-json-raw"
def jsonClean = s"${outprefix}-json-clean"
val txt = sc.textFile(inpath)//.coalesce(shards, false)
txt.count
val res = txt.saveAsTextFile(jsonRaw)
val txt2 =
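(The example is cut off here; judging by the jsonRaw/jsonClean definitions
above, the continuation presumably reads jsonRaw back and cleans it, e.g. the
purely hypothetical line below.)

val txt2 = sc.textFile(jsonRaw)  // hypothetical continuation of the example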
Is HiveContext still preferred over SQLContext?
What are the current (1.3.1) differences between them?
thanks
Daniel
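For reference, a minimal sketch of constructing both in 1.3.1; HiveContext is
a superset of SQLContext and does not require an existing Hive installation:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new SQLContext(sc)    // basic SQL support
val hiveContext = new HiveContext(sc)  // adds HiveQL, Hive UDFs, and metastore access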
Sometimes a large number of partitions leads to memory problems.
Something like
val rdd1 = sc.textFile(file1).coalesce(500). ...
val rdd2 = sc.textFile(file2).coalesce(500). ...
may help.
On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra arun.lut...@gmail.com wrote:
Everything works smoothly if I
I am trying to convert terabytes of JSON log files into Parquet files,
but I need to clean them a little first.
I end up doing the following:
txt = sc.textFile(inpath).coalesce(800)
val json = (for {
  line <- txt
  JObject(child) = parse(line)
  child2 = (for {
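The snippet is cut off above; a self-contained sketch of this style of json4s
cleaning pass, with purely hypothetical cleanup rules, might look like:

import org.json4s._
import org.json4s.jackson.JsonMethods.{parse, render, compact}

val txt = sc.textFile(inpath).coalesce(800)
val json = txt.map { line =>
  val JObject(child) = parse(line)
  // hypothetical cleanup: trim string values and drop null fields
  val child2 = child.collect {
    case JField(k, JString(v)) => JField(k, JString(v.trim))
    case JField(k, v) if v != JNull => JField(k, v)
  }
  compact(render(JObject(child2)))
}
json.saveAsTextFile(outpath)  // outpath is assumed, as in the original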
I would like to combine 2 parquet tables I have created.
I tried:
sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))
but that just returns RDD[Row].
How do I combine them to get a SchemaRDD?
thanks
Daniel
Try SchemaRDD's unionAll method, which keeps the
schema on the results.
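A sketch using the names from the question, assuming the Spark 1.x SchemaRDD
API:

val combined = sqx.parquetFile(fileA).unionAll(sqx.parquetFile(fileB))
// unionAll returns a SchemaRDD, so the schema is preserved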
Matei
On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote:
I would like to combine 2 parquet tables I have created.
I tried:
sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))
but that just returns RDD[Row].
How do
or the hdfs-site.xml file under
/root/ephemeral-hdfs/etc/hadoop/, where you can see the data node dir
property, which will be a comma-separated list of volumes.
Thanks
Best Regards
On Thu, Oct 30, 2014 at 5:21 AM, Daniel Mahler dmah...@gmail.com wrote:
I started my ec2 spark cluster with
./ec2
I started my ec2 spark cluster with
./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10
launch mycluster
I see the additional volumes attached, but they do not seem to be set up
for HDFS.
How can I check if they are being utilized on all workers,
and how can I get all workers
I am trying to convert some json logs to Parquet and save them on S3.
In principle this is just
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path")
This
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures spark to use the available
resources.
I can see that Spark will use the available memory on larger instance types.
However, I have never seen Spark running at more than 400% (using 100% on 4
cores)
on
Can you report on how many partitions your RDDs
have?
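A quick way to check, sketched here assuming an RDD named rdd:

println(rdd.partitions.size)  // parallelism is capped by the partition count
val more = rdd.repartition(sc.defaultParallelism * 3)  // multiplier is illustrative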
On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler dmah...@gmail.com wrote:
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures spark to use the available
resources.
I can see that spark will use
I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and the instance type.
On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all
by the input format.
Nick
On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler dmah...@gmail.com wrote:
Hi Nicholas,
Gzipping is an impressive guess! Yes, they are.
My data sets are too large to make repartitioning viable, but I could try
it on a subset.
I generally have many more partitions
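Since gzip files are not splittable, each .gz input file becomes a single
partition; a common workaround, sketched with an illustrative path and count,
is to repartition right after reading:

val txt = sc.textFile("s3n://bucket/logs/*.gz").repartition(800)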
-release-engineering/
On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler dmah...@gmail.com wrote:
I am not an expert in this space either. I thought the initial rsync
during launch is really just a straight copy that does not need the tree
diff, so it seemed like having the slaves do the copying among
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote:
I agree that for updating, rsync is probably preferable, and it seems like
for that purpose it would also parallelize well, since most of the time is
spent computing checksums, so the process is not constrained by the total
I am launching a rather large cluster on ec2.
It seems like the launch is taking forever on
Setting up spark
RSYNC'ing /root/spark to slaves...
...
It seems that BitTorrent might be a faster way to replicate
the sizeable Spark directory to the slaves,
particularly if there is a lot of not
, but we
do want to minimize the complexity of our standard ec2 launch scripts to
reduce the chance of something breaking.
On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:
I am launching a rather large cluster on ec2.
It seems like the launch is taking forever
I have had a lot of success with Spark on large datasets,
both in terms of performance and flexibility.
However, I hit a wall with reduceByKey when the RDD contains billions of
items.
I am reducing with simple functions like addition for building histograms,
so the reduction process should be
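One knob worth trying, sketched below: reduceByKey accepts an explicit
partition count, so the shuffle can be spread over more, smaller reduce tasks
(the RDD name and count are hypothetical):

val hist = pairs.reduceByKey(_ + _, 4096)  // pairs: RDD[(K, Long)]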