Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Frank Austin Nothaft
wrong. This is especially true if the people you think are wrong are actually correct. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 > On Aug 2, 2017, at 6:25 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > > Hi, > > I am definitely sure

Re: real world spark code

2017-07-25 Thread Frank Austin Nothaft
<-- core is Java https://github.com/hail-is/hail <-- core is Scala, mostly used through Python wrappers neuroscience: https://github.com/thunder-project/thunder#using-with-spark <--

Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, we use conductor <https://github.com/BD2KGenomics/conductor> with IAM roles, .boto/.aws/credentials files. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 > On Mar 15, 20

Re: which database for gene alignment data ?

2015-06-10 Thread Frank Austin Nothaft
if we support that right now for feature data; those are fairly new. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 9, 2015, at 9:21 PM, roni roni.epi...@gmail.com wrote: Hi Frank, Thanks for the reply. I downloaded ADAM and built

Re: which database for gene alignment data ?

2015-06-08 Thread Frank Austin Nothaft
with ADAMContext.loadFeatures. We have two tools for the overlap computation: you can use a BroadcastRegionJoin if one of the datasets you want to overlap is small or a ShuffleRegionJoin if both datasets are large. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 8
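
A rough illustration of the broadcast-style overlap join mentioned above, written in plain Python without Spark. This is not ADAM's implementation: `Region`, `overlaps`, and `broadcast_region_join` are hypothetical names, and the half-open coordinate convention is an assumption. The idea is that the small dataset is indexed once and every record of the large dataset is checked against that index, so the large side never needs to be shuffled.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Region(NamedTuple):
    """Hypothetical genomic interval; start is inclusive, end is exclusive."""
    contig: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    # Two half-open intervals overlap when each starts before the other ends.
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def broadcast_region_join(
    small: Iterable[Region], large: Iterable[Region]
) -> List[Tuple[Region, Region]]:
    # Index the small side by contig. On a cluster, this index is the piece
    # that would be copied ("broadcast") to every worker.
    index: Dict[str, List[Region]] = {}
    for region in small:
        index.setdefault(region.contig, []).append(region)

    # Stream the large side past the index and keep the overlapping pairs.
    return [
        (s, l)
        for l in large
        for s in index.get(l.contig, [])
        if overlaps(s, l)
    ]


genes = [Region("chr1", 100, 200), Region("chr2", 50, 80)]
reads = [Region("chr1", 150, 160), Region("chr1", 200, 250), Region("chr2", 70, 90)]
pairs = broadcast_region_join(genes, reads)
```

When both datasets are large, the index no longer fits on each worker, which is the case where a shuffle-based join (partitioning both sides by genomic position first) is the usual alternative.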

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Frank Austin Nothaft
You’ll definitely want to use a Kryo-based serializer for Avro. We have a Kryo based serializer that wraps the Avro efficient serializer here. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Apr 3, 2015, at 5:41 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Worker and Nodes

2015-02-21 Thread Frank Austin Nothaft
with more resources (memory capacity and bandwidth, and disk bandwidth). When you increase the number of tasks executing on a single node, you do not increase the pool of available resources. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Feb 21, 2015, at 4:11

Re: K-Means final cluster centers

2015-02-05 Thread Frank Austin Nothaft
Unless I misunderstood your question, you’re looking for the val clusterCenters in http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel, no? Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Feb 5, 2015, at 2

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Frank Austin Nothaft
, I would be glad to port it for the Spark core. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 22, 2015, at 7:11 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: Thanks Frank for your response. So, creating a custom InputFormat

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-21 Thread Frank Austin Nothaft
on in that project, so I’m not sure if it is the cleanest way, but it is a workable way. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jan 21, 2015, at 9:17 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: I am trying to solve similar problem. I am

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Frank Austin Nothaft
Shailesh, To add, are you packaging Hadoop in your app? Hadoop will pull in Guava. Not sure if you are using Maven (or what) to build, but if you can pull up your build's dependency tree, you will likely find com.google.guava being brought in by one of your dependencies. Regards, Frank Austin

Re: AVRO specific records

2014-11-05 Thread Frank Austin Nothaft
an Avro HadoopInputFormat. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Nov 5, 2014, at 1:25 PM, Simone Franzini captainfr...@gmail.com wrote: How can I read/write AVRO specific records? I found several snippets using generic records, but nothing

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread Frank Austin Nothaft
) will be added in Spark 1.0.1 (which is currently available, BTW). Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Sep 26, 2014, at 7:38 AM, matthes mdiekst...@sensenetworks.com wrote: Thank you Jey, That is a nice introduction but it is maybe too old

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread Frank Austin Nothaft
Matthes, Ah, gotcha! Repeated items in Parquet seem to correspond to the ArrayType in Spark-SQL. I only use Spark, but it does look like that should be supported in Spark-SQL 1.1.0. I’m not sure though if you can apply predicates on repeated items from Spark-SQL. Regards, Frank Austin

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-19 Thread Frank Austin Nothaft
Hi Mohan, It’s a bit convoluted to follow in their source, but they essentially typedef KSerializer as being a KryoSerializer, and then their serializers all extend KSerializer. Spark should identify them properly as Kryo Serializers, but I haven’t tried it myself. Regards, Frank Austin

Re: File I/O in spark

2014-09-15 Thread Frank Austin Nothaft
for local file access. This is used to implement the rdd.pipe method (IIRC), and we use it in some downstream apps to do IO with processes that we spawn from mapPartitions calls (see here and here). Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466
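
A minimal sketch of the pattern described above: feeding the records of one partition to a spawned process and reading its output back. It uses only the Python standard library and is not how Spark implements rdd.pipe. `pipe_partition` and `upper_cmd` are hypothetical names, and the child process here is just a second Python interpreter that uppercases its input.

```python
import subprocess
import sys
from typing import Iterable, Iterator, List


def pipe_partition(records: Iterable[str], command: List[str]) -> Iterator[str]:
    """Write one record per line to the child's stdin; yield its stdout lines."""
    payload = "".join(record + "\n" for record in records)
    # Assumption: the partition is small enough to buffer in memory. A
    # streaming version would write to stdin and read stdout concurrently.
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return iter(result.stdout.splitlines())


# Stand-in for an external tool: a child interpreter that uppercases each line.
upper_cmd = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
piped = list(pipe_partition(["acgt", "ttga"], upper_cmd))
```

In a Spark job, a function shaped like `pipe_partition` would be what gets passed to a mapPartitions call, with the command replaced by the real external tool.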

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Frank Austin Nothaft
, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Sep 8, 2014, at 12:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh

Re: multiple passes in mapPartitions

2014-07-01 Thread Frank Austin Nothaft
Hi Zhen, The Scala iterator trait supports cloning via the duplicate method (http://www.scala-lang.org/api/current/index.html#scala.collection.Iterator@duplicate:(Iterator[A],Iterator[A])). Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 13
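
For comparison, a short Python sketch of the same idea, with `itertools.tee` standing in for Scala's duplicate method (a swapped-in analogue, not the Scala API itself). `mean_and_centered` is a hypothetical two-pass function of the kind one might run inside a mapPartitions call.

```python
from itertools import tee
from typing import Iterator, List, Tuple


def mean_and_centered(partition: Iterator[float]) -> Tuple[float, List[float]]:
    # Split the single-use iterator into two independent iterators.
    first_pass, second_pass = tee(partition)

    # Pass 1: compute the mean.
    count = 0
    total = 0.0
    for value in first_pass:
        count += 1
        total += value
    mean = total / count if count else 0.0

    # Pass 2: subtract the mean from every element.
    return mean, [value - mean for value in second_pass]


mean, centered = mean_and_centered(iter([1.0, 2.0, 3.0]))
```

Note that `tee` buffers whatever one copy has consumed and the other has not, so a complete first pass holds the whole partition in memory. My understanding is that Scala's duplicate behaves similarly, so neither avoids materializing the data when the passes run back to back.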

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread FRANK AUSTIN NOTHAFT
Robert, You can build a Spark application using Maven for Hadoop 2 by adding a dependency on the Hadoop 2.* hadoop-client package. If you define any Hadoop Input/Output formats, you may also need to depend on the hadoop-mapreduce package. Regards, Frank Austin Nothaft fnoth...@berkeley.edu

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Frank Austin Nothaft
Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 29, 2014, at 4:20 PM, Robert James srobertja...@gmail.com wrote: On 6/29/14, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote: Robert, You can build a Spark application using Maven for Hadoop 2 by adding a dependency

Re: initial basic question from new user

2014-06-12 Thread FRANK AUSTIN NOTHAFT
/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala, starting at line 62. There is a bit of setup necessary for the Parquet write codec, but otherwise it is fairly straightforward. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202

Re: Spark-ec2 asks for password

2014-04-18 Thread FRANK AUSTIN NOTHAFT
request; I haven't seen that on my end. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Fri, Apr 18, 2014 at 8:57 PM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Since 0.9.0 spark-ec2 has gone unstable. During launch it throws many errors

Re: Avro serialization

2014-04-03 Thread FRANK AUSTIN NOTHAFT
about this at http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Thu, Apr 3, 2014 at 7:16 AM, Ian O'Connell i...@ianoconnell.com wrote: Objects being transformed need to be one of these in flight