Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-06 Thread Sean Owen
That's a Hive version issue, not a Hadoop version issue. On Tue, Oct 7, 2014 at 7:21 AM, Li HM wrote: > Thanks for the reply. > > Please refer to my other post entitled "How to make ./bin/spark-sql > work with hive". It has all the errors/exceptions I am getting. > > If I understand you correctly …

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-06 Thread Li HM
Thanks for the reply. Please refer to my other post entitled "How to make ./bin/spark-sql work with hive". It has all the errors/exceptions I am getting. If I understand you correctly, I can build the package with mvn -Phive,hadoop-2.4 -Dhadoop.version=2.5.0 clean package This is what I actually …

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-06 Thread Sean Owen
The hadoop-2.4 profile is really intended to be "Hadoop 2.4+". It should compile and run fine with Hadoop 2.5 as far as I know. CDH 5.2 is Hadoop 2.5 + Spark 1.1, so there is evidence it works. You didn't say what doesn't work. On Tue, Oct 7, 2014 at 6:07 AM, hmxxyy wrote: > Does Spark 1.1.0 work

Spark 1.1.0 with Hadoop 2.5.0

2014-10-06 Thread hmxxyy
Does Spark 1.1.0 work with Hadoop 2.5.0? The Maven build instructions only list profile options up to hadoop-2.4. Has anybody made it work? I am trying to run spark-sql with Hive 0.12 on top of Hadoop 2.5.0 but can't make it work. -- View this message in context: http://apache-spark-user-lis

Re: Larger heap leads to perf degradation due to GC

2014-10-06 Thread Otis Gospodnetic
Hi, The other option to consider is using G1 GC, which should behave better with large heaps. But pointers are not compressed in heaps > 32 GB in size, so you may be better off staying under 32 GB. Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearc
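As a rough illustration of the advice above (a sketch only; the exact memory value is a placeholder, not a recommendation from this thread), the executor heap and GC flags can be set through the Spark configuration, keeping the heap under ~32 GB so compressed oops stay enabled:

    import org.apache.spark.SparkConf

    // Sketch: heap below ~32 GB keeps compressed oops; G1 GC is opted into
    // via extra JVM options on the executors.
    val conf = new SparkConf()
      .setAppName("gc-tuning-sketch")
      .set("spark.executor.memory", "30g")
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")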

performance comparison: join vs cogroup?

2014-10-06 Thread freedafeng
For two large key-value data sets, if they have the same set of keys, what is the fastest way to join them into one? Suppose all keys are unique in each data set, and we only care about the keys that appear in both data sets. Input data I have: (k, v1) and (k, v2). Data I want to get from the …
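A minimal Scala sketch of the two operations being compared (keys and values are invented; sc is assumed to be a spark-shell SparkContext). With unique keys on both sides, join directly yields the (k, (v1, v2)) pairs described above, while cogroup yields grouped iterables that still need flattening:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))      // (k, v1)
    val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))  // (k, v2)

    // Inner join keeps only keys present in both datasets: (k, (v1, v2))
    val joined = left.join(right)

    // cogroup returns (k, (Iterable[v1], Iterable[v2])); with unique keys it
    // must still be flattened to reach the same shape as join.
    val cogrouped = left.cogroup(right).flatMap { case (k, (vs1, vs2)) =>
      for (v1 <- vs1; v2 <- vs2) yield (k, (v1, v2))
    }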

Re: Larger heap leads to perf degradation due to GC

2014-10-06 Thread Mingyu Kim
Ok, cool. This seems to be general issues in JVM with very large heaps. I agree that the best workaround would be to keep the heap size below 32GB. Thanks guys! Mingyu From: Arun Ahuja Date: Monday, October 6, 2014 at 7:50 AM To: Andrew Ash Cc: Mingyu Kim , "user@spark.apache.org" , Dennis

Re: [ANN] SparkSQL support for Cassandra with Calliope

2014-10-06 Thread tian zhang
Rohit, Thank you very much for releasing the H2 version; my app now compiles fine and there are no more runtime errors wrt. Hadoop 1.x classes or interfaces. Tian On Saturday, October 4, 2014 9:47 AM, Rohit Rai wrote: Hi Tian, We have published a build against Hadoop 2.0 with version 1.1.0-CT

Re: How to make Spark-sql join using HashJoin

2014-10-06 Thread Liquan Pei
In Spark 1.0, outer joins are resolved to BroadcastNestedLoopJoin. You can use Spark 1.1, which resolves outer joins to hash joins. Hope this helps! Liquan On Mon, Oct 6, 2014 at 4:20 PM, Benyi Wang wrote: > I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in > Cloudera's maven reposit…
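For context, a minimal sketch of the kind of SQL join involved (table and column names are invented; orders and users stand for the two SchemaRDDs, and registerAsTable was the Spark 1.0 API, renamed registerTempTable in 1.1). An explicit equi-join condition is what lets the planner pick a hash-based join instead of a CartesianProduct:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Hypothetical SchemaRDDs: orders(id, userId), users(userId, name)
    orders.registerAsTable("orders")
    users.registerAsTable("users")

    val joined = sqlContext.sql(
      "SELECT o.id, u.name FROM orders o JOIN users u ON o.userId = u.userId")
    joined.queryExecution.executedPlan  // inspect which join strategy was chosen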

How to make Spark-sql join using HashJoin

2014-10-06 Thread Benyi Wang
I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in Cloudera's maven repository. After putting it on the classpath, I can use spark-sql in my application. One issue is that I couldn't make the join a hash join. It gives a CartesianProduct when I join two SchemaRDDs as follows: scala> …

Spark Standalone on EC2

2014-10-06 Thread Ankur Srivastava
Hi, I have started a Spark cluster on EC2 using the Spark Standalone cluster manager, but Spark is trying to identify the worker threads using hostnames that are not publicly accessible. So when I try to submit jobs from Eclipse it fails. Is there some way Spark can use IP addresses instead of…

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb is to definitely cache the RDD over which you need to do iterative analysis... For the rest of them, only cache if you have a lot of free memory! On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen wrote: > I think you mean that data2 is a function of data1 in the first > example. I ima…

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Sean Owen
I think you mean that data2 is a function of data1 in the first example. I imagine that the second version is a little bit more efficient, but it has nothing to do with memory or caching. You don't have to cache anything here if you don't want to. You can cache what you like. Once memory for the ca…

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
@Davies I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck on the master node. Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes to store the whole data set that needs to be parsed by extract_pages. I have my data on S3 and I kind of…

Re: Using FunSuite to test Spark throws NullPointerException

2014-10-06 Thread Mario Pastorelli
The problem was mine: I was using FunSuite in the wrong way. The test method of FunSuite registers a "test" to be triggered when the tests are run. My localTest was instead creating and stopping the SparkContext during test registration, and as a result my SparkContext is stopped when the t…
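For reference, a minimal sketch of the registration-time vs. run-time distinction described above (the test name and assertion are invented): the SparkContext is created and stopped inside the test body, so it is only alive while the registered test actually runs:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite {
      test("reduceByKey counts words") {   // body runs at test time, not at registration
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
        try {
          val counts = sc.parallelize(Seq("a", "b", "a"))
            .map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()
          assert(counts("a") === 2)
        } finally {
          sc.stop()                        // stop only after the body has finished
        }
      }
    }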

Re: Strategies for reading large numbers of files

2014-10-06 Thread Matei Zaharia
The problem is that listing the metadata for all these files in S3 takes a long time. Something you can try is the following: split your files into several non-overlapping paths (e.g. s3n://bucket/purchase/2014/01, s3n://bucket/purchase/2014/02, etc), then do sc.parallelize over a list of such
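One way to read the suggestion above, as a hedged sketch (the bucket layout is the example from the message, and Matei's full advice continues past the truncated preview): build one RDD per non-overlapping prefix on the driver and union them, so each prefix is listed independently:

    val prefixes = Seq(
      "s3n://bucket/purchase/2014/01",
      "s3n://bucket/purchase/2014/02")

    // Each prefix is listed on its own; the union behaves as a single RDD.
    val all = sc.union(prefixes.map(p => sc.textFile(p)))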

lazy evaluation of RDD transformation

2014-10-06 Thread anny9699
Hi, I see that this type of question has been asked before, but I'm still a little confused about it in practice. For example, there are two ways I could handle a series of RDD transformations before I do an RDD action; which way is faster? Way 1: val data = sc.textFile() val data1 = data.map(x => f1…
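A sketch of the two styles being compared (f1/f2 stand in for the poster's functions and are given placeholder bodies; the input path is invented). Because transformations are lazy, both build the same lineage and only the final action triggers work, so neither needs caching unless an intermediate RDD is reused:

    def f1(x: String): String = x.trim          // placeholder transformations
    def f2(x: String): Int    = x.length

    // Way 1: name each intermediate RDD
    val data   = sc.textFile("input.txt")       // placeholder path
    val data1  = data.map(x => f1(x))
    val data2  = data1.map(x => f2(x))
    val result = data2.count()                  // the action triggers the whole chain

    // Way 2: compose the functions in a single map
    val result2 = sc.textFile("input.txt").map(x => f2(f1(x))).count()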

Re: Spark Streaming saveAsNewAPIHadoopFiles

2014-10-06 Thread Abraham Jacob
Sean, Thanks a ton Sean... This is exactly what I was looking for. As mentioned in the code - // This horrible, separate declaration is necessary to appease the compiler @SuppressWarnings("unchecked") Class<OutputFormat<?,?>> outputFormatClass = (Class<OutputFormat<?,?>>) (Class<?>) SequenceFileOutputFormat.class; wri…

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Steve Lewis
Try a Hadoop custom InputFormat - I can give you some samples. While I have not tried this, an input split has only a length (which could be ignored if the format is treated as non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole thing should work. Hadoop InputFormat…

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
On Mon, Oct 6, 2014 at 1:08 PM, wrote: > Hi, > > > Thank you for your advice. It really might work, but to specify my problem a > bit more, think of my data more like one generated item is one parsed > wikipedia page. I am getting this generator from the parser and I don't want > to save it to th

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-06 Thread Peng Cheng
Any suggestions? I'm thinking of submitting a feature request for mutable broadcast. Is it doable? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p15807.html Sent from the Apache Spark User

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, Thank you for your advice. It really might work, but to specify my problem a bit more, think of my data as if one generated item were one parsed Wikipedia page. I am getting this generator from the parser and I don't want to save it to storage, but to directly apply parallelize and crea…

Re: Is RDD partition index consistent?

2014-10-06 Thread Liquan Pei
Hi, The partition information on the Spark master is not updated after a stage failure. In the case of HDFS, Spark gets partition information from the InputFormat, and if a data node in HDFS goes down while Spark is performing computation for a certain stage, that stage will fail and be resubmitted using the sam…

Re: Strategies for reading large numbers of files

2014-10-06 Thread Nicholas Chammas
Unfortunately not. Again, I wonder if adding support targeted at this "small files problem" would make sense for Spark core, as it is a common problem in our space. Right now, I don't know of any other options. Nick On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn wrote: > Nicholas, thanks for the

Is RDD partition index consistent?

2014-10-06 Thread Sung Hwan Chung
Is the RDD partition index you get when you call mapPartitionsWithIndex consistent under fault-tolerance conditions? I.e. 1. Say the index is 1 for one of the partitions when you call data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1 2. The partition fails (maybe along with a bunch o…
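For concreteness, a small sketch of the call being asked about (tagging rows with their partition index is invented for illustration; the input path is a placeholder). Per the reply earlier in the digest, the index comes from the partition metadata computed when the RDD is planned, so a recomputed partition keeps the same index:

    val data = sc.textFile("hdfs:///some/path")          // placeholder input
    val tagged = data.mapPartitionsWithIndex { (index, rows) =>
      rows.map(row => (index, row))                      // index is the partition's position
    }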

Re: Spark Streaming saveAsNewAPIHadoopFiles

2014-10-06 Thread Sean Owen
Here's an example: https://github.com/OryxProject/oryx/blob/master/oryx-lambda/src/main/java/com/cloudera/oryx/lambda/BatchLayer.java#L131 On Mon, Oct 6, 2014 at 7:39 PM, Abraham Jacob wrote: > Hi All, > > Would really appreciate from the community if anyone has implemented the > saveAsNewAPIHad

Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-06 Thread Bharat Venkat
TD has addressed this. It should be available in 1.2.0. https://issues.apache.org/jira/browse/SPARK-3495 On Thu, Oct 2, 2014 at 9:45 AM, maddenpj wrote: > I am seeing this same issue. Bumping for visibility. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.

combining python and java in a single Spark application

2014-10-06 Thread Yadid Ayzenberg
Hello, This may seem like a strange question, but is it possible to combine Python and Java code in a single application? I.e., I would like to create the SparkContext in Java and then perform RDD transformations in Python, or use parts of MLlib in Python and others in Java. One solution that…

Spark Streaming saveAsNewAPIHadoopFiles

2014-10-06 Thread Abraham Jacob
Hi All, Would really appreciate from the community if anyone has implemented the saveAsNewAPIHadoopFiles method in "Java" found in the org.apache.spark.streaming.api.java.JavaPairDStream Any code snippet or online link would be greatly appreciated. Regards, Jacob

Re: Strategies for reading large numbers of files

2014-10-06 Thread Landon Kuhn
Nicholas, thanks for the tip. Your suggestion certainly seemed like the right approach, but after a few days of fiddling I've come to the conclusion that s3distcp will not work for my use case. It is unable to flatten directory hierarchies, which I need because my source directories contain hour/mi

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-06 Thread Michael Armbrust
No, not yet. Only Hive UDAFs are supported. On Mon, Oct 6, 2014 at 2:18 AM, Pei-Lun Lee wrote: > Hi, > > Does spark sql currently support user-defined custom aggregation function > in scala like the way UDF defined with sqlContext.registerFunction? (not > hive UDAF) > > Thanks, > -- > Pei-Lun >

Re: Using FunSuite to test Spark throws NullPointerException

2014-10-06 Thread Matei Zaharia
Weird, it seems like this is trying to use the SparkContext before it's initialized, or something like that. Have you tried unrolling this into a single method? I wonder if you just have multiple versions of these libraries on your classpath or something. Matei On Oct 4, 2014, at 1:40 PM, Mari

Re: How to make ./bin/spark-sql work with hive?

2014-10-06 Thread Li HM
After disabling client-side authorization, and with nothing in SPARK_CLASSPATH, I am still getting a class-not-found error. hive.security.authorization.enabled = false ("Perform authorization checks on the client") Am I hitting a dead end? Please help. spark-sql> use mydb; OK Time taken: 4.5…

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Sunny Khatri
One difference I can find is that you may have different kernel functions for your training: in Spark you end up using a linear kernel, whereas for scikit-learn you are using an RBF kernel. That can explain the difference in the coefficients you are getting. On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais < adamantio…

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Adamantios Corais
Hi again, Finally, I found the time to play around with your suggestions. Unfortunately, I noticed some unusual behavior in the MLlib results, which is more obvious when I compare them against their scikit-learn equivalent. Note that I am currently using spark 0.9.2. Long story short: I find it di

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
sc.parallelize() distributes a list of data into a number of partitions, but a generator cannot be cut up and serialized automatically. If you can partition your generator, then you can try this: sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x)) For example, if you want to generate xrange…

Re: Larger heap leads to perf degradation due to GC

2014-10-06 Thread Arun Ahuja
We have used the strategy that you suggested, Andrew - using many workers per machine and keeping the heaps small (< 20gb). Using a large heap resulted in workers hanging or not responding (leading to timeouts). The same dataset/job for us will fail (most often due to akka disassociated or fetch

Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, I would like to ask if it is possible to use a generator that generates data bigger than the size of RAM across all the machines as the input for sc = SparkContext(), sc.parallelize(generator). I would like to create an RDD this way. When I try to create an RDD by sc.textFile(file) where the file ha…

Re: [SparkSQL] Function parity with Shark?

2014-10-06 Thread Yana Kadiyska
I have created https://issues.apache.org/jira/browse/SPARK-3814 https://issues.apache.org/jira/browse/SPARK-3815 Will probably try my hand at 3814, seems like a good place to get started... On Fri, Oct 3, 2014 at 3:06 PM, Michael Armbrust wrote: > Thanks for digging in! These both look like t

Re: Using GraphX with Spark Streaming?

2014-10-06 Thread Jayant Shekhar
Arko, It would be useful to know more details on the use case you are trying to solve. As Tobias wrote, Spark Streaming works on DStream, which is a continuous series of RDDs. Do check out performance tuning : https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tunin

RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Doesn't Spark keep track of the DAG lineage and start from where it stopped? Does it always have to start from the beginning of the lineage when the job fails? From: Massimiliano Tomassi [max.toma...@gmail.com] Sent: Monday, October 06, 2014 2:40 PM To: Jah…

Spark SQL - custom aggregation function (UDAF)

2014-10-06 Thread Pei-Lun Lee
Hi, Does spark sql currently support user-defined custom aggregation function in scala like the way UDF defined with sqlContext.registerFunction? (not hive UDAF) Thanks, -- Pei-Lun
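For contrast with the aggregate case, a sketch of the scalar-UDF registration the question refers to (function and table names are invented; in Spark 1.1 this was SQLContext.registerFunction). Per Michael's reply earlier in the digest, there is no equivalent hook for custom aggregation functions, only Hive UDAFs:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.registerFunction("strLen", (s: String) => s.length)

    // The scalar UDF can now be used in SQL, unlike a custom aggregation.
    sqlContext.sql("SELECT strLen(name) FROM people")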

Re: Stucked job works well after rdd.count or rdd.collect

2014-10-06 Thread Kevin Jung
additional logs here. 14/10/06 18:06:16 DEBUG HeartbeatReceiver: [actor] received message Heartbeat(11,[Lscala.Tuple2;@630ff939,BlockManagerId(11, cluster12, 48053, 0)) from Actor[akka.tcp://sparkExecutor@cluster12:57686/temp/$e] 14/10/06 18:06:16 DEBUG BlockManagerMasterActor: [actor] received me

Re: Dstream Transformations

2014-10-06 Thread Massimiliano Tomassi
From the Spark Streaming Programming Guide ( http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node ): *...output operations (like foreachRDD) have at-least-once semantics, that is, the transformed data may get written to an external entity more than once in…
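A hedged sketch of what that guidance usually implies for foreachRDD output code (the stream, the sink, and its connect/upsert/close methods are all hypothetical placeholders, not APIs from this thread): make the write idempotent, e.g. a keyed upsert, so a replayed batch overwrites rather than duplicates:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val sink = HypotheticalSink.connect()             // placeholder external store
        records.foreach(r => sink.upsert(r.key, r.value)) // idempotent: replays overwrite
        sink.close()
      }
    }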

RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are alive, does it then again store the data in Elasticsearch and then Flume, or does it only run the functions that store in Flume? Regards, Madhu Jahagirdar From…

Re: android + spark streaming?

2014-10-06 Thread Akhil Das
Hi, You can basically have a REST API to collect the data (possibly store it in a NoSQL DB or put it in Kafka etc.) from the Android devices, and then use Spark's Streaming feature to process the data. Thanks Best Regards On Sat, Oct 4, 2014 at 12:13 PM, ll wrote: > any comment/feedba…
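If the data does land in Kafka as suggested, the Spark side might look roughly like this sketch (topic name, ZooKeeper quorum, and group id are placeholders; KafkaUtils.createStream is the receiver-based API available in Spark 1.1):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("android-events"), Seconds(10))

    // (topic -> number of receiver threads); the DStream values are the messages.
    val events = KafkaUtils.createStream(ssc, "zk-host:2181", "android-group", Map("events" -> 1))
    events.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()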

Re: Dstream Transformations

2014-10-06 Thread Akhil Das
AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case if one worker node goes down, Spark will try to recompute the lost RDDs again with the workers that are alive. Thanks Best Regards On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu < madhu…

Re: Fixed:spark 1.1.0 - hbase 0.98.6-hadoop2 version - py4j.protocol.Py4JJavaError java.lang.ClassNotFoundException

2014-10-06 Thread serkan.dogan
Thanks MLnick, I fixed the error. First i compile spark with original version later I download this pom file to examples folder https://github.com/tedyu/spark/commit/70fb7b4ea8fd7647e4a4ddca4df71521b749521c Then i recompile with maven. mvn -Dhbase.profile=hadoop-provided -Phadoop-2.4 -Dha