Re: Fixed:spark 1.1.0 - hbase 0.98.6-hadoop2 version - py4j.protocol.Py4JJavaError java.lang.ClassNotFoundException

2014-10-06 Thread serkan.dogan
Thanks MLnick, I fixed the error. First I compiled Spark with the original version, then I downloaded this pom file into the examples folder: https://github.com/tedyu/spark/commit/70fb7b4ea8fd7647e4a4ddca4df71521b749521c Then I recompiled with Maven: mvn -Dhbase.profile=hadoop-provided -Phadoop-2.4

Re: Dstream Transformations

2014-10-06 Thread Akhil Das
AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case, if one worker node goes down, Spark will try to recompute the lost RDDs on the workers that are still alive. Thanks Best Regards On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu

RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Given that I have multiple worker nodes, when Spark reschedules the job on the worker nodes that are alive, does it store the data in Elasticsearch and then Flume all over again, or does it only rerun the functions that store to Flume? Regards, Madhu Jahagirdar

Re: Dstream Transformations

2014-10-06 Thread Massimiliano Tomassi
From the Spark Streaming Programming Guide ( http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node ): ...output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in
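The quoted passage is cut off, but its practical upshot is that, because output operations can run more than once after a failure, the writes they perform should be idempotent. A minimal PySpark sketch of that pattern (the Python streaming API is assumed to be available; store_record is a hypothetical upsert-style sink):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    def store_record(key, value):
        # Hypothetical idempotent sink: an upsert keyed on 'key', so a replay
        # caused by at-least-once output semantics simply overwrites.
        print(key, value)

    def write_partition(records):
        for key, value in records:
            store_record(key, value)

    sc = SparkContext(appName="idempotent-output-sketch")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    # Use the record itself as a deterministic key so replays hit the same row.
    keyed = lines.map(lambda line: (line, len(line)))

    # foreachRDD may re-run on recovery, so the write is made idempotent above.
    keyed.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

    ssc.start()
    ssc.awaitTermination()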

RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Doesn't Spark keep track of the DAG lineage and restart from where it stopped? Does it always have to start from the beginning of the lineage when the job fails? From: Massimiliano Tomassi [max.toma...@gmail.com] Sent: Monday, October 06, 2014 2:40 PM To:

Re: Using GraphX with Spark Streaming?

2014-10-06 Thread Jayant Shekhar
Arko, It would be useful to know more details on the use case you are trying to solve. As Tobias wrote, Spark Streaming works on DStream, which is a continuous series of RDDs. Do check out performance tuning :

Re: [SparkSQL] Function parity with Shark?

2014-10-06 Thread Yana Kadiyska
I have created https://issues.apache.org/jira/browse/SPARK-3814 https://issues.apache.org/jira/browse/SPARK-3815 Will probably try my hand at 3814, seems like a good place to get started... On Fri, Oct 3, 2014 at 3:06 PM, Michael Armbrust mich...@databricks.com wrote: Thanks for digging in!

Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, I would like to ask whether it is possible to use a generator that generates data bigger than the combined RAM across all the machines as the input for sc.parallelize(generator), where sc = SparkContext(). I would like to create an RDD this way. When I try to create an RDD with sc.textFile(file), where file

Re: Larger heap leads to perf degradation due to GC

2014-10-06 Thread Arun Ahuja
We have used the strategy that you suggested, Andrew - using many workers per machine and keeping the heaps small (< 20 GB). Using a large heap resulted in workers hanging or not responding (leading to timeouts). The same dataset/job for us will fail (most often due to akka disassociated or fetch

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
sc.parallelize() distributes a list of data into a number of partitions, but a generator cannot be split up and serialized automatically. If you can partition your generator, then you can try this: sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x)). For example, if you want to generate
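A runnable sketch of that pattern, with generate_partition as a hypothetical stand-in for whatever produces one slice of the data:

    from pyspark import SparkContext

    def generate_partition(i, items_per_partition=1000):
        # Hypothetical generator: yields one partition's worth of records.
        # In practice this would wrap the real data source for slice i.
        for j in range(items_per_partition):
            yield (i, j)

    sc = SparkContext(appName="generator-partitions-sketch")
    N = 100  # number of partitions

    # Only the small range(N) is shipped from the driver; the bulky data is
    # generated lazily inside each task, so it never has to fit in driver RAM.
    rdd = sc.parallelize(range(N), N).flatMap(lambda i: generate_partition(i))

    print(rdd.count())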

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Adamantios Corais
Hi again, I finally found the time to play around with your suggestions. Unfortunately, I noticed some unusual behavior in the MLlib results, which is more obvious when I compare them against their scikit-learn equivalents. Note that I am currently using Spark 0.9.2. Long story short: I find it

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Sunny Khatri
One difference I can see is that you may have different kernel functions for your training: in Spark you end up using a linear kernel, whereas with scikit-learn you are using the RBF kernel. That can explain the difference in the coefficients you are getting. On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais
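To make the comparison apples-to-apples, the scikit-learn side would also need a linear kernel; a small sketch with placeholder data:

    from sklearn.svm import SVC

    # Placeholder training data; the original poster's dataset is not shown here.
    X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    y = [1, 1, 0, 0]

    # The default SVC uses the RBF kernel, which has no weight vector
    # comparable to MLlib's linear model.
    rbf_model = SVC().fit(X, y)

    # A linear kernel exposes coef_, the analogue of the weight vector
    # returned by MLlib's linear SVM.
    linear_model = SVC(kernel='linear').fit(X, y)
    print(linear_model.coef_, linear_model.intercept_)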

Re: How to make ./bin/spark-sql work with hive?

2014-10-06 Thread Li HM
After disabling client-side authorization, and with nothing in SPARK_CLASSPATH, I am still getting the class-not-found error. <property> <name>hive.security.authorization.enabled</name> <value>false</value> <description>Perform authorization checks on the client</description> </property> Am I hitting a

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-06 Thread Michael Armbrust
No, not yet. Only Hive UDAFs are supported. On Mon, Oct 6, 2014 at 2:18 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, Does Spark SQL currently support user-defined custom aggregation functions in Scala, in the way a UDF can be defined with sqlContext.registerFunction? (not a Hive UDAF) Thanks, --
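A minimal PySpark sketch of the Hive-UDAF route (assuming a Hive-enabled Spark build; com.example.MyAverageUDAF and my_table are hypothetical names):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-udaf-sketch")
    hive_ctx = HiveContext(sc)

    # Register a Hive UDAF by class name; the jar containing the class is
    # assumed to be on the driver and executor classpath.
    hive_ctx.sql("CREATE TEMPORARY FUNCTION my_avg AS 'com.example.MyAverageUDAF'")

    # The registered aggregate can then be used like a built-in function.
    result = hive_ctx.sql("SELECT my_avg(value) FROM my_table")
    print(result.collect())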

Spark Streaming saveAsNewAPIHadoopFiles

2014-10-06 Thread Abraham Jacob
Hi All, I would really appreciate it if anyone in the community has implemented the saveAsNewAPIHadoopFiles method in Java, found in org.apache.spark.streaming.api.java.JavaPairDStream<K,V>. Any code snippet or online link would be greatly appreciated. Regards, Jacob

Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-06 Thread Bharat Venkat
TD has addressed this. It should be available in 1.2.0. https://issues.apache.org/jira/browse/SPARK-3495 On Thu, Oct 2, 2014 at 9:45 AM, maddenpj madde...@gmail.com wrote: I am seeing this same issue. Bumping for visibility. -- View this message in context:

Re: Spark Streaming saveAsNewAPIHadoopFiles

2014-10-06 Thread Sean Owen
Here's an example: https://github.com/OryxProject/oryx/blob/master/oryx-lambda/src/main/java/com/cloudera/oryx/lambda/BatchLayer.java#L131 On Mon, Oct 6, 2014 at 7:39 PM, Abraham Jacob abe.jac...@gmail.com wrote: Hi All, Would really appreciate from the community if anyone has implemented the

Is RDD partition index consistent?

2014-10-06 Thread Sung Hwan Chung
Is the RDD partition index you get when you call mapPartitionsWithIndex consistent under fault-tolerance conditions? I.e. 1. Say index is 1 for one of the partitions when you call data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1 2. The partition fails (maybe along with a bunch
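For reference, the same call as a runnable PySpark sketch (the snippet above is Scala); it simply tags every record with the partition index Spark passes in:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-index-sketch")
    data = sc.parallelize(range(10), 4)

    def tag_with_index(index, rows):
        # 'index' is the partition number Spark supplies; the question is whether
        # the same number is seen again if this partition is recomputed.
        for row in rows:
            yield (index, row)

    print(data.mapPartitionsWithIndex(tag_with_index).collect())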

Re: Strategies for reading large numbers of files

2014-10-06 Thread Nicholas Chammas
Unfortunately not. Again, I wonder if adding support targeted at this small files problem would make sense for Spark core, as it is a common problem in our space. Right now, I don't know of any other options. Nick On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn lan...@janrain.com wrote:

Re: Is RDD partition index consistent?

2014-10-06 Thread Liquan Pei
Hi, The partition information held by the Spark master is not updated after a stage failure. In the case of HDFS, Spark gets partition information from the InputFormat, and if a data node in HDFS goes down while Spark is performing the computation for a certain stage, that stage will fail and be resubmitted using the

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, Thank you for your advice. It really might work, but to specify my problem a bit more, think of my data as: one generated item is one parsed Wikipedia page. I am getting this generator from the parser and I don't want to save it to storage, but directly apply parallelize and

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-06 Thread Peng Cheng
Any suggestions? I'm thinking of submitting a feature request for mutable broadcast. Is it doable? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p15807.html Sent from the Apache Spark

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Davies Liu
On Mon, Oct 6, 2014 at 1:08 PM, jan.zi...@centrum.cz wrote: Hi, Thank you for your advice. It really might work, but to specify my problem a bit more, think of my data more like one generated item is one parsed wikipedia page. I am getting this generator from the parser and I don't want to

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Steve Lewis
Try a Hadoop custom InputFormat - I can give you some samples. While I have not tried this, an input split has only a length (which could be ignored if the format treats the input as non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole thing should work. Hadoop

Re: Spark Streaming saveAsNewAPIHadoopFiles

2014-10-06 Thread Abraham Jacob
Sean, Thanks a ton Sean... This is exactly what I was looking for. As mentioned in the code - // This horrible, separate declaration is necessary to appease the compiler @SuppressWarnings("unchecked") Class<? extends OutputFormat<?,?>> outputFormatClass = (Class<? extends OutputFormat<?,?>>)

lazy evaluation of RDD transformation

2014-10-06 Thread anny9699
Hi, I see that this type of question has been asked before; however, I am still a little confused about it in practice. Say there are two ways I could handle a series of RDD transformations before I do an RDD action; which way is faster? Way 1: val data = sc.textFile() val data1 = data.map(x =>

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
@Davies I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck on the master node. Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes to store the whole data set that needs to be parsed by extract_pages. I have my data on S3 and I kind
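One way to keep the parsing on the workers under those constraints is to distribute only the S3 object keys and have each task stream and parse its own objects; a sketch, where list_s3_keys and parse_pages_from_s3 are hypothetical helpers standing in for the S3 client and the gensim parser:

    from pyspark import SparkContext

    def list_s3_keys():
        # Hypothetical: return the S3 object keys holding the dump chunks.
        return ["chunk-%04d.xml.bz2" % i for i in range(100)]

    def parse_pages_from_s3(key):
        # Hypothetical: stream one object from S3 and yield parsed pages
        # (e.g. via boto plus gensim's extract_pages), without ever
        # materialising the whole dump on local disk.
        yield (key, "parsed-page-placeholder")

    sc = SparkContext(appName="s3-generator-sketch")
    keys = list_s3_keys()

    # Only the key names travel from the driver; the heavy parsing happens
    # inside each partition on the workers.
    pages = sc.parallelize(keys, len(keys)).flatMap(parse_pages_from_s3)
    print(pages.count())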

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Sean Owen
I think you mean that data2 is a function of data1 in the first example. I imagine that the second version is a little bit more efficient. But it is nothing to do with memory or caching. You don't have to cache anything here if you don't want to. You can cache what you like. Once memory for the
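A PySpark rendering of the two styles from the question (the original is Scala); both only define lineage, and nothing runs until the actions at the end:

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-eval-sketch")

    # Way 1: intermediate variables. Each one is just a handle on an RDD
    # definition; nothing is computed yet.
    data = sc.parallelize(range(100))
    data1 = data.map(lambda x: x + 1)
    data2 = data1.filter(lambda x: x % 2 == 0)

    # Way 2: the same pipeline chained in a single expression.
    chained = sc.parallelize(range(100)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

    # Only these actions trigger execution; both evaluate the same lineage.
    print(data2.count(), chained.count())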

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb is to definitely cache the RDD over which you need to do iterative analysis... For the rest of them, only cache if you have a lot of free memory! On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote: I think you mean that data2 is a function of data1 in the
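A small sketch of that rule of thumb: cache the one RDD an iterative loop keeps re-reading:

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-rule-of-thumb-sketch")

    # Reused on every iteration, so caching it pays off.
    points = sc.parallelize(range(1000)).map(float).cache()

    estimate = 0.0
    for _ in range(10):
        # Each pass re-reads 'points'; without cache() it would be recomputed
        # from its lineage every time.
        estimate = 0.5 * (estimate + points.mean())

    print(estimate)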

Spark Standalone on EC2

2014-10-06 Thread Ankur Srivastava
Hi, I have started a Spark cluster on EC2 using the Spark Standalone cluster manager, but Spark is trying to identify the worker threads using hostnames which are not publicly accessible. So when I try to submit jobs from Eclipse it fails. Is there some way Spark can use IP addresses instead

How to make Spark-sql join using HashJoin

2014-10-06 Thread Benyi Wang
I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in Cloudera's maven repository. After putting it on the classpath, I can use Spark SQL in my application. One issue is that I couldn't make the join a hash join. It gives CartesianProduct when I join two SchemaRDDs as follows:

Re: [ANN] SparkSQL support for Cassandra with Calliope

2014-10-06 Thread tian zhang
Rohit, Thank you very much for releasing the H2 version; now my app compiles fine and there is no more runtime error wrt. Hadoop 1.x classes or interfaces. Tian On Saturday, October 4, 2014 9:47 AM, Rohit Rai ro...@tuplejump.com wrote: Hi Tian, We have published a build against Hadoop 2.0

performance comparison: join vs cogroup?

2014-10-06 Thread freedafeng
For two large key-value data sets, if they have the same set of keys, what is the fastest way to join them into one? Suppose all keys are unique in each data set, and we only care about the keys that appear in both data sets. Input data I have: (k, v1) and (k, v2); data I want to get from the
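For reference, a small PySpark sketch of the two candidates; in the Scala implementation join is itself expressed with cogroup, so with unique keys on both sides the shuffle work is similar and the main difference is the output shape:

    from pyspark import SparkContext

    sc = SparkContext(appName="join-vs-cogroup-sketch")

    rdd1 = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])     # (k, v1)
    rdd2 = sc.parallelize([("a", 10), ("b", 20), ("d", 40)])  # (k, v2)

    # join keeps only keys present on both sides and yields (k, (v1, v2)) directly.
    print(sorted(rdd1.join(rdd2).collect()))   # [('a', (1, 10)), ('b', (2, 20))]

    # cogroup yields (k, (values1, values2)) for every key on either side, so the
    # "present in both" filter has to be applied by hand.
    cogrouped = rdd1.cogroup(rdd2).flatMap(
        lambda kv: [(kv[0], (v1, v2)) for v1 in kv[1][0] for v2 in kv[1][1]])
    print(sorted(cogrouped.collect()))         # same pairs as join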

Spark 1.1.0 with Hadoop 2.5.0

2014-10-06 Thread hmxxyy
Does Spark 1.1.0 work with Hadoop 2.5.0? The Maven build instructions only have command options up to Hadoop 2.4. Has anybody ever made it work? I am trying to run spark-sql with Hive 0.12 on top of Hadoop 2.5.0 but can't make it work. -- View this message in context: