Re: Should spark-ec2 get its own repo?

2015-07-14 Thread Matt Goodman
I concur with what Sean said about keeping the same JIRA. Frankly, it's a pretty small part of Spark and, as Nicholas mentioned, a reference implementation of getting Spark running on EC2. I can see wanting to grow it into a slightly more general tool that implements launchers for other compu

PySpark GroupByKey implementation question

2015-07-14 Thread Matt Cheah
Hi everyone, I was examining the PySpark implementation of groupByKey in rdd.py. I would like to submit a patch improving Scala RDD's groupByKey so that it has a similar robustness against large groups, as PySpark's implementation has logic to spill part of a single group to disk along the way. Its imp
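For context, a minimal sketch (not from the thread) of why the Scala-side groupByKey is fragile with large groups: every value for a key is buffered in memory on a single executor, with no spilling within one group.

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByKeyDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("groupByKey-demo").setMaster("local[2]"))
        // One "hot" key: Scala's groupByKey collects all of its values into a
        // single in-memory buffer on one executor, so a large enough group can
        // OOM; PySpark's implementation can spill part of such a group to disk.
        val pairs = sc.parallelize(1 to 1000000).map(i => ("hot", i))
        val grouped = pairs.groupByKey()
        grouped.mapValues(_.size).collect().foreach(println)
        sc.stop()
      }
    }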

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
Hi Burak, Thank you for the explanation! I will try to make a diagonal block matrix and report the results to you. A column- or row-based partitioner makes sense to me, because it is a direct analogy to column- or row-based data storage for matrices, as used in BLAS. Best regards, Alexander From

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread swetha
OK. Thanks a lot TD. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Does-RDD-checkpointing-store-the-entire-state-in-HDFS-tp7368p13231.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread Tathagata Das
BTW, this is more of a user-list kind of mail than a dev-list one; the dev-list is for Spark developers. On Tue, Jul 14, 2015 at 4:23 PM, Tathagata Das wrote: > 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming > periodically saves the state RDD (which is a snapshot of all the st

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread Tathagata Das
1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app with updateStateByKey will not start until you set the checkpoint directory. 2. The updateStateByKey
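A minimal sketch of the setup TD describes (stream source and checkpoint path hypothetical): the checkpoint directory must be set before updateStateByKey runs, and the state RDD is what gets snapshotted to it periodically.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("state-demo")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Required: an app using updateStateByKey will not start without it.
    ssc.checkpoint("hdfs:///tmp/state-checkpoints") // hypothetical path

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    // The full key -> count state RDD is what is periodically saved to HDFS.
    val counts = words.map((_, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum)
    }
    counts.print()
    ssc.start()
    ssc.awaitTermination()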

RestSubmissionClient Basic Auth

2015-07-14 Thread Joel Zambrano
Hi! We have a gateway with basic auth that relays calls to the head node in our cluster. Is adding support for basic auth the wrong approach? Should we use a relay proxy? I've seen the code, and it would probably require adding a few configs and appending the header on the GET and POST requests of
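Not a claim about what RestSubmissionClient currently supports; just a hedged sketch, with hypothetical credentials and endpoint, of what appending a Basic auth header on each request would amount to on the JVM (java.util.Base64 assumes Java 8):

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import java.util.Base64

    val user = "admin"        // hypothetical credentials
    val password = "secret"
    val token = Base64.getEncoder.encodeToString(
      s"$user:$password".getBytes(StandardCharsets.UTF_8))

    val conn = new URL("http://gateway:6066/v1/submissions/status/driver-0")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    // The change discussed would boil down to setting this header on each GET/POST.
    conn.setRequestProperty("Authorization", s"Basic $token")
    println(conn.getResponseCode)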

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread swetha
Hi TD, I have a question regarding sessionization using updateStateByKey. If near-real-time state needs to be maintained in a streaming application, what happens when the number of RDDs maintaining the state becomes very large? Does the state automatically get saved to HDFS and reloaded when needed, or do

Regarding sessionization with updateStateByKey

2015-07-14 Thread swetha
Hi, I have a question regarding sessionization using updateStateByKey. If near-real-time state needs to be maintained in a streaming application, what happens when the number of RDDs maintaining the state becomes very large? Does the state automatically get saved to HDFS and reloaded when needed? Thanks,

Re: Spark Core and ways of "talking" to it for enhancing application language support

2015-07-14 Thread Vasili I. Galchin
Thanks. On Tuesday, July 14, 2015, Shivaram Venkataraman wrote: > Both SparkR and the PySpark API call into the JVM Spark API (i.e. > JavaSparkContext, JavaRDD etc.). They use different methods (Py4J vs. the > R-Java bridge) to call into the JVM based on libraries available / features > supporte

Re: question related partitions of the DataFrame

2015-07-14 Thread Eugene Morozov
Gil, I'd say that a DataFrame is the result of a transformation of some other RDD. Your input RDD might contain strings and numbers, but as a result of the transformation you end up with an RDD that contains GenericRowWithSchema, which is what a DataFrame actually is. So, I'd say that a DataFrame is just a sort
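A quick illustration of that point (Spark 1.4-era API, assuming an existing SQLContext named sqlContext and a hypothetical JSON file): a DataFrame can be viewed as an RDD of Row objects plus a schema, and its partitioning is visible through that RDD.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    val df = sqlContext.read.json("people.json") // hypothetical input
    val rows: RDD[Row] = df.rdd       // the row-based view of the DataFrame
    println(df.schema)                // the schema carried alongside the rows
    println(rows.partitions.length)   // partition count of the underlying RDD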

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Sean Busbey
Point well taken. Allow me to walk back a little and move us in a more productive direction. I can personally empathize with the desire to have nightly builds. I'm a passionate advocate for tight feedback cycles between a project and its downstream users. I am personally involved in several projec

Re: Contribution and choice of language

2015-07-14 Thread Feynman Liang
I would suggest starting with some starter tasks

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Mark Hamstra
> > Please keep in mind that you are also "ASF people," as is the entire Spark > community (users and all)[4]. Phrasing things in terms of "us and them" by > drawing a distinction on "[they] get in a fight on our mailing list" is not > helpful. But they started it! A bit more seriously, my perspe

Re: BlockMatrix multiplication

2015-07-14 Thread Burak Yavuz
Hi Alexander, From your example code, using the GridPartitioner, you will have 1 column and 5 rows. When you perform an A^T*A multiplication, you will generate a separate GridPartitioner with 5 columns and 5 rows. Therefore you are observing a huge shuffle. If you would generate a diagonal-block
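A hedged sketch of the setup under discussion (block sizes hypothetical, sc an existing SparkContext): five blocks stacked vertically form a 5x1 grid, and multiplying the transpose by the original computes A^T*A.

    import java.util.Random
    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // Five 1000x1000 blocks stacked vertically: 5 row-blocks, 1 column-block.
    val blocks = sc.parallelize(0 until 5, 5)
      .map(i => ((i, 0), Matrices.rand(1000, 1000, new Random())))
    val a = new BlockMatrix(blocks, 1000, 1000)

    // a.transpose is a 1x5 grid; multiplying it by the 5x1 grid repartitions
    // both sides onto fresh GridPartitioners, hence the shuffle described above.
    val gram = a.transpose.multiply(a)
    println((gram.numRows(), gram.numCols())) // 1000 x 1000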

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Sean Busbey
Responses inline, with some liberties on ordering. On Sun, Jul 12, 2015 at 10:32 PM, Patrick Wendell wrote: > Hey Sean B, > > Would you mind outlining for me how we go about changing this policy - > I think it's outdated and doesn't make much sense. Ideally I'd like to > propose a vote to modify

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
Hi Rakesh, I am not interested in the particular case of A^T*A as such; this case is just a handy setup so I don't need to create another matrix and force the blocks to co-locate. Basically, I am trying to understand the effectiveness of BlockMatrix for multiplication of distributed matrices. It seems that I

Re: BlockMatrix multiplication

2015-07-14 Thread Rakesh Chalasani
Hi Alexander: Aw, I missed the 'cogroup' on BlockMatrix multiply! I stand corrected. Check https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala#L361 BlockMatrix multiply uses a custom partiti

Re: Spark Core and ways of "talking" to it for enhancing application language support

2015-07-14 Thread Shivaram Venkataraman
Both SparkR and the PySpark API call into the JVM Spark API (i.e. JavaSparkContext, JavaRDD etc.). They use different methods (Py4J vs. the R-Java bridge) to call into the JVM based on libraries available / features supported in each language. So for Haskell, one would need to see what is the best
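To make the bridge concrete: JavaSparkContext and JavaRDD are the JVM-side entry points that a Py4J- or R-bridge-based frontend marshals its calls to. A tiny sketch of that JVM surface (local master for illustration):

    import java.util.Arrays
    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}

    // The same objects PySpark reaches via Py4J and SparkR via the R-Java bridge.
    val jsc = new JavaSparkContext(
      new SparkConf().setAppName("bridge-demo").setMaster("local[2]"))
    val rdd: JavaRDD[Integer] = jsc.parallelize(Arrays.asList[Integer](1, 2, 3))
    println(rdd.count()) // a frontend marshals its arguments into calls like this
    jsc.stop()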

Re: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
Hi Rakesh, Thanks for the suggestion. Each block of the original matrix is in a separate partition. Each block of the transposed matrix is also in a separate partition. The partition numbers are the same for the blocks that undergo multiplication. Each partition is on a separate worker. Basically, I want to

Re: BlockMatrix multiplication

2015-07-14 Thread Rakesh Chalasani
BlockMatrix stores the data as key -> Matrix pairs, and multiply does a reduceByKey operation, aggregating matrices per key. Since you said each block resides in a separate partition, reduceByKey might effectively be shuffling all of the data. A better way to go about this is to allow multiple b
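One hedged way to read that suggestion (sizes hypothetical, sc an existing SparkContext): pack several key -> Matrix pairs into each partition before wrapping them in a BlockMatrix, so less data has to move when the per-key aggregation runs.

    import java.util.Random
    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // Five 1000x1000 blocks packed into 2 partitions instead of 5, so several
    // key -> Matrix pairs co-reside within each partition.
    val packed = sc.parallelize(0 until 5, 5)
      .map(i => ((i, 0), Matrices.rand(1000, 1000, new Random())))
      .coalesce(2)
    val aPacked = new BlockMatrix(packed, 1000, 1000)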

Re: Contribution and choice of language

2015-07-14 Thread Rakesh Chalasani
Here is a more specific MLlib-related umbrella JIRA for 1.5 that can help you get started: https://issues.apache.org/jira/browse/SPARK-8445?jql=text%20~%20%22mllib%201.5%22 Rakesh On Tue, Jul 14, 2015 at 6:52 AM Akhil Das wrote: > You can try to resolve some JIRA issues, to start with try out some ne

Re: problems with build of latest the master

2015-07-14 Thread Gil Vernik
I figured it out. I tried to build Spark configured to access OpenStack Swift, and hadoop-openstack.jar has the same issue as described here: https://github.com/apache/spark/pull/7090/commits So for those who want to build the Spark 1.5 master with OpenStack Swift support, just remove mockito

Re: problems with build of latest the master

2015-07-14 Thread Ted Yu
Looking at Jenkins, the master branch compiles. Can you try the following command? mvn -Phive -Phadoop-2.6 -DskipTests clean package What version of Java are you using? Cheers On Tue, Jul 14, 2015 at 2:23 AM, Gil Vernik wrote: > I just did checkout of the master and tried to build it with > >

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
You can try to resolve some JIRA issues; to start with, try out some newbie JIRAs. Thanks Best Regards On Tue, Jul 14, 2015 at 4:10 PM, srinivasraghavansr71 < sreenivas.raghav...@gmail.com> wrote: > I saw the contribution sections. As a new contributor, should I try to build > patches or can I ad

Re: Contribution and choice of language

2015-07-14 Thread srinivasraghavansr71
I saw the contribution sections. As a new contributor, should I try to build patches, or can I add some new algorithms to MLlib? I am comfortable with Python and R. Are they enough to contribute to Spark? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com

problems with build of latest the master

2015-07-14 Thread Gil Vernik
I just did a checkout of master and tried to build it with mvn -Dhadoop.version=2.6.0 -DskipTests clean package and got: [ERROR] /Users/gilv/Dev/Spark/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:117: error: cannot find symbol [ERROR] when(shuffleMemo

Re: question related partitions of the DataFrame

2015-07-14 Thread Gil Vernik
I see that the most recent code doesn't have RDDApi anymore, but I would still like to understand the logic of DataFrame partitioning. Does a DataFrame have its own partitions (is it a sort of RDD by itself), or does it depend on the partitions of the underlying RDD that was used to load the data? For exampl

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
This will get you started: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Mon, Jul 13, 2015 at 5:29 PM, srinivasraghavansr71 < sreenivas.raghav...@gmail.com> wrote: > Hello everyone, > I am interested in contributing to Apache Spark